Upper Confidence Bound

I derive Hoeffding's Inequality and analyze the UCB1 algorithm.

Simon Li

Upper Confidence Bound

To understand UCB1, we first need Hoeffding’s Inequality.

Hoeffding’s Inequality

Theorem: Let $\Theta_1, \dots, \Theta_N$ be a sequence of i.i.d. random variables with mean $\mathbb{E}[\Theta]$ and range $[a,b]$. Then, for any $\epsilon > 0$, we have that

$$ \Pr\left[ \left| \frac{1}{N} \sum^N_{n=1} \Theta_n - \mathbb{E}[\Theta] \right| \geq \epsilon \right] \leq 2e^{-2N\epsilon^2 / (b-a)^2} $$
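As a quick sanity check (a minimal sketch of my own, not part of the derivation; the sample sizes and threshold are arbitrary), we can compare the empirical tail probability against the bound for Bernoulli(0.5) samples, which have range $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100           # samples per mean estimate
trials = 100_000  # Monte Carlo repetitions
eps = 0.1         # deviation threshold

# Theta_n ~ Bernoulli(0.5), so E[Theta] = 0.5 and range [a, b] = [0, 1]
samples = rng.integers(0, 2, size=(trials, N))
deviations = np.abs(samples.mean(axis=1) - 0.5)

empirical = (deviations >= eps).mean()
bound = 2 * np.exp(-2 * N * eps**2 / (1 - 0)**2)

print(f"empirical Pr[|mean - 0.5| >= {eps}]: {empirical:.4f}")
print(f"Hoeffding bound:                    {bound:.4f}")
```

The empirical frequency comes out well under the bound of $2e^{-2} \approx 0.271$, as it should, since Hoeffding's Inequality is not tight for any particular distribution.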

To prove this, we need Hoeffding’s Lemma.

Hoeffding’s Lemma: For any random variable $X$ with $\mathbb{E}[X] = 0$ and $X \in [a,b]$, we have

$$ \mathbb{E}[e^{sX}] \leq e^{\frac{1}{8}s^2(b-a)^2} \text{ for any } s \geq 0 $$
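For a concrete instance (a hypothetical check, not part of the proof): take $X = \pm\frac{1}{2}$ with probability $\frac{1}{2}$ each, so $\mathbb{E}[X] = 0$ and $b - a = 1$. Then $\mathbb{E}[e^{sX}] = \cosh(s/2)$, which indeed stays below $e^{s^2/8}$:

```python
import numpy as np

# X is +-1/2 with probability 1/2 each: E[X] = 0, range [a, b] = [-1/2, 1/2]
for s in [0.5, 1.0, 2.0, 5.0]:
    mgf = np.cosh(s / 2)             # E[e^{sX}] computed exactly
    bound = np.exp(s**2 * 1**2 / 8)  # Hoeffding's Lemma with (b - a) = 1
    print(f"s={s}: E[e^sX]={mgf:.4f} <= bound={bound:.4f}")
```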

Proof of Hoeffding’s Inequality:

Assume that $\mathbb{E}[\Theta] = 0$ and that $\Theta_n \in [a,b]$. We will only show that

$$ \Pr\left[ \frac{1}{N} \sum^N_{n=1} \Theta_n \geq \epsilon \right] \leq e^{-2N\epsilon^2/(b-a)^2} $$

as it is analogous to show that

$$ \Pr\left[ \frac{1}{N} \sum^N_{n=1} \Theta_n \leq -\epsilon \right] \leq e^{-2N\epsilon^2/(b-a)^2} $$

Observe that for any $s \geq 0$, we have that

$$ \begin{aligned} \Pr\left[ \frac{1}{N} \sum^N_{n=1} \Theta_n \geq \epsilon \right] &= \Pr\left[ s\frac{1}{N}\sum^N_{n=1}\Theta_n \geq s\epsilon \right] \\ &= \Pr\left[ e^{s\frac{1}{N}\sum^N_{n=1}\Theta_n} \geq e^{s\epsilon} \right] \\ &\leq \mathbb{E}\left[ e^{s\frac{1}{N}\sum^N_{n=1}\Theta_n} \right] e^{-s\epsilon} && \text{by Markov's Inequality} \\ &= \prod^N_{n=1}\mathbb{E}\left[e^{\frac{s\Theta_n}{N}}\right]e^{-s\epsilon} && \text{r.v. $\Theta_n$ are ind.} \\ &= \mathbb{E}\left[e^{\frac{s\Theta_1}{N}}\right]^N e^{-s\epsilon} && \text{r.v. $\Theta_n$ are i.d.} \\ &\leq e^{s^2(b-a)^2/(8N)} e^{-s\epsilon} && \text{by Hoeffding's Lemma} \end{aligned} $$

In particular, for $s = \frac{4N\epsilon}{(b-a)^2}$, we have that

$$ \Pr\left[\frac{1}{N}\sum^N_{n=1}\Theta_n \geq \epsilon \right] \leq e^{-2N\epsilon^2/(b-a)^2} $$
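To verify, substituting this $s$ into the exponent $\frac{s^2(b-a)^2}{8N} - s\epsilon$ (which this choice of $s$ minimizes) gives

$$ \frac{16N^2\epsilon^2}{(b-a)^4} \cdot \frac{(b-a)^2}{8N} - \frac{4N\epsilon^2}{(b-a)^2} = \frac{2N\epsilon^2}{(b-a)^2} - \frac{4N\epsilon^2}{(b-a)^2} = -\frac{2N\epsilon^2}{(b-a)^2} $$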

as desired. $\blacksquare$

UCB1

UCB1 is an algorithm that converts the set of average rewards at trial $t$ into a set of decision values, which are then used to determine which bandit machine to play. The gist of the UCB1 algorithm is as follows.

For problems such as multi-armed bandits, we would like to maintain confidence bounds on the reward $r(a)$ of each action $a$ based on observations:

$$ \Pr[LCB(a) < r(a) < UCB(a)] \geq 1 - \delta $$

Using these confidence bounds, we can successively eliminate actions.

For example, suppose we have two actions $a_1$ and $a_2$. We will alternate playing $a_1$ and $a_2$ until $UCB_T(a_2) < LCB_T(a_1)$ or $UCB_T(a_1) < LCB_T(a_2)$. For our case, assume that we achieved $UCB_T(a_2) < LCB_T(a_1)$. Then, we eliminate $a_2$: for $a_2$ to actually be optimal, at least one of the two confidence intervals must fail, so the probability of $a_2$ being the optimal action is less than $2\delta$, which we can manipulate to be small. A sketch of this scheme follows.
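Here is a minimal sketch of the elimination loop (the Bernoulli arms, their means, and $\delta$ are all hypothetical choices of mine, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
means = [0.7, 0.4]   # hypothetical true rewards of a1 and a2
delta = 0.01

counts = np.zeros(2)
sums = np.zeros(2)

while True:
    # alternate playing a1 and a2
    for a in (0, 1):
        sums[a] += rng.random() < means[a]   # Bernoulli reward
        counts[a] += 1

    avg = sums / counts
    # Hoeffding radius: Pr[|mean - r(a)| >= u] <= 2*exp(-2*n*u^2)
    radius = np.sqrt(np.log(1 / delta) / (2 * counts))
    ucb, lcb = avg + radius, avg - radius

    if ucb[1] < lcb[0]:   # a2's best case is worse than a1's worst case
        print(f"eliminate a2 after {int(counts.sum())} plays")
        break
    if ucb[0] < lcb[1]:
        print(f"eliminate a1 after {int(counts.sum())} plays")
        break
```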

To analyze the efficacy of the UCB1 algorithm, we would like to express its results in terms of the cumulative regret $L_T$, which can be defined as

$$ L_T = \sum^T_{t=1} l_t = \max_a \sum^T_{t=1} r_t(a) - \sum^T_{t=1} r_t(\pi) $$

where the per-step regret $l_t$ is the difference in reward between the optimal action and the action chosen by our policy $\pi$ at step $t$.

Let’s consider Hoeffding’s Inequality for a random variable $X$ scaled to $[0,1]$, with $t$ samples in total and $N_t(a)$ samples for action $a$. Then, in order to have the bound $\Pr[Q^*(a) > Q_t(a) + u(a)] \leq \delta$, we need to have

$$ u(a) = \sqrt{\frac{-\log\delta}{2N_t(a)}} = \sqrt{\frac{2\log t}{N_t(a)}} $$

where the second equality holds for $\delta = t^{-4}$. The UCB1 algorithm assumes the actual reward is close to the upper bound (optimism in the face of uncertainty) and selects the action maximizing $Q_t(a) + u(a)$.
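Putting this together, here is a minimal UCB1 sketch (the Bernoulli arm means and horizon are hypothetical; it also tracks the cumulative regret $L_T$ defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm rewards
T = 10_000

counts = np.zeros(len(means))
sums = np.zeros(len(means))
regret = 0.0

for t in range(1, T + 1):
    if t <= len(means):
        a = t - 1                               # play each arm once first
    else:
        q = sums / counts                       # average reward Q_t(a)
        u = np.sqrt(2 * np.log(t) / counts)     # bonus u(a) from delta = t^{-4}
        a = int(np.argmax(q + u))               # optimism: maximize Q_t(a) + u(a)

    reward = rng.random() < means[a]            # Bernoulli reward
    sums[a] += reward
    counts[a] += 1
    regret += means.max() - means[a]            # expected per-step regret l_t

print(f"cumulative regret L_T = {regret:.1f} over T = {T} plays")
```

Because the bonus shrinks as $\sqrt{2\log t / N_t(a)}$, suboptimal arms are pulled only logarithmically often, which is exactly what the regret bound below captures.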

Theorem: The expected cumulative regret of the UCB1 algorithm satisfies

$$ \lim_{t\to\infty} L_t \leq 8 \log t \sum_{a \,:\, \Delta_a > 0} \frac{1}{\Delta_a} $$

where $\Delta_a = \max_{a^\prime} r(a^\prime) - r(a)$ is the gap between the optimal action and action $a$.

Some useful notes: https://webdocs.cs.ualberta.ca/~games/go/seminar/notes/2007/slides_ucb.pdf