To prove this, we will need to use Hoeffding’s Lemma.
Hoeffding’s Lemma: For any random variable $X$ with $\mathbb{E}[X] = 0$ and $X \in [a, b]$, we have
$$\mathbb{E}\left[e^{sX}\right] \leq e^{\frac{1}{8}s^2(b-a)^2}$$
for any $s \geq 0$.
Proof of Hoeffding’s Inequality:
Assume that $\mathbb{E}[\Theta_n] = 0$ and that $\Theta_n \in [a, b]$. We will only show that
$$\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq \epsilon\right] \leq e^{-2N\epsilon^2/(b-a)^2}$$
since it is analogous to show that
$$\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \leq -\epsilon\right] \leq e^{-2N\epsilon^2/(b-a)^2}$$
Observe that for any $s \geq 0$, we have that
$$
\begin{aligned}
\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq \epsilon\right]
&= \Pr\left[s\,\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq s\epsilon\right] \\
&= \Pr\left[e^{s\frac{1}{N}\sum_{n=1}^{N}\Theta_n} \geq e^{s\epsilon}\right] \\
&\leq \mathbb{E}\left[e^{s\frac{1}{N}\sum_{n=1}^{N}\Theta_n}\right]e^{-s\epsilon} && \text{by Markov’s Inequality} \\
&= \prod_{n=1}^{N}\mathbb{E}\left[e^{\frac{s}{N}\Theta_n}\right]e^{-s\epsilon} && \text{r.v. } \Theta_n \text{ are independent} \\
&= \mathbb{E}\left[e^{\frac{s}{N}\Theta_n}\right]^{N}e^{-s\epsilon} && \text{r.v. } \Theta_n \text{ are identically distributed} \\
&\leq e^{s^2(b-a)^2/(8N)}\,e^{-s\epsilon} && \text{by Hoeffding’s Lemma applied to } \Theta_n/N \in [a/N, b/N]
\end{aligned}
$$
In particular, for $s = \frac{4N\epsilon}{(b-a)^2}$, we have that
$$\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq \epsilon\right] \leq e^{-2N\epsilon^2/(b-a)^2}$$
as desired. ■
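As a quick empirical check of the inequality we just proved, the following is a minimal Monte Carlo sketch (not part of the proof). The choices of $N$, $\epsilon$, the number of trials, and the uniform distribution of $\Theta_n$ on $[-0.5, 0.5]$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100             # samples per experiment
eps = 0.1           # deviation threshold epsilon
n_trials = 100_000  # number of Monte Carlo experiments
a, b = -0.5, 0.5    # Theta_n lies in [a, b] with E[Theta_n] = 0

# Theta_n = U - 0.5 with U ~ Uniform[0, 1], so E[Theta_n] = 0 and Theta_n in [-0.5, 0.5].
samples = rng.uniform(0.0, 1.0, size=(n_trials, N)) - 0.5
means = samples.mean(axis=1)

empirical = np.mean(means >= eps)                # empirical Pr[(1/N) sum Theta_n >= eps]
bound = np.exp(-2 * N * eps**2 / (b - a) ** 2)   # Hoeffding bound e^{-2N eps^2 / (b-a)^2}

print(f"empirical tail probability: {empirical:.5f}")
print(f"Hoeffding bound:            {bound:.5f}")
# The empirical tail probability should sit below the bound (up to Monte Carlo noise).
```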
UCB1
UCB1 uses a function that converts the set of average rewards at trial $t$ into a set of decision values, which are then used to determine which bandit machine to play. The gist of the UCB1 algorithm is as follows.
For problems such as multi-armed bandits, we would like to maintain confidence bounds on the reward $r(a)$ of each action $a$ based on observations:
$$\Pr\left[\mathrm{LCB}(a) < r(a) < \mathrm{UCB}(a)\right] \geq 1 - \delta$$
Using these confidence bounds, we can successively eliminate actions.
For example, suppose we have two actions $a_1$ and $a_2$. We will alternate playing $a_1$ and $a_2$ until $\mathrm{UCB}_T(a_2) < \mathrm{LCB}_T(a_1)$ or $\mathrm{UCB}_T(a_1) < \mathrm{LCB}_T(a_2)$. For our case, assume that we achieved $\mathrm{UCB}_T(a_2) < \mathrm{LCB}_T(a_1)$. Then, we eliminate $a_2$, since the probability of $a_2$ being optimal is less than $2\delta$, which we can make arbitrarily small.
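This elimination scheme can be simulated directly. Below is a minimal sketch assuming two Bernoulli arms with hypothetical means 0.7 and 0.4, a confidence level $\delta = 0.05$, and a Hoeffding-style interval of half-width $\sqrt{\log(1/\delta)/(2n)}$; it is an illustration of the idea, not a tuned implementation.

```python
import math
import random

random.seed(0)
p = {"a1": 0.7, "a2": 0.4}   # hypothetical true mean rewards, unknown to the learner
delta = 0.05

counts = {arm: 0 for arm in p}
sums = {arm: 0.0 for arm in p}

def bounds(arm):
    """Hoeffding-style confidence interval (LCB, UCB) for the empirical mean of an arm."""
    n = counts[arm]
    mean = sums[arm] / n
    width = math.sqrt(math.log(1.0 / delta) / (2 * n))
    return mean - width, mean + width

active = list(p)
rounds = 0
while len(active) > 1:
    # Alternate: play every surviving arm once per round.
    for arm in active:
        sums[arm] += 1.0 if random.random() < p[arm] else 0.0
        counts[arm] += 1
    rounds += 1
    lcb = {arm: bounds(arm)[0] for arm in active}
    ucb = {arm: bounds(arm)[1] for arm in active}
    # Eliminate any arm whose UCB falls below some other arm's LCB.
    active = [arm for arm in active
              if all(ucb[arm] >= lcb[other] for other in active if other != arm)]

print("surviving arm:", active[0], "after", rounds, "rounds")
```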
To analyze the efficacy of the UCB1 algorithm, we would like to express its results in terms of the cumulative regret $L_T$, which can be defined as
$$L_T = \sum_{t=1}^{T} l_t = \max_a \sum_{t=1}^{T} r_t(a) - \sum_{t=1}^{T} r_t(\pi)$$
where the regret $l_t$ is the difference in reward between
- the action $a^*$ taken by the best static policy, and
- the action taken by the agent’s policy $\pi$.
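To make the definition concrete, here is a small, purely illustrative computation of $L_T$ on a made-up reward table; the rewards and the agent’s action sequence below are hypothetical.

```python
import numpy as np

# rewards[t, a] = r_t(a): hypothetical rewards for T = 3 steps and 2 actions.
rewards = np.array([
    [1.0, 0.2],
    [0.8, 0.5],
    [0.9, 0.1],
])
agent_actions = [1, 0, 1]   # actions chosen by the agent's policy pi at t = 0, 1, 2

best_static = rewards.sum(axis=0).max()                               # max_a sum_t r_t(a)
agent_total = sum(rewards[t, a] for t, a in enumerate(agent_actions)) # sum_t r_t(pi)

L_T = best_static - agent_total
print("cumulative regret L_T =", L_T)   # 2.7 - 1.1 = 1.6 here
```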
Let’s consider Hoeffding’s Inequality for a random variable $X$ scaled to $[0, 1]$, with $t$ samples in total and $N_t(a)$ samples for action $a$. Then, in order to have the bound $\Pr\left[Q^*(a) > Q_t(a) + u(a)\right] \leq \delta$, we need to have
$$u(a) = \sqrt{\frac{-\log\delta}{2N_t(a)}} = \sqrt{\frac{2\log t}{N_t(a)}}$$
where the second equality holds for $\delta = t^{-4}$. The UCB1 algorithm assumes that the actual reward is close to this upper bound and selects the action that maximizes $Q_t(a) + u(a)$.
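Putting the pieces together, here is a minimal sketch of UCB1 that selects $\arg\max_a \left(Q_t(a) + \sqrt{2\log t / N_t(a)}\right)$; the Bernoulli arm means in `true_means` are hypothetical, and rewards are assumed to lie in $[0, 1]$.

```python
import math
import random

random.seed(1)
true_means = [0.3, 0.5, 0.7]   # hypothetical arm means, unknown to the algorithm
K = len(true_means)
T = 10_000

counts = [0] * K               # N_t(a): number of times each arm was played
values = [0.0] * K             # Q_t(a): empirical mean reward of each arm

total_reward = 0.0
for t in range(1, T + 1):
    if t <= K:
        a = t - 1              # play each arm once to initialize
    else:
        # Select the arm maximizing Q_t(a) + sqrt(2 log t / N_t(a)).
        a = max(range(K),
                key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
    r = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update
    total_reward += r

pseudo_regret = T * max(true_means) - total_reward
print("pseudo-regret after", T, "steps:", round(pseudo_regret, 1))
print("pull counts per arm:", counts)
```

Running this, the pull counts concentrate on the arm with the highest mean while suboptimal arms are sampled only rarely, which is the behavior that the regret bound below quantifies.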
Then, we have the following theorem.

Theorem: The expected cumulative regret for the UCB1 algorithm is