To prove this, we will need to use Hoeffding’s Lemma.
Hoeffding’s Lemma: For any random variable $X$ with $\mathbb{E}[X] = 0$ and $X \in [a, b]$, we have
$$\mathbb{E}\left[e^{sX}\right] \leq e^{\frac{1}{8}s^2(b-a)^2}$$
for any $s \geq 0$.
Proof of Hoeffding’s Inequality:
Assume that $\mathbb{E}[\Theta_n] = 0$ and that $\Theta_n \in [a, b]$. We will only show that
$$\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq \epsilon\right] \leq e^{-2N\epsilon^2/(b-a)^2}$$
since it is analogous to show that
$$\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \leq -\epsilon\right] \leq e^{-2N\epsilon^2/(b-a)^2}$$
Observe that for any $s \geq 0$, we have that
$$
\begin{aligned}
\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq \epsilon\right]
&= \Pr\left[s\,\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq s\epsilon\right] \\
&= \Pr\left[e^{s\frac{1}{N}\sum_{n=1}^{N}\Theta_n} \geq e^{s\epsilon}\right] \\
&\leq \mathbb{E}\left[e^{s\frac{1}{N}\sum_{n=1}^{N}\Theta_n}\right]e^{-s\epsilon} && \text{by Markov’s Inequality} \\
&= \prod_{n=1}^{N}\mathbb{E}\left[e^{\frac{s}{N}\Theta_n}\right]e^{-s\epsilon} && \text{r.v. } \Theta_n \text{ are independent} \\
&= \mathbb{E}\left[e^{\frac{s}{N}\Theta_n}\right]^{N}e^{-s\epsilon} && \text{r.v. } \Theta_n \text{ are identically distributed} \\
&\leq e^{s^2(b-a)^2/(8N)}\,e^{-s\epsilon} && \text{by Hoeffding’s Lemma applied to } \Theta_n/N \in [a/N, b/N]
\end{aligned}
$$
In particular, for $s = \frac{4N\epsilon}{(b-a)^2}$, we have that
$$\Pr\left[\frac{1}{N}\sum_{n=1}^{N}\Theta_n \geq \epsilon\right] \leq e^{-2N\epsilon^2/(b-a)^2}$$
as desired. ■
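As a quick empirical check of the inequality we just proved, the following is a minimal Monte Carlo sketch (not part of the proof). The choices of $N$, $\epsilon$, the number of trials, and the uniform distribution of $\Theta_n$ on $[-0.5, 0.5]$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100             # samples per experiment
eps = 0.1           # deviation threshold epsilon
n_trials = 100_000  # number of Monte Carlo experiments
a, b = -0.5, 0.5    # Theta_n lies in [a, b] with E[Theta_n] = 0

# Theta_n = U - 0.5 with U ~ Uniform[0, 1], so E[Theta_n] = 0 and Theta_n in [-0.5, 0.5].
samples = rng.uniform(0.0, 1.0, size=(n_trials, N)) - 0.5
means = samples.mean(axis=1)

empirical = np.mean(means >= eps)                # empirical Pr[(1/N) sum Theta_n >= eps]
bound = np.exp(-2 * N * eps**2 / (b - a) ** 2)   # Hoeffding bound e^{-2N eps^2 / (b-a)^2}

print(f"empirical tail probability: {empirical:.5f}")
print(f"Hoeffding bound:            {bound:.5f}")
# The empirical tail probability should sit below the bound (up to Monte Carlo noise).
```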
UCB1
UCB1 uses a function that converts the set of average rewards at trial $t$ into a set of decision values, which are then used to determine which bandit machine to play. The gist of the UCB1 algorithm is as follows.
For problems such as multi-armed bandits, we would like to maintain confidence bounds on the reward $r(a)$ of each action $a$ based on observations:
$$\Pr\left[\mathrm{LCB}(a) < r(a) < \mathrm{UCB}(a)\right] \geq 1 - \delta$$
Using these confidence bounds, we can successively eliminate actions.
For example, suppose we have two actions $a_1$ and $a_2$. We will alternate playing $a_1$ and $a_2$ until $\mathrm{UCB}_T(a_2) < \mathrm{LCB}_T(a_1)$ or $\mathrm{UCB}_T(a_1) < \mathrm{LCB}_T(a_2)$. For our case, assume that we achieved $\mathrm{UCB}_T(a_2) < \mathrm{LCB}_T(a_1)$. Then, we eliminate $a_2$, since the probability of $a_2$ being optimal is less than $2\delta$, which we can make arbitrarily small.
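This elimination scheme can be simulated directly. Below is a minimal sketch assuming two Bernoulli arms with hypothetical means 0.7 and 0.4, a confidence level $\delta = 0.05$, and a Hoeffding-style interval of half-width $\sqrt{\log(1/\delta)/(2n)}$; it is an illustration of the idea, not a tuned implementation.

```python
import math
import random

random.seed(0)
p = {"a1": 0.7, "a2": 0.4}   # hypothetical true mean rewards, unknown to the learner
delta = 0.05

counts = {arm: 0 for arm in p}
sums = {arm: 0.0 for arm in p}

def bounds(arm):
    """Hoeffding-style confidence interval (LCB, UCB) for the empirical mean of an arm."""
    n = counts[arm]
    mean = sums[arm] / n
    width = math.sqrt(math.log(1.0 / delta) / (2 * n))
    return mean - width, mean + width

active = list(p)
rounds = 0
while len(active) > 1:
    # Alternate: play every surviving arm once per round.
    for arm in active:
        sums[arm] += 1.0 if random.random() < p[arm] else 0.0
        counts[arm] += 1
    rounds += 1
    lcb = {arm: bounds(arm)[0] for arm in active}
    ucb = {arm: bounds(arm)[1] for arm in active}
    # Eliminate any arm whose UCB falls below some other arm's LCB.
    active = [arm for arm in active
              if all(ucb[arm] >= lcb[other] for other in active if other != arm)]

print("surviving arm:", active[0], "after", rounds, "rounds")
```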
To analyze the efficacy of the UCB1 algorithm, we would like to express its results in terms of the cumulative regret $L_T$, which can be defined as
$$L_T = \sum_{t=1}^{T} l_t = \max_a \sum_{t=1}^{T} r_t(a) - \sum_{t=1}^{T} r_t(\pi)$$
where the regret $l_t$ is the difference in reward between
- the action $a^*$ taken by the best static policy, and
- the action taken by the agent’s policy $\pi$.
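To make the definition concrete, here is a small, purely illustrative computation of $L_T$ on a made-up reward table; the rewards and the agent’s action sequence below are hypothetical.

```python
import numpy as np

# rewards[t, a] = r_t(a): hypothetical rewards for T = 3 steps and 2 actions.
rewards = np.array([
    [1.0, 0.2],
    [0.8, 0.5],
    [0.9, 0.1],
])
agent_actions = [1, 0, 1]   # actions chosen by the agent's policy pi at t = 0, 1, 2

best_static = rewards.sum(axis=0).max()                               # max_a sum_t r_t(a)
agent_total = sum(rewards[t, a] for t, a in enumerate(agent_actions)) # sum_t r_t(pi)

L_T = best_static - agent_total
print("cumulative regret L_T =", L_T)   # 2.7 - 1.1 = 1.6 here
```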
Let’s consider Hoeffding’s Inequality for a random variable $X$ scaled to $[0, 1]$, with $t$ samples in total and $N_t(a)$ samples for action $a$. Then, in order to have the bound $\Pr\left[Q^*(a) > Q_t(a) + u(a)\right] \leq \delta$, we need to have
$$u(a) = \sqrt{\frac{-\log\delta}{2N_t(a)}} = \sqrt{\frac{2\log t}{N_t(a)}}$$
where the second equality holds for $\delta = t^{-4}$. The UCB1 algorithm assumes that the actual reward is close to this upper bound and selects the action that maximizes $Q_t(a) + u(a)$.
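Putting the pieces together, here is a minimal sketch of UCB1 that selects $\arg\max_a \left(Q_t(a) + \sqrt{2\log t / N_t(a)}\right)$; the Bernoulli arm means in `true_means` are hypothetical, and rewards are assumed to lie in $[0, 1]$.

```python
import math
import random

random.seed(1)
true_means = [0.3, 0.5, 0.7]   # hypothetical arm means, unknown to the algorithm
K = len(true_means)
T = 10_000

counts = [0] * K               # N_t(a): number of times each arm was played
values = [0.0] * K             # Q_t(a): empirical mean reward of each arm

total_reward = 0.0
for t in range(1, T + 1):
    if t <= K:
        a = t - 1              # play each arm once to initialize
    else:
        # Select the arm maximizing Q_t(a) + sqrt(2 log t / N_t(a)).
        a = max(range(K),
                key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
    r = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update
    total_reward += r

pseudo_regret = T * max(true_means) - total_reward
print("pseudo-regret after", T, "steps:", round(pseudo_regret, 1))
print("pull counts per arm:", counts)
```

Running this, the pull counts concentrate on the arm with the highest mean while suboptimal arms are sampled only rarely, which is the behavior that the regret bound below quantifies.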
Then, we have the following theorem.

Theorem: The expected cumulative regret for the UCB1 algorithm is