Setting
The alternating procedure is as follows:
Policy Optimization:

$$\pi^* := \arg\max_{\pi} J_\mu^{\pi}(r) \quad \text{s.t.} \quad J_\mu^{\pi}(c) \le \beta, \qquad \Pi \leftarrow \Pi \cup \{\pi^*\} \tag{1}$$

Constraint Adjustment:

$$c^* := \arg\max_{c} \min_{\pi \in \Pi} J_\mu^{\pi}(c) \quad \text{s.t.} \quad J_\mu^{\pi_E}(c) \le \beta \tag{2}$$
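To make the alternation concrete, here is a minimal schematic sketch in Python. The callables `policy_opt` and `constraint_adjust` are hypothetical placeholders (not from the source) standing in for whatever constrained policy optimization and constraint-fitting solvers are actually used; only the loop structure, mirroring (1) and (2), is the point.

```python
def alternate(policy_opt, constraint_adjust, init_c, beta, max_iters=100):
    """Schematic sketch: alternate between (1) and (2) until a fixed point.

    policy_opt(c, beta)         -- hypothetical solver for (1): returns a policy
                                   maximizing J_mu^pi(r) s.t. J_mu^pi(c) <= beta
    constraint_adjust(Pi, beta) -- hypothetical solver for (2): returns a constraint
                                   maximizing min_{pi in Pi} J_mu^pi(c)
                                   s.t. J_mu^{pi_E}(c) <= beta
    """
    Pi = []                      # the growing policy set Π
    c, pi_star = init_c, None
    for _ in range(max_iters):
        new_pi = policy_opt(c, beta)                     # (1) constrained policy optimization
        new_c = constraint_adjust(Pi + [new_pi], beta)   # (2) constraint adjustment over Π ∪ {π*}
        if new_pi == pi_star and new_c == c:             # fixed point: π* and c* no longer change
            break
        Pi.append(new_pi)                                # Π ← Π ∪ {π*}
        pi_star, c = new_pi, new_c
    return pi_star, c, Pi
```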
Theorem 1 is as follows:
Theorem 1: Assuming there is a unique policy $\pi_E$ that achieves the expected cumulative reward $J_\mu^{\pi_E}(r)$, the alternation of the optimization procedures in (1) and (2) converges to a set of policies $\Pi$ such that the last policy $\pi^*$ added to $\Pi$ is equivalent to the expert policy $\pi_E$, in the sense that $\pi^*$ and $\pi_E$ generate the same trajectories.
Proof of Theorem 1
We prove this by contradiction, under three assumptions:
- (i) the reward function is non-negative;
- (ii) the space of policies is finite;
- (iii) there is a unique policy $\pi_E$ that achieves $J_\mu^{\pi_E}(r)$; in other words, no other policy achieves the same expected cumulative reward as $\pi_E$.
Note:
- the first assumption is not restrictive, since we can simply shift the reward function by a constant;
- the second assumption is reasonable, since policy parameters have finite precision, so the set of representable policies is also finite.
Let $\pi^*$ be the optimal policy and $c^*$ the optimal constraint function found at convergence.
By the constraint in (2), $J_\mu^{\pi_E}(c^*) \le \beta$. Moreover, since the procedure has converged, the last policy $\pi^*$ is feasible for (1) under $c^*$, so $J_\mu^{\pi^*}(c^*) \le \beta$. Because $\pi^* \in \Pi$, this gives

$$\max_{c} \min_{\pi \in \Pi} J_\mu^{\pi}(c) \ \text{ subject to } \ J_\mu^{\pi_E}(c) \le \beta \;=\; \min_{\pi \in \Pi} J_\mu^{\pi}(c^*) \;\le\; \beta.$$
For the sake of contradiction, assume that $\pi^*$ is not equivalent to $\pi_E$. We will derive a contradiction by showing that the optimal value of the objective in (2) is then strictly greater than $\beta$:
$$
\begin{aligned}
\max_{c} \min_{\pi \in \Pi} J_\mu^{\pi}(c) \ \text{ such that } \ J_\mu^{\pi_E}(c) \le \beta
\;&\ge\; \min_{\pi \in \Pi} J_\mu^{\pi}(\hat{c})
&& \text{where } \hat{c}(s,a) = \frac{r(s,a)\,\beta}{J_\mu^{\pi_E}(r)} \\
&=\; \min_{\pi \in \Pi} \frac{J_\mu^{\pi}(r)\,\beta}{J_\mu^{\pi_E}(r)}
&& \text{by substituting } \hat{c} \\
&>\; \frac{J_\mu^{\pi_E}(r)\,\beta}{J_\mu^{\pi_E}(r)}
&& \text{since } J_\mu^{\pi}(r) \ge J_\mu^{\pi_E}(r) \ \forall \pi \in \Pi \text{ by (1)}, \\
&&& \text{and } J_\mu^{\pi}(r) \ne J_\mu^{\pi_E}(r) \text{ by assumption (iii)} \\
&=\; \beta
\end{aligned}
$$
The key step is the choice of the specific constraint function $\hat{c}$. Intuitively, the constraint function is chosen to be parallel to the reward function (a positive rescaling of $r$), scaled so that $J_\mu^{\pi_E}(\hat{c}) \le \beta$ holds.
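As a quick check, using only the fact that $J_\mu^{\pi}$ is linear in its cost argument (it is an expected cumulative sum), substituting $\hat{c}$ simply rescales each policy's expected cumulative reward, and $\hat{c}$ is feasible for (2):

$$J_\mu^{\pi}(\hat{c}) \;=\; \frac{\beta}{J_\mu^{\pi_E}(r)}\, J_\mu^{\pi}(r), \qquad \text{and in particular} \qquad J_\mu^{\pi_E}(\hat{c}) \;=\; \frac{J_\mu^{\pi_E}(r)\,\beta}{J_\mu^{\pi_E}(r)} \;=\; \beta \;\le\; \beta,$$

which is what justifies the first inequality ($\ge$) in the chain above.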
We know that all policies in $\Pi$ achieve higher expected cumulative rewards than $\pi_E$,