Theorem 1 Proof

Simon Li

Setting

The alternating procedure is as follows:

$$
\begin{aligned}
\text{Policy Optimization: } & \pi^* \coloneqq \argmax_\pi J^\pi_\mu (r) \text{ s.t. } J^\pi_\mu (c) \leq \beta \text{ and } \Pi \leftarrow \Pi \cup \{\pi^*\} && (1) \\
\text{Constraint Adjustment: } & c^* \coloneqq \argmax_c \min_{\pi \in \Pi} J^\pi_\mu (c) \text{ s.t. } J^{\pi^E}_\mu (c) \leq \beta && (2)
\end{aligned}
$$
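
To make the alternation concrete, here is a minimal brute-force sketch (my own illustration, not part of the original note). It assumes each policy can be summarized by a fixed occupancy vector $d_\pi$, so that $J^\pi_\mu(f) = d_\pi \cdot f$ for any reward or constraint vector $f$, and it replaces both constrained optimizations with enumeration over small finite candidate sets; `policy_optimization`, `constraint_adjustment`, and all numbers are hypothetical.

```python
# A minimal brute-force sketch of the alternation in (1)-(2), for intuition only.
# Assumptions made here (not in the note above): each policy is summarized by a
# fixed occupancy vector d_pi, so J^pi_mu(f) = d_pi . f for any vector f (reward
# or constraint), and both optimizations are solved by enumerating small finite
# candidate sets. All names and numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n = 6          # number of (state, action) pairs
beta = 0.5     # constraint threshold

r = rng.uniform(0.0, 1.0, size=n)                  # non-negative reward (assumption 1)
candidate_policies = rng.dirichlet(np.ones(n), 8)  # finite policy space (assumption 2)
expert = candidate_policies[0]                     # an arbitrary stand-in for pi^E

# Finite family of candidate constraint functions for step (2); the proof's
# "parallel to r" constraint c_hat is included among the candidates.
c_hat = r * beta / (expert @ r)
candidate_constraints = np.vstack([rng.uniform(0.0, 1.0, size=(50, n)), c_hat])

def policy_optimization(c):
    """Step (1): maximize d.r over candidate policies subject to d.c <= beta."""
    feasible = [d for d in candidate_policies if d @ c <= beta]
    return max(feasible, key=lambda d: d @ r)

def constraint_adjustment(policy_set):
    """Step (2): maximize min_{pi in Pi} d_pi.c over candidate c, s.t. d_E.c <= beta."""
    feasible = [c for c in candidate_constraints if expert @ c <= beta]
    return max(feasible, key=lambda c: min(d @ c for d in policy_set))

Pi = []                    # the growing policy set
c_star = np.zeros(n)       # start from a vacuous constraint
for step in range(5):      # a few alternations of (1) and (2)
    pi_star = policy_optimization(c_star)
    Pi.append(pi_star)
    c_star = constraint_adjustment(Pi)
    print(f"iter {step}: J(r) of pi* = {pi_star @ r:.3f}, expert J(r) = {expert @ r:.3f}")
```

In the actual procedure, (1) would be solved with constrained policy optimization and (2) with a parameterized constraint model rather than enumeration; the sketch only shows how $\Pi$ grows and how $c^*$ is re-fit against it, not the convergence to $\pi^E$ claimed by Theorem 1.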

Theorem 1 is stated as follows:

Theorem 1: Assuming there is a unique policy $\pi^E$ that achieves $J^{\pi^E}_\mu (r)$, the alternation of optimization procedures in (1) and (2) converges to a set of policies $\Pi$ such that the last policy $\pi^*$ added to $\Pi$ is equivalent to the expert policy $\pi^E$, in the sense that $\pi^*$ and $\pi^E$ generate the same trajectories.

Proof of Theorem 1

We prove Theorem 1 by contradiction, under the following three assumptions:

  1. the reward function is non-negative
  2. the space of policies is finite
  3. there is a unique policy $\pi^E$ that achieves $J^{\pi^E}_\mu (r)$; in other words, no other policy achieves the same expected cumulative reward as $\pi^E$

Note:

Let $\pi^*$ be the optimal policy found and $c^*$ be the optimal constraint function found at convergence.

By the constraint in (2), $J^{\pi^E}_\mu (c^*) \leq \beta$.

By (2), we also have that

$$
\begin{aligned}
&\max_c \min_{\pi \in \Pi} J^\pi_\mu(c) \text{ subject to } J^{\pi^E}_\mu (c) \leq \beta \\
&= \min_{\pi \in \Pi} J^\pi_\mu (c^*) \\
&\leq \beta
\end{aligned}
$$
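
The final inequality is worth unpacking (this is my reading, assuming that at convergence $\pi^*$ is feasible for the policy optimization step (1) under the constraint $c^*$):

$$
\min_{\pi \in \Pi} J^\pi_\mu (c^*) \leq J^{\pi^*}_\mu (c^*) \leq \beta,
$$

where the first inequality holds because $\pi^* \in \Pi$ and the second is the feasibility constraint of (1).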

For the sake of contradiction, assume that $\pi^*$ is not equivalent to $\pi^E$. We then derive a contradiction by showing that the optimal objective value of Equation (2) is strictly greater than $\beta$.

$$
\begin{aligned}
&\max_c \min_{\pi \in \Pi} J^\pi_\mu(c) \text{ such that } J^{\pi^E}_\mu (c) \leq \beta && \\
&\geq \min_{\pi \in \Pi} J^\pi_\mu(\hat{c}) && \text{where } \hat{c}(s,a) = \frac{r(s,a)\,\beta}{J^{\pi^E}_\mu(r)} \\
&= \min_{\pi \in \Pi} J^\pi_\mu(r)\,\beta / J^{\pi^E}_\mu(r) && \text{by substituting } \hat{c} \\
&> J^{\pi^E}_\mu(r)\,\beta / J^{\pi^E}_\mu(r) = \beta && \text{since } J^\pi_\mu(r) \geq J^{\pi^E}_\mu(r) ~ \forall\pi\in\Pi \text{ by (1)} \\
& && \text{and } J^\pi_\mu(r) \neq J^{\pi^E}_\mu(r) \text{ by assumption (iii)}
\end{aligned}
$$

The key step is the choice of the specific constraint function $\hat{c}$. Intuitively, the constraint function is chosen to be parallel to the reward function while still satisfying $J^{\pi^E}_\mu(c) \leq \beta$.
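
The two annotated steps can be checked explicitly using only the linearity of $J^\pi_\mu(\cdot)$ in its argument, i.e. $J^\pi_\mu(\alpha f) = \alpha\, J^\pi_\mu(f)$ for $\alpha \geq 0$ (and implicitly assuming $J^{\pi^E}_\mu(r) > 0$ so that $\hat{c}$ is well defined):

$$
\begin{aligned}
J^{\pi^E}_\mu(\hat{c}) &= \frac{\beta}{J^{\pi^E}_\mu(r)}\, J^{\pi^E}_\mu(r) = \beta \leq \beta, && \text{so $\hat{c}$ is feasible for (2),} \\
J^{\pi}_\mu(\hat{c}) &= \frac{\beta}{J^{\pi^E}_\mu(r)}\, J^{\pi}_\mu(r) \quad \forall \pi \in \Pi, && \text{which gives the substitution step above.}
\end{aligned}
$$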

We know that all policies in $\Pi$ achieve higher expected cumulative rewards than $\pi^E$,