Setting
The alternating procedure is as follows:
Policy Optimization:

$$\pi^* := \arg\max_{\pi} J_\mu^{\pi}(r) \quad \text{s.t.} \quad J_\mu^{\pi}(c) \le \beta, \qquad \Pi \leftarrow \Pi \cup \{\pi^*\} \tag{1}$$

Constraint Adjustment:

$$c^* := \arg\max_{c} \min_{\pi \in \Pi} J_\mu^{\pi}(c) \quad \text{s.t.} \quad J_\mu^{\pi_E}(c) \le \beta \tag{2}$$
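To make the alternation concrete, here is a minimal schematic sketch in Python. The callables `policy_opt` and `constraint_adjust` are hypothetical placeholders (not from the source) standing in for whatever constrained policy optimization and constraint-fitting solvers are actually used; only the loop structure, mirroring (1) and (2), is the point.

```python
def alternate(policy_opt, constraint_adjust, init_c, beta, max_iters=100):
    """Schematic sketch: alternate between (1) and (2) until a fixed point.

    policy_opt(c, beta)         -- hypothetical solver for (1): returns a policy
                                   maximizing J_mu^pi(r) s.t. J_mu^pi(c) <= beta
    constraint_adjust(Pi, beta) -- hypothetical solver for (2): returns a constraint
                                   maximizing min_{pi in Pi} J_mu^pi(c)
                                   s.t. J_mu^{pi_E}(c) <= beta
    """
    Pi = []                      # the growing policy set Π
    c, pi_star = init_c, None
    for _ in range(max_iters):
        new_pi = policy_opt(c, beta)                     # (1) constrained policy optimization
        new_c = constraint_adjust(Pi + [new_pi], beta)   # (2) constraint adjustment over Π ∪ {π*}
        if new_pi == pi_star and new_c == c:             # fixed point: π* and c* no longer change
            break
        Pi.append(new_pi)                                # Π ← Π ∪ {π*}
        pi_star, c = new_pi, new_c
    return pi_star, c, Pi
```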
Theorem 1 is as follows:
Theorem 1: Assuming there is a unique policy $\pi_E$ that achieves the expected cumulative reward $J_\mu^{\pi_E}(r)$, the alternation of the optimization procedures in (1) and (2) converges to a set of policies $\Pi$ such that the last policy $\pi^*$ added to $\Pi$ is equivalent to the expert policy $\pi_E$, in the sense that $\pi^*$ and $\pi_E$ generate the same trajectories.
Proof of Theorem 1
We prove this by contradiction, under three assumptions:
- (i) the reward function is non-negative;
- (ii) the space of policies is finite;
- (iii) there is a unique policy $\pi_E$ that achieves $J_\mu^{\pi_E}(r)$; in other words, no other policy achieves the same expected cumulative reward as $\pi_E$.
Note:
- the first assumption is not restrictive, since we can simply shift the reward function by a constant;
- the second assumption is reasonable, since policy parameters have finite precision, so the set of representable policies is also finite.
Let $\pi^*$ be the optimal policy and $c^*$ the optimal constraint function found at convergence.
By the constraint in (2), $J_\mu^{\pi_E}(c^*) \le \beta$. Moreover, since the procedure has converged, the last policy $\pi^*$ is feasible for (1) under $c^*$, so $J_\mu^{\pi^*}(c^*) \le \beta$. Because $\pi^* \in \Pi$, this gives

$$\max_{c} \min_{\pi \in \Pi} J_\mu^{\pi}(c) \ \text{ subject to } \ J_\mu^{\pi_E}(c) \le \beta \;=\; \min_{\pi \in \Pi} J_\mu^{\pi}(c^*) \;\le\; \beta.$$
For the sake of contradiction, assume that $\pi^*$ is not equivalent to $\pi_E$. We will derive a contradiction by showing that the optimal value of the objective in (2) is then strictly greater than $\beta$:
$$
\begin{aligned}
\max_{c} \min_{\pi \in \Pi} J_\mu^{\pi}(c) \ \text{ such that } \ J_\mu^{\pi_E}(c) \le \beta
\;&\ge\; \min_{\pi \in \Pi} J_\mu^{\pi}(\hat{c})
&& \text{where } \hat{c}(s,a) = \frac{r(s,a)\,\beta}{J_\mu^{\pi_E}(r)} \\
&=\; \min_{\pi \in \Pi} \frac{J_\mu^{\pi}(r)\,\beta}{J_\mu^{\pi_E}(r)}
&& \text{by substituting } \hat{c} \\
&>\; \frac{J_\mu^{\pi_E}(r)\,\beta}{J_\mu^{\pi_E}(r)}
&& \text{since } J_\mu^{\pi}(r) \ge J_\mu^{\pi_E}(r) \ \forall \pi \in \Pi \text{ by (1)}, \\
&&& \text{and } J_\mu^{\pi}(r) \ne J_\mu^{\pi_E}(r) \text{ by assumption (iii)} \\
&=\; \beta
\end{aligned}
$$
The key step is the choice of the specific constraint function $\hat{c}$. Intuitively, the constraint function is chosen to be parallel to the reward function (a positive rescaling of $r$), scaled so that $J_\mu^{\pi_E}(\hat{c}) \le \beta$ holds.
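As a quick check, using only the fact that $J_\mu^{\pi}$ is linear in its cost argument (it is an expected cumulative sum), substituting $\hat{c}$ simply rescales each policy's expected cumulative reward, and $\hat{c}$ is feasible for (2):

$$J_\mu^{\pi}(\hat{c}) \;=\; \frac{\beta}{J_\mu^{\pi_E}(r)}\, J_\mu^{\pi}(r), \qquad \text{and in particular} \qquad J_\mu^{\pi_E}(\hat{c}) \;=\; \frac{J_\mu^{\pi_E}(r)\,\beta}{J_\mu^{\pi_E}(r)} \;=\; \beta \;\le\; \beta,$$

which is what justifies the first inequality ($\ge$) in the chain above.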
We know that all policies in $\Pi$ achieve higher expected cumulative rewards than $\pi_E$,