Learning Soft Constraints From Constrained Expert Demonstrations

An algorithm that captures soft constraints from expert demonstrations.

Ashish Gaurav, Kasra Rezaee, Guiliang Liu, Pascal Poupart

Abstract

For the setting where the reward function is given and the constraints are unknown, the paper proposes a method to recover the constraints from expert data. Unlike previous attempts that recover hard constraints, the proposed method is capable of capturing cumulative soft constraints, which the agent satisfies on average per episode. The problem is solved by iteratively adjusting the constraint function through a constrained optimization procedure until the agent's behavior matches the expert's behavior.

Introduction

Inverse Constraint Learning (ICL)

ICL is defined as the task of extracting the implicit constraint functions associated with given (near-)optimal behavior. This work proposes a novel method to learn cumulative soft constraints from expert demonstrations, assuming the reward function is given. The difference is that hard constraints must hold exactly (e.g., for every trajectory), whereas soft constraints only need to hold in expectation, i.e., on average per episode.

Contributions

The contributions of this paper can be summarized as follows:

  1. A novel formulation and method for ICL - works with any state-action spaces.
  2. First to learn cumulative soft constraints - taking into account noise and possible violations in expert demonstrations

Background

Markov Decision Process

For Markov Decision Process (MDP), see here.

Reinforcement Learning

The objective of a standard RL procedure is to obtain an optimal policy that maximizes the expected long-term discounted reward:

$$\pi^* = \argmax_\pi \mathbb{E}_{s_0 \sim \mu(\cdot),\, a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim p(\cdot|s_t,a_t)}\left[ \sum^\infty_{t=0} \gamma^t r(s_t,a_t)\right] \eqqcolon J^\pi_\mu(r)$$
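As a concrete illustration (not from the paper), $J^\pi_\mu(r)$ can be approximated by averaging discounted returns over Monte Carlo rollouts. The sketch below assumes a hypothetical `env`/`policy`/`reward_fn` interface:

```python
import numpy as np

def estimate_return(env, policy, reward_fn, gamma=0.99, n_episodes=20, max_steps=500):
    """Monte Carlo estimate of J^pi_mu(r): the expected discounted return under policy pi."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()                      # s_0 ~ mu(.)
        total, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = policy(s)                    # a_t ~ pi(.|s_t)
            s_next, done = env.step(a)       # s_{t+1} ~ p(.|s_t, a_t); hypothetical interface
            total += discount * reward_fn(s, a)
            discount *= gamma
            if done:
                break
            s = s_next
        returns.append(total)
    return float(np.mean(returns))
```

The same estimator applies to any cumulative function of state-action pairs; swapping `reward_fn` for a constraint function gives an estimate of $J^\pi_\mu(c)$, which the constrained formulations below rely on.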

Constrained Reinforcement Learning

Constrained Reinforcement Learning's objective is similar to standard RL's, with the additional requirement that the expectation of each cumulative constraint function $c_i$ must not exceed its associated threshold $\beta_i$:

$$\pi^* = \argmax_\pi J^\pi_\mu(r) \text{ such that } J^\pi_\mu(c_i) \leq \beta_i \;\; \forall i$$

For simplicity, this work considers constrained RL with only a single constraint function.

The majority of previous works only consider hard constraints, which is acceptable in deterministic domains. However, in a stochastic domain the probability of violating a constraint can rarely be reduced to zero. One line of work extends the framework of maximum entropy inverse reinforcement learning to learn probabilistic constraints that hold with high probability in the expert demonstrations. In that approach, a constraint value is treated as a negative reward; since the probability of a trajectory is proportional to the exponential of its rewards, this effectively reduces the probability of trajectories with high constraint values. There are many other approaches, such as Bayesian ones; however, none of them correspond to soft constraints.

Approach

Inverse Constraint Learning: Two Phases

Given a reward $r$ and demonstrations $\mathcal{D}$, we want to output a constraint function $c$ such that the obtained optimal policy $\pi^*$ explains the behavior in $\mathcal{D}$.

The approach is based on the template of IRL, alternating between policy optimization and constraint adjustment.

We will first present a theoretical approach, followed by its adaptation to a practical algorithm.

The theoretical approach starts with an empty set of policies $\Pi = \emptyset$ and grows this set by alternating between the following two steps until convergence:

$$\text{Policy Optimization: } \pi^* \coloneqq \argmax_\pi J^\pi_\mu (r) \text{ s.t. } J^\pi_\mu (c) \leq \beta \text{ and } \Pi \leftarrow \Pi \cup \{\pi^*\} \quad (1)$$

$$\text{Constraint Adjustment: } c^* \coloneqq \argmax_c \min_{\pi \in \Pi} J^\pi_\mu(c) \text{ s.t. } J^{\pi_E}_\mu (c) \leq \beta \quad (2)$$

Equation (1) performs forward constrained RL to find an optimal policy $\pi^*$ for the constraint $c$. This optimal policy $\pi^*$ is then added to the set of policies $\Pi$.

This is followed by equation (2), which adjusts the constraint function $c$ to increase the constraint values of the policies in $\Pi$, while keeping the constraint value of $\pi_E$ bounded by $\beta$. In summary, the overarching idea behind equation (2) is to maximize the accumulated constraint value of the most feasible policy in $\Pi$.

To summarize this procedure: in each iteration, a new policy is found and then made infeasible by increasing its constraint value past the threshold $\beta$. By doing so, the procedure eventually converges to the expert policy. Intuitively, this is because all policies and trajectories except the expert's become infeasible in the long run. A rigorous proof is given in Theorem 1.

Theorem 1: Assuming there is a unique policy $\pi^E$ that achieves $J^{\pi^E}_\mu(r)$, the alternation of optimization procedures in (1) and (2) converges to a set of policies $\Pi$ such that the last policy $\pi^*$ added to $\Pi$ is equivalent to the expert policy $\pi^E$, in the sense that $\pi^*$ and $\pi^E$ generate the same trajectories.

The proof of theorem 1 can be found here.
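A high-level sketch of this alternation is shown below. The helpers `init_constraint`, `solve_constrained_rl` (forward constrained RL for step (1)), and `adjust_constraint` (step (2) or its practical approximation) are placeholders for illustration, not the authors' implementation:

```python
def inverse_constraint_learning(reward_fn, expert_trajs, beta, n_iters=50):
    """Alternate between policy optimization (1) and constraint adjustment (2)."""
    constraint_fn = init_constraint()      # e.g., a small neural network c_theta(s, a)
    policies = []                          # the growing set Pi
    for _ in range(n_iters):
        # (1) Forward constrained RL: maximize J(r) subject to J(c) <= beta.
        pi_star = solve_constrained_rl(reward_fn, constraint_fn, beta)
        policies.append(pi_star)
        # (2) Adjust c: raise the constraint values of the policies found so far,
        #     while keeping the expert's expected constraint value below beta.
        constraint_fn = adjust_constraint(policies, expert_trajs, beta)
    return constraint_fn, policies[-1]
```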

However, in practice there are challenges in implementing equations (1) and (2); in particular, the max-min optimization in (2) over a growing set of policies $\Pi$ is difficult to carry out exactly.

Instead, the paper uses an approximation, replacing (2) with a simpler optimization:

$$c^* \coloneqq \argmax_c J^{\pi_{mix}}_\mu(c) \text{ such that } J^{\pi_E}_\mu(c) \leq \beta \quad (3)$$

Instead of the max-min optimization over the constraint values of the policies in $\Pi$, the paper maximizes the constraint value of a mixture $\pi_{mix}$ of the policies in $\Pi$, where the mixture is a collection of optimal policies, each with an associated weight. Even though this modification loses the theoretical guarantee of convergence, the paper finds in experiments that the algorithm still converges empirically.
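For intuition, $J^{\pi_{mix}}_\mu(c)$ can be estimated as a weighted average of per-policy constraint values; the sketch below assumes a placeholder `estimate_constraint_value` routine (e.g., Monte Carlo rollouts as sketched earlier):

```python
import numpy as np

def mixture_constraint_value(policies, weights, estimate_constraint_value):
    """Estimate J^{pi_mix}_mu(c) as a weighted average over the policies in Pi."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()       # normalize the mixture weights
    values = [estimate_constraint_value(pi) for pi in policies]
    return float(np.dot(weights, values))
```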

Solving the Constrained Optimizations through the Penalty Method

Note that equations (1), (2), and (3) all belong to the following general class of optimization problems:

$$\min_y f(y) \text{ such that } g(y) \leq 0 \quad (4)$$

Most approaches formulate this problem from a Lagrangian perspective as a min-max problem that can be solved by gradient ascent-descent type algorithms. However, Lagrangian formulations are challenging in terms of empirical convergence and suffer from oscillatory behavior.

Therefore, the paper investigates the penalty method, which converts the constrained problem into an unconstrained one using a non-differentiable ReLU penalty.

After initializing $y = y_0$, the procedure repeats the following two steps:

  1. Find a feasible solution by repeatedly nudging $y$ in the direction of $-\nabla g(y)$ until $g(y) \leq 0$
  2. Optimize a soft loss that simultaneously minimizes $f(y)$ while keeping $y$ within the feasible region:

$$\min_y L_{\text{soft}}(y) \coloneqq f(y) + \lambda \,\text{ReLU}(g(y))$$
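A minimal sketch of this two-step procedure, using PyTorch autograd and treating `f` and `g` as differentiable callables, is given below; the function name and hyperparameters are illustrative assumptions, not the paper's settings:

```python
import torch

def penalty_method(f, g, y0, lam=10.0, lr=1e-2, feas_steps=200, opt_steps=1000):
    """Minimize f(y) subject to g(y) <= 0 via the ReLU penalty method."""
    y = y0.clone().requires_grad_(True)

    # Step 1: restore feasibility by nudging y in the direction of -grad g(y).
    for _ in range(feas_steps):
        if g(y).item() <= 0:
            break
        grad_g, = torch.autograd.grad(g(y), y)
        with torch.no_grad():
            y -= lr * grad_g

    # Step 2: minimize the soft loss f(y) + lambda * ReLU(g(y)).
    optimizer = torch.optim.Adam([y], lr=lr)
    for _ in range(opt_steps):
        optimizer.zero_grad()
        loss = f(y) + lam * torch.relu(g(y))
        loss.backward()
        optimizer.step()
    return y.detach()
```

In the constraint-adjustment step, for example, $y$ would be the parameters of $c$, with $f(y) = -J^{\pi_{mix}}_\mu(c)$ and $g(y) = J^{\pi_E}_\mu(c) - \beta$.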

Compared to Lagrangian-based approaches, the penalty method has the advantage of more stable empirical convergence: it requires only gradient descent on a single unconstrained objective rather than a min-max ascent-descent scheme, avoiding the oscillatory behavior mentioned above.

Improving Constraint Adjustment

Constraint adjustment, in other words, amounts to finding a decision boundary between the expert and the agent.

We can further improve constraint adjustment by noticing that an overlap between the expert and agent data used in computing the terms $J^{\pi_E}_\mu(c)$ and $J^{\pi_{mix}}_\mu(c)$ hampers convergence.

Consider the soft loss objective for (3)

$$\min_c L_{\text{soft}}(c) \coloneqq -J^{\pi_{mix}}_\mu(c) + \lambda \,\text{ReLU}(J^{\pi_E}_\mu(c) - \beta)$$

Note that the ReLU term vanishes or not depending on whether $J^{\pi_E}_\mu(c) - \beta \leq 0$. There are two cases:

  1. If $J^{\pi_E}_\mu(c) - \beta \leq 0$

    Assuming there are some expert-like trajectories in the agent data, we will end up increasing the constraint value across these trajectories.

    This behavior is undesirable, as it leads to $c$ becoming more infeasible.

  2. If $J^{\pi_E}_\mu(c) - \beta > 0$

    Note that there is a non-zero ReLU term.

    Assuming there are some expert-like trajectories in the agent data, taking the gradient of $L_{\text{soft}}(c)$ yields two contrasting gradient terms, one trying to increase and the other trying to decrease the constraint value across the same expert-like trajectories.

    This diminishes the effect of the ReLU term, so more iterations are needed for $c$ to converge.

To solve the aforementioned issues, the paper proposes two reweightings:

  1. Reweight the policies in $\pi_{mix}$

    Policies dissimilar to the expert policy are favoured more in the computation of $-J^{\pi_{mix}}_\mu(c)$

  2. Reweight the individual trajectories in the expectation $-J^{\pi_{mix}}_\mu(c)$

    Less or negligible weight is assigned to expert or expert-like trajectories

Both reweightings can be performed using a density estimator. The idea is to learn the density of expert-like state-action pairs and compute the negative log-probability (NLP) of any given trajectory's state-action pairs to determine whether the trajectory is expert-like or not. In practice, this is done by computing the mean and standard deviation of the NLP of the expert state-action pairs. At test time, we then check whether a given NLP lies within one standard deviation of that mean.
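A possible sketch of this expert-likeness test is shown below, using scikit-learn's `KernelDensity` as a stand-in for whatever density estimator the paper actually uses; the estimator choice, feature construction, and bandwidth are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_expert_likeness_test(expert_sa, bandwidth=0.2):
    """Fit a density on expert (state, action) pairs and calibrate an NLP interval.

    expert_sa: array of shape (N, d) holding concatenated state-action features.
    Returns a predicate that flags a state-action pair as expert-like if its
    negative log-probability (NLP) lies within one std. deviation of the expert mean.
    """
    kde = KernelDensity(bandwidth=bandwidth).fit(expert_sa)
    expert_nlp = -kde.score_samples(expert_sa)       # NLP of the expert pairs
    mean, std = expert_nlp.mean(), expert_nlp.std()

    def is_expert_like(sa):
        nlp = -kde.score_samples(np.atleast_2d(sa))[0]
        return abs(nlp - mean) <= std                # within one std. dev. of the mean
    return is_expert_like
```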