Learning Soft Constraints From Constrained Expert Demonstrations
An algorithm that recovers soft constraints from expert demonstrations.
Ashish Gaurav, Kasra Rezaee, Guiliang Liu, Pascal Poupart
Abstract
For the setting where the reward function is given but the constraints are unknown, the paper proposes a method to recover constraints that satisfactorily explain the expert data. Unlike previous attempts that recover hard constraints, the proposed method captures cumulative soft constraints, which the agent satisfies on average per episode. The problem is solved by iteratively adjusting the constraint function through a constrained optimization procedure until the agent's behavior matches the expert's behavior.
Introduction
Inverse Constraint Learning (ICL)
ICL is the task of extracting the implicit constraint functions associated with given (near-)optimal behaviors. This work proposes a novel method to learn cumulative soft constraints from expert demonstrations, assuming the reward function is given. The difference between hard constraints and soft constraints is:
- Hard Constraints - the constraint is always satisfied for any individual trajectory
- Soft Constraints - the constraints are not necessarily satisfied in every trajectory, but rather in expectation. There may therefore be individual trajectories in which the constraint is violated, as long as it holds on average (the two notions are contrasted formally below).
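Concretely, using standard notation (discount factor $\gamma$, constraint function $c$, threshold $\beta$; this is a rough sketch rather than the paper's exact display), the two notions can be contrasted as

$$\text{hard:}\;\; \sum_{t} \gamma^t c(s_t, a_t) \le \beta \;\;\text{for every trajectory,} \qquad \text{soft:}\;\; \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t} \gamma^t c(s_t, a_t)\right] \le \beta.$$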
Contributions
The contributions of this paper can be summarized as follows:
- A novel formulation and method for ICL that works with arbitrary state-action spaces.
- The first method to learn cumulative soft constraints, accounting for noise and possible constraint violations in the expert demonstrations.
Background
Markov Decision Process
For background on Markov Decision Processes (MDPs), see here.
Reinforcement Learning
The objective of standard RL is to obtain an optimal policy that maximizes the expected long-term discounted reward.
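Roughly, in standard notation (reward $r$, discount factor $\gamma$, trajectories $\tau$ sampled by running policy $\pi$), this objective is

$$\max_{\pi}\; J^{\pi}(r) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right].$$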
Constrained Reinforcement Learning
Constrained Reinforcement Learning's objective is similar to standard RL's, with the addition that the expected cumulative value of each constraint function $c_i$ must not exceed its associated threshold $\beta_i$.
For simplicity, this work considers constrained RL with a single constraint function.
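In the same notation, the constrained RL problem is roughly (the paper's exact formulation may differ in minor details)

$$\max_{\pi}\; J^{\pi}(r) \quad \text{s.t.} \quad J^{\pi}(c_i) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, c_i(s_t, a_t)\right] \le \beta_i \;\; \forall i.$$

With a single constraint function $c$ and threshold $\beta$, this reduces to maximizing $J^{\pi}(r)$ subject to $J^{\pi}(c) \le \beta$.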
Related Work
The majority of previous works consider only hard constraints, which is reasonable in deterministic domains. However, in stochastic domains the probability of violating a constraint can rarely be reduced to zero. One line of work extends the maximum entropy inverse reinforcement learning framework to learn probabilistic constraints that hold with high probability in the expert demonstrations. In that approach, a constraint value is treated as a negative reward; since the probability of a trajectory is proportional to the exponential of its return, this effectively reduces the probability of trajectories with high constraint values. There are other approaches, such as Bayesian ones, but none of them correspond to soft constraints.
Approach
Inverse Constraint Learning: Two Phases
Given a reward function $r$ and expert demonstrations $\mathcal{D}$, we want to output a constraint function $c$ such that the optimal policy obtained under $r$ and $c$ explains the behavior in $\mathcal{D}$.
The approach is based on the template of IRL, alternating between
- a policy optimization phase
- a constraint adjustment phase (the counterpart of IRL's reward adjustment, since here the reward is given and the constraint is learned)
We will first show a theoretical approach, followed by its adaptation to a practical algorithm.
The theoretical approach starts with an empty set of policies $\Pi$ and grows this set by alternating between equations (1) and (2) until convergence.
Equation (1) performs forward constrained RL to find an optimal policy for the current constraint $c$. This optimal policy is then added to the set of policies $\Pi$.
This is followed by equation (2), which adjusts the constraint function to increase the constraint values of the policies in $\Pi$, while keeping the constraint value of the expert policy $\pi^E$ bounded by $\beta$. In summary, the overarching idea behind equation (2) is to maximize the accumulated constraint value of the most feasible policy in $\Pi$.
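Based on the descriptions above, the alternation can be written roughly as follows (a paraphrase of equations (1) and (2), not their exact typeset form):

$$\pi \leftarrow \arg\max_{\pi}\; J^{\pi}(r) \;\;\text{s.t.}\;\; J^{\pi}(c) \le \beta, \qquad \Pi \leftarrow \Pi \cup \{\pi\} \quad (1)$$

$$c \leftarrow \arg\max_{c}\; \min_{\pi\in\Pi}\; J^{\pi}(c) \;\;\text{s.t.}\;\; J^{\pi^E}(c) \le \beta \quad (2)$$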
To summarize this procedure: in each iteration, a new policy is found and then made infeasible by increasing its constraint value past the threshold $\beta$. By doing so, the sequence of policies eventually converges to the expert policy. Intuitively, this is because all policies and trajectories except the expert's become infeasible in the long run. A rigorous proof is given in Theorem 1.
Theorem 1: Assuming there is a unique policy that achieves the expert's expected reward $J^{\pi^E}(r)$, the alternation of the optimization procedures in (1) and (2) converges to a set of policies $\Pi$ such that the last policy $\pi$ added to $\Pi$ is equivalent to the expert policy $\pi^E$, in the sense that $\pi$ and $\pi^E$ generate the same trajectories.
The proof of Theorem 1 can be found here.
However, in practice there are challenges when implementing equations (1) and (2):
- We do not have the expert policy $\pi^E$, only expert trajectories sampled from it
- $\Pi$ could grow very large before convergence is achieved
- Convergence is not guaranteed, and the procedure may converge prematurely
Instead, the paper uses an approximation that replaces (2) with a simpler optimization, referred to below as (3).
Instead of the max-min optimization over the constraint values of the policies in $\Pi$, the paper maximizes the constraint value of a mixture of the policies in $\Pi$, where the mixture is a collection of the optimal policies found so far, each with an associated weight. Although this modification loses the theoretical guarantee of convergence, the paper finds that the algorithm converges empirically in its experiments.
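Concretely, (3) takes roughly the form (again a paraphrase rather than the exact display)

$$c \leftarrow \arg\max_{c}\; J^{\pi_{\text{mix}}}(c) \;\;\text{s.t.}\;\; J^{\pi^E}(c) \le \beta, \quad (3)$$

where $\pi_{\text{mix}}$ denotes the weighted mixture of the policies in $\Pi$.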
Solving the Constrained Optimizations through the Penalty Method
Note that equations (1), (2), and (3) all belong to the following general class of optimization problems:
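In generic notation, with $\theta$ standing for the decision variable (a policy or a constraint function), this class can be sketched as

$$\min_{\theta}\; f(\theta) \quad \text{s.t.} \quad g(\theta) \le 0,$$

where, for example, (3) fits this template with $f(c) = -J^{\pi_{\text{mix}}}(c)$ and $g(c) = J^{\pi^E}(c) - \beta$.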
Most approaches formulate this problem from a Lagrangian perspective as a min-max problem that can be solved by gradient ascent-descent type algorithms. However, Lagrangian formulations are challenging in terms of empirical convergence and often suffer from oscillatory behavior.
Therefore, the paper investigates the penalty method, which converts a constrained problem into an unconstrained one using a non-differentiable ReLU.
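That is, the constrained problem above becomes roughly

$$\min_{\theta}\; f(\theta) + \lambda\,\mathrm{ReLU}\!\big(g(\theta)\big), \qquad \mathrm{ReLU}(x) = \max(0, x),$$

where $\lambda > 0$ is a penalty coefficient that weights constraint violations.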
After instantiating this penalized form, the rough iterative procedure is as follows (see the sketch after this list):
- Find a feasible solution by repeatedly nudging the parameters in the direction of $-\nabla_\theta\,\mathrm{ReLU}(g(\theta))$ until $g(\theta) \le 0$
- Optimize a soft loss that simultaneously minimizes $f(\theta)$ while keeping $g(\theta)$ within the feasible region
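As a minimal illustration of this two-phase procedure on a toy scalar problem (the functions, step size, and penalty coefficient below are made-up values for the example; the paper applies the same idea to the ICL objectives, not to these toy functions):

```python
# Toy problem: minimize f(x) = (x - 3)^2 subject to g(x) = x - 1 <= 0.
# The constrained optimum is x = 1.
f = lambda x: (x - 3.0) ** 2        # objective
df = lambda x: 2.0 * (x - 3.0)      # gradient of f
g = lambda x: x - 1.0               # constraint (feasible when g(x) <= 0)
dg = lambda x: 1.0                  # gradient of g

lr, lam = 0.01, 10.0   # step size and penalty coefficient (assumed values)
x = 5.0                # infeasible starting point

# Phase 1: nudge x along -grad ReLU(g(x)) until the constraint is satisfied.
while g(x) > 0:
    x -= lr * dg(x)

# Phase 2: minimize the soft loss f(x) + lam * ReLU(g(x)).
for _ in range(1000):
    relu_grad = dg(x) if g(x) > 0 else 0.0   # subgradient of ReLU(g(x))
    x -= lr * (df(x) + lam * relu_grad)

print(f"x = {x:.2f}")  # hovers near the constrained optimum x = 1
```

Note that $\lambda$ must be large enough for the penalty term to dominate near the constraint boundary; otherwise the iterates drift into the infeasible region.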
Compared to Lagrangian-based approaches, the penalty method has the advantages of
- Simpler algorithmic implementation
- Performing well empirically
Improving Constraint Adjustment
Constraint adjustment, in other words, amounts to finding a decision boundary between the expert and the agent.
We can further improve constraint adjustment by noticing that there can be an overlap between the expert and agent data used to compute the terms $J^{\pi_{\text{mix}}}_\mu(c)$ and $J^{\pi^E}_\mu(c)$, which hampers convergence.
Consider the soft loss objective for (3):
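One plausible form of this objective, consistent with the penalty method above and the notation in the cases below (the exact expression in the paper may differ), is

$$\mathcal{L}(c) = -\,J^{\pi_{\text{mix}}}_\mu(c) + \lambda\,\mathrm{ReLU}\!\big(J^{\pi^E}_\mu(c) - \beta\big),$$

where the first term pushes up the constraint value of the agent mixture and the ReLU term penalizes making the expert policy infeasible.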
Note that the ReLU penalty term in this soft loss vanishes depending on whether $J^{\pi^E}_\mu(c) - \beta \leq 0$ or not. There are two cases:
- If $J^{\pi^E}_\mu(c) - \beta \leq 0$, the ReLU term vanishes and the objective only increases the constraint value of the agent mixture. If there are expert-like trajectories in the agent data, we end up increasing the constraint value across these trajectories as well. This behavior is undesirable, as it makes the expert policy more infeasible.
- If $J^{\pi^E}_\mu(c) - \beta > 0$, the ReLU term is non-zero. If there are expert-like trajectories in the agent data, taking the gradient of the soft loss yields two contrasting gradient terms, one trying to increase and one trying to decrease the constraint value across the same expert-like trajectories. This diminishes the effect of the ReLU penalty, so more iterations are needed for the constraint adjustment to converge.
To solve the aforementioned issues, the paper proposes two reweightings:
- Reweight the policies in the mixture $\pi_{\text{mix}}$: policies dissimilar to the expert policy are favoured more in the calculation of $J^{\pi_{\text{mix}}}_\mu(c)$
- Reweight the individual trajectories within the expectation $J^{\pi_{\text{mix}}}_\mu(c)$: expert or expert-like trajectories receive little or negligible weight
Both reweightings can be performed using a density estimator. The idea is to learn the density of expert-like state-action pairs and compute the negative log-probability (NLP) of any given trajectory's state-action pairs to determine whether they are expert-like or not. In practice, this is estimated by computing the mean and standard deviation of the NLP of the expert state-action pairs; at test time, we check whether a given NLP is within one standard deviation of the mean.
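A minimal sketch of this expert-likeness check, using scikit-learn's kernel density estimator as a stand-in for whatever density model the paper actually uses (the function names, bandwidth, and placeholder data below are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KernelDensity  # stand-in density estimator

def fit_expert_density(expert_sa, bandwidth=0.5):
    """Fit a density model on expert (state, action) pairs and record the
    mean/std of their negative log-probability (NLP)."""
    kde = KernelDensity(bandwidth=bandwidth).fit(expert_sa)
    nlp = -kde.score_samples(expert_sa)      # NLP of each expert pair
    return kde, nlp.mean(), nlp.std()

def is_expert_like(kde, mu, sigma, sa_pairs):
    """Flag state-action pairs whose NLP is within one std of the expert mean."""
    nlp = -kde.score_samples(sa_pairs)
    return np.abs(nlp - mu) <= sigma         # boolean mask, one entry per pair

# Usage sketch with random placeholder data (4 = state_dim + action_dim here).
expert_sa = np.random.randn(1000, 4)         # hypothetical expert pairs
agent_sa = np.random.randn(200, 4)           # hypothetical agent pairs
kde, mu, sigma = fit_expert_density(expert_sa)
mask = is_expert_like(kde, mu, sigma, agent_sa)  # True -> downweight in (3)
```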