Learning Soft Constraints From Constrained Expert Demonstrations
An algorithm that recovers soft constraints from expert demonstrations.
Ashish Gaurav, Kasra Rezaee, Guiliang Liu, Pascal Poupart
Abstract
For the setting where the reward function is given but the constraints are unknown, the paper proposes a method to recover constraints that satisfactorily explain the expert data. Unlike previous attempts that recover hard constraints, the proposed method captures cumulative soft constraints, which the agent satisfies on average per episode. The problem is solved by iteratively adjusting the constraint function through a constrained optimization procedure until the agent's behavior matches the expert's behavior.
Introduction
Inverse Constraint Learning (ICL)
ICL is the task of extracting the implicit constraint functions associated with given (near-)optimal behaviors. This work proposes a novel method to learn cumulative soft constraints from expert demonstrations, assuming the reward function is given. The difference between hard constraints and soft constraints is:
- Hard Constraints - the constraint is always satisfied for any individual trajectory
- Soft Constraints - the constraints are not necessarily satisfied in every trajectory, but rather in expectation. There may therefore be individual trajectories in which the constraint is violated, as long as it holds on average (the two notions are contrasted formally below).
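Concretely, using standard notation (discount factor $\gamma$, constraint function $c$, threshold $\beta$; this is a rough sketch rather than the paper's exact display), the two notions can be contrasted as

$$\text{hard:}\;\; \sum_{t} \gamma^t c(s_t, a_t) \le \beta \;\;\text{for every trajectory,} \qquad \text{soft:}\;\; \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t} \gamma^t c(s_t, a_t)\right] \le \beta.$$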
Contributions
The contributions of this paper can be summarized as follows:
- A novel formulation and method for ICL that works with arbitrary state-action spaces.
- The first method to learn cumulative soft constraints, accounting for noise and possible constraint violations in the expert demonstrations.
Background
Markov Decision Process
For background on Markov Decision Processes (MDPs), see here.
Reinforcement Learning
The objective of standard RL is to obtain an optimal policy that maximizes the expected long-term discounted reward.
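Roughly, in standard notation (reward $r$, discount factor $\gamma$, trajectories $\tau$ sampled by running policy $\pi$), this objective is

$$\max_{\pi}\; J^{\pi}(r) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right].$$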
Constrained Reinforcement Learning
Constrained Reinforcement Learning's objective is similar to standard RL's, with the addition that the expected cumulative value of each constraint function $c_i$ must not exceed its associated threshold $\beta_i$.
For simplicity, this work considers constrained RL with a single constraint function.
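In the same notation, the constrained RL problem is roughly (the paper's exact formulation may differ in minor details)

$$\max_{\pi}\; J^{\pi}(r) \quad \text{s.t.} \quad J^{\pi}(c_i) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, c_i(s_t, a_t)\right] \le \beta_i \;\; \forall i.$$

With a single constraint function $c$ and threshold $\beta$, this reduces to maximizing $J^{\pi}(r)$ subject to $J^{\pi}(c) \le \beta$.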
Related Work
The majority of previous works consider only hard constraints, which is reasonable in deterministic domains. However, in stochastic domains the probability of violating a constraint can rarely be reduced to zero. One line of work extends the maximum entropy inverse reinforcement learning framework to learn probabilistic constraints that hold with high probability in the expert demonstrations. In that approach, a constraint value is treated as a negative reward; since the probability of a trajectory is proportional to the exponential of its return, this effectively reduces the probability of trajectories with high constraint values. There are other approaches, such as Bayesian ones, but none of them correspond to soft constraints.
Approach
Inverse Constraint Learning: Two Phases
Given a reward function $r$ and expert demonstrations $\mathcal{D}$, we want to output a constraint function $c$ such that the optimal policy obtained under $r$ and $c$ explains the behavior in $\mathcal{D}$.
The approach is based on the template of IRL, alternating between
- a policy optimization phase
- a constraint adjustment phase (the counterpart of IRL's reward adjustment, since here the reward is given and the constraint is learned)
We will first show a theoretical approach, followed by its adaptation to a practical algorithm.
The theoretical approach starts with an empty set of policies $\Pi$ and grows this set by alternating between equations (1) and (2) until convergence.
Equation (1) performs forward constrained RL to find an optimal policy for the current constraint $c$. This optimal policy is then added to the set of policies $\Pi$.
This is followed by equation (2), which adjusts the constraint function to increase the constraint values of the policies in $\Pi$, while keeping the constraint value of the expert policy $\pi^E$ bounded by $\beta$. In summary, the overarching idea behind equation (2) is to maximize the accumulated constraint value of the most feasible policy in $\Pi$.
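Based on the descriptions above, the alternation can be written roughly as follows (a paraphrase of equations (1) and (2), not their exact typeset form):

$$\pi \leftarrow \arg\max_{\pi}\; J^{\pi}(r) \;\;\text{s.t.}\;\; J^{\pi}(c) \le \beta, \qquad \Pi \leftarrow \Pi \cup \{\pi\} \quad (1)$$

$$c \leftarrow \arg\max_{c}\; \min_{\pi\in\Pi}\; J^{\pi}(c) \;\;\text{s.t.}\;\; J^{\pi^E}(c) \le \beta \quad (2)$$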
To summarize this procedure: in each iteration, a new policy is found and then made infeasible by increasing its constraint value past the threshold $\beta$. By doing so, the sequence of policies eventually converges to the expert policy. Intuitively, this is because all policies and trajectories except the expert's become infeasible in the long run. A rigorous proof is given in Theorem 1.
Theorem 1: Assuming there is a unique policy that achieves the expert's expected reward $J^{\pi^E}(r)$, the alternation of the optimization procedures in (1) and (2) converges to a set of policies $\Pi$ such that the last policy $\pi$ added to $\Pi$ is equivalent to the expert policy $\pi^E$, in the sense that $\pi$ and $\pi^E$ generate the same trajectories.
The proof of Theorem 1 can be found here.
However, in practice there are challenges when implementing equations (1) and (2):
- We do not have the expert policy $\pi^E$, only expert trajectories sampled from it
- $\Pi$ could grow very large before convergence is achieved
- Convergence is not guaranteed, and the procedure may converge prematurely
Instead, the paper uses an approximation that replaces (2) with a simpler optimization, referred to below as (3).
Instead of the max-min optimization over the constraint values of the policies in $\Pi$, the paper maximizes the constraint value of a mixture of the policies in $\Pi$, where the mixture is a collection of the optimal policies found so far, each with an associated weight. Although this modification loses the theoretical guarantee of convergence, the paper finds that the algorithm converges empirically in its experiments.
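Concretely, (3) takes roughly the form (again a paraphrase rather than the exact display)

$$c \leftarrow \arg\max_{c}\; J^{\pi_{\text{mix}}}(c) \;\;\text{s.t.}\;\; J^{\pi^E}(c) \le \beta, \quad (3)$$

where $\pi_{\text{mix}}$ denotes the weighted mixture of the policies in $\Pi$.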
Solving the Constrained Optimizations through the Penalty Method
Note that equations (1), (2), and (3) all belong to the following general class of optimization problems:
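In generic notation, with $\theta$ standing for the decision variable (a policy or a constraint function), this class can be sketched as

$$\min_{\theta}\; f(\theta) \quad \text{s.t.} \quad g(\theta) \le 0,$$

where, for example, (3) fits this template with $f(c) = -J^{\pi_{\text{mix}}}(c)$ and $g(c) = J^{\pi^E}(c) - \beta$.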
Most approaches formulate this problem from a Lagrangian perspective as a min-max problem that can be solved by gradient ascent-descent type algorithms. However, Lagrangian formulations are challenging in terms of empirical convergence and often suffer from oscillatory behavior.
Therefore, the paper investigates the penalty method, which converts a constrained problem into an unconstrained one using a non-differentiable ReLU.
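That is, the constrained problem above becomes roughly

$$\min_{\theta}\; f(\theta) + \lambda\,\mathrm{ReLU}\!\big(g(\theta)\big), \qquad \mathrm{ReLU}(x) = \max(0, x),$$

where $\lambda > 0$ is a penalty coefficient that weights constraint violations.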
After instantiating this penalized form, the rough iterative procedure is as follows (see the sketch after this list):
- Find a feasible solution by repeatedly nudging the parameters in the direction of $-\nabla_\theta\,\mathrm{ReLU}(g(\theta))$ until $g(\theta) \le 0$
- Optimize a soft loss that simultaneously minimizes $f(\theta)$ while keeping $g(\theta)$ within the feasible region
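As a minimal illustration of this two-phase procedure on a toy scalar problem (the functions, step size, and penalty coefficient below are made-up values for the example; the paper applies the same idea to the ICL objectives, not to these toy functions):

```python
# Toy problem: minimize f(x) = (x - 3)^2 subject to g(x) = x - 1 <= 0.
# The constrained optimum is x = 1.
f = lambda x: (x - 3.0) ** 2        # objective
df = lambda x: 2.0 * (x - 3.0)      # gradient of f
g = lambda x: x - 1.0               # constraint (feasible when g(x) <= 0)
dg = lambda x: 1.0                  # gradient of g

lr, lam = 0.01, 10.0   # step size and penalty coefficient (assumed values)
x = 5.0                # infeasible starting point

# Phase 1: nudge x along -grad ReLU(g(x)) until the constraint is satisfied.
while g(x) > 0:
    x -= lr * dg(x)

# Phase 2: minimize the soft loss f(x) + lam * ReLU(g(x)).
for _ in range(1000):
    relu_grad = dg(x) if g(x) > 0 else 0.0   # subgradient of ReLU(g(x))
    x -= lr * (df(x) + lam * relu_grad)

print(f"x = {x:.2f}")  # hovers near the constrained optimum x = 1
```

Note that $\lambda$ must be large enough for the penalty term to dominate near the constraint boundary; otherwise the iterates drift into the infeasible region.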
Compared to Lagrangian-based approaches, the penalty method has the advantages of
- Simpler algorithmic implementation
- Performing well empirically
Improving Constraint Adjustment
Constraint adjustment, in other words, amounts to finding a decision boundary between the expert and the agent.
We can further improve constraint adjustment by noticing that there can be an overlap between the expert and agent data used to compute the terms $J^{\pi_{\text{mix}}}_\mu(c)$ and $J^{\pi^E}_\mu(c)$, which hampers convergence.
Consider the soft loss objective for (3):
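One plausible form of this objective, consistent with the penalty method above and the notation in the cases below (the exact expression in the paper may differ), is

$$\mathcal{L}(c) = -\,J^{\pi_{\text{mix}}}_\mu(c) + \lambda\,\mathrm{ReLU}\!\big(J^{\pi^E}_\mu(c) - \beta\big),$$

where the first term pushes up the constraint value of the agent mixture and the ReLU term penalizes making the expert policy infeasible.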
Note that the ReLU penalty term in this soft loss vanishes depending on whether $J^{\pi^E}_\mu(c) - \beta \leq 0$ or not. There are two cases:
- If $J^{\pi^E}_\mu(c) - \beta \leq 0$, the ReLU term vanishes and the objective only increases the constraint value of the agent mixture. If there are expert-like trajectories in the agent data, we end up increasing the constraint value across these trajectories as well. This behavior is undesirable, as it makes the expert policy more infeasible.
- If $J^{\pi^E}_\mu(c) - \beta > 0$, the ReLU term is non-zero. If there are expert-like trajectories in the agent data, taking the gradient of the soft loss yields two contrasting gradient terms, one trying to increase and one trying to decrease the constraint value across the same expert-like trajectories. This diminishes the effect of the ReLU penalty, so more iterations are needed for the constraint adjustment to converge.
To solve the aforementioned issues, the paper proposes two reweightings:
- Reweight the policies in the mixture $\pi_{\text{mix}}$: policies dissimilar to the expert policy are favoured more in the calculation of $J^{\pi_{\text{mix}}}_\mu(c)$
- Reweight the individual trajectories within the expectation $J^{\pi_{\text{mix}}}_\mu(c)$: expert or expert-like trajectories receive little or negligible weight
Both reweightings can be performed using a density estimator. The idea is to learn the density of expert-like state-action pairs and compute the negative log-probability (NLP) of any given trajectory's state-action pairs to determine whether they are expert-like or not. In practice, this is estimated by computing the mean and standard deviation of the NLP of the expert state-action pairs; at test time, we check whether a given NLP is within one standard deviation of the mean.
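A minimal sketch of this expert-likeness check, using scikit-learn's kernel density estimator as a stand-in for whatever density model the paper actually uses (the function names, bandwidth, and placeholder data below are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KernelDensity  # stand-in density estimator

def fit_expert_density(expert_sa, bandwidth=0.5):
    """Fit a density model on expert (state, action) pairs and record the
    mean/std of their negative log-probability (NLP)."""
    kde = KernelDensity(bandwidth=bandwidth).fit(expert_sa)
    nlp = -kde.score_samples(expert_sa)      # NLP of each expert pair
    return kde, nlp.mean(), nlp.std()

def is_expert_like(kde, mu, sigma, sa_pairs):
    """Flag state-action pairs whose NLP is within one std of the expert mean."""
    nlp = -kde.score_samples(sa_pairs)
    return np.abs(nlp - mu) <= sigma         # boolean mask, one entry per pair

# Usage sketch with random placeholder data (4 = state_dim + action_dim here).
expert_sa = np.random.randn(1000, 4)         # hypothetical expert pairs
agent_sa = np.random.randn(200, 4)           # hypothetical agent pairs
kde, mu, sigma = fit_expert_density(expert_sa)
mask = is_expert_like(kde, mu, sigma, agent_sa)  # True -> downweight in (3)
```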