Inverse Reinforcement Learning for Team Sports - Valuing Actions and Players
A method that alternates single-agent IRL to learn a reward function for multiple agents.
Yudong Luo, Oliver Schulte, and Pascal Poupart
rl vs irl
reinforcement learning vs inverse reinforcement learning
- RL: The primary goal of RL is for the agent to learn an optimal policy by interacting with the environment to maximize cumulative rewards.
- IRL: In contrast, the goal of IRL is to infer the underlying reward function from observed behavior or expert demonstrations, rather than learning a policy directly.
abstract
sparse explicit reward
hockey & soccer
combines Q-function learning with inverse reinforcement learning (IRL)
alternates single-agent IRL → reward function for multiple agents
knowledge transfer → combine learned rewards & observed rewards
introduction
sparse explicit reward
- Q-values show little variance
→ inverse reinforcement learning w/ domain knowledge (IRL-DK)
- performance evaluation biased towards offensive players
IRL-DK
IRL → recover the unobserved internal reward function the agent is optimizing
demonstrations & domain knowledge
alternating learning
single-agent IRL for multi-agent Markov games
procedure (sketch below):
- treat B as A’s environment
- learn a reward function for A in the resulting single-agent MDP
- flip the roles and repeat the process
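A minimal sketch of this alternating loop, assuming a tabular two-agent Markov game with a joint transition tensor; `single_agent_irl` and `soft_policy` are hypothetical placeholders for any single-agent IRL solver and policy-improvement step, not the paper's code:

```python
import numpy as np

# Sketch of the alternating loop above for a two-agent Markov game with a
# joint transition tensor T[s, a_home, a_away, s'].
# `single_agent_irl` and `soft_policy` are hypothetical placeholders for a
# single-agent IRL solver (e.g. MaxEnt IRL) and a policy-improvement step.

def marginal_mdp(T, other_pi, other="away"):
    """Fold the other agent's fixed policy into the dynamics, yielding a
    single-agent transition tensor T_single[s, a, s']."""
    if other == "away":
        return np.einsum("sabt,sb->sat", T, other_pi)   # sum out away actions
    return np.einsum("sabt,sa->sbt", T, other_pi)       # sum out home actions

def alternating_irl(T, demos, n_states, n_actions, n_rounds=10):
    # start from uniform policies (behavior cloning is another option)
    pi = {side: np.full((n_states, n_actions), 1.0 / n_actions)
          for side in ("home", "away")}
    reward = {}
    for _ in range(n_rounds):
        for agent, other in (("home", "away"), ("away", "home")):
            T_single = marginal_mdp(T, pi[other], other=other)        # treat other as environment
            reward[agent] = single_agent_irl(T_single, demos[agent])  # placeholder IRL step
            pi[agent] = soft_policy(T_single, reward[agent])          # placeholder policy update
    return reward, pi
```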
contribution
- IRL for multi-agent dynamics
- transfer learning for combining sparse explicit rewards & the learned dense implicit reward
- alternating procedure
Markov Game Model for Ice Hockey
Markov Games and Decision Processes
games extend mdp
mdp is a single-agent markov game w/ k=1
policy
action probability is a function of game state
actions are independent given current game state
game value function
home & away as two agents → same action space
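As a concrete illustration (not the paper's exact estimator), a SARSA-style update can learn a tabular game value function from play-by-play events of the form (state, action, reward, next state, next action):

```python
from collections import defaultdict

# Tabular SARSA-style sketch of learning Q for the home team from play
# sequences; the paper's setting uses richer continuous game features, so
# this is a simplified illustration.

def learn_q(episodes, alpha=0.05, gamma=1.0):
    Q = defaultdict(float)                      # Q[(state, action)]
    for episode in episodes:                    # one episode per play sequence
        for (s, a, r, s2, a2) in episode:       # r is the sparse goal reward
            target = r + gamma * Q[(s2, a2)]    # bootstrap from the next event
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```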
Alternating Learning for Multi-Agent IRL
marginal MDP: the other agent’s fixed policy is absorbed into the environment, so its action choices become part of the transition dynamics
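A standard way to write this marginalization (the notation $T_A$, $\pi_B$ is assumed here, not quoted from the paper): with $B$'s policy held fixed, agent $A$ faces the single-agent dynamics

$$T_A(s' \mid s, a_A) \;=\; \sum_{a_B} \pi_B(a_B \mid s)\, T(s' \mid s, a_A, a_B)$$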
IRL with Domain Knowledge
maxent irl
maximum entropy IRL → interpretable linear model
reward for trajectory $\zeta$ is the cumulative reward of visited states: with reward weights $\theta \in \mathbb{R}^k$ and $r_\theta(s) = \theta^T f_s$, $r_\theta(\zeta) = \sum_{s \in \zeta} \theta^T f_s = \theta^T f_\zeta$
the probability of a demonstrated trajectory $\zeta$ increases exponentially with its reward if the demonstrator follows the maximum entropy distribution; the probability can be estimated by
$$P(\zeta \mid \theta, T) \approx \frac{e^{r_\theta(\zeta)}}{Z(\theta)} \prod_{(s, a, s') \in \zeta} T(s' \mid s, a)$$
where $Z(\theta)$ is the partition function and $T$ is the transition distribution
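A compact sketch of one MaxEnt IRL gradient step under these definitions, assuming a tabular MDP with feature matrix `F` (states x k), transition tensor `T[s, a, s']`, initial state distribution `p0`, and demos given as lists of (state, action); this follows the standard Ziebart-style recipe rather than the paper's exact implementation:

```python
import numpy as np
from scipy.special import logsumexp

def policy_from_reward(F, theta, T, n_iters=50):
    """Soft value iteration under r = F @ theta; returns pi[s, a]."""
    r = F @ theta
    V = np.zeros(T.shape[0])
    for _ in range(n_iters):
        Q = r[:, None] + T @ V          # Q[s, a] = r(s) + sum_s' T[s, a, s'] V[s']
        V = logsumexp(Q, axis=1)        # soft max over actions
    return np.exp(Q - V[:, None])       # softmax (maximum-entropy) policy

def expected_state_visitations(pi, T, p0, horizon):
    """Forward pass: expected visitation frequency of each state."""
    D = p0.copy()
    total = p0.copy()
    for _ in range(horizon - 1):
        D = np.einsum("s,sa,sat->t", D, pi, T)
        total += D
    return total

def maxent_gradient(F, theta, T, p0, demos, horizon):
    # empirical feature expectation from the demonstrations
    f_demo = np.mean([F[[s for s, _ in traj]].sum(axis=0) for traj in demos], axis=0)
    pi = policy_from_reward(F, theta, T)
    D = expected_state_visitations(pi, T, p0, horizon)
    return f_demo - D @ F               # ascend this gradient on theta
```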
maxent irl w/ dk
maxent irl fails to learn the importance of goals due to reward sparsity
rule reward function assigns 1 for a goal state & 0 for all others
want to combine the rule reward w/ the maxent irl reward
$F$ is the state feature matrix (row $s$ is $f_s$), so the learned rewards over all states are $F\theta$
mmd - maximum mean discrepancy
measures the difference between two probability distributions
unbiased estimate of the squared MMD from samples $X = \{x_i\}_{i=1}^m$ and $Y = \{y_j\}_{j=1}^n$:
$$\widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{m(m-1)} \sum_{i \neq i'} k(x_i, x_{i'}) + \frac{1}{n(n-1)} \sum_{j \neq j'} k(y_j, y_{j'}) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)$$
where $k(\cdot, \cdot)$ is a kernel function
the optimal $\theta$ is derived by maximizing the maxent irl log-likelihood with the squared MMD between the learned rewards $F\theta$ and the rule rewards as a regularization penalty
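A small sketch of the unbiased estimator above with an RBF kernel; the bandwidth `sigma` is an illustrative choice, not a value from the paper:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # pairwise RBF kernel between two 1-D sample vectors
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    # drop diagonal terms (i == i', j == j') for the unbiased estimate
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()
```

In the combined objective this would be evaluated between the learned rewards and the rule rewards, e.g. `mmd2_unbiased(F @ theta, rule_rewards)` as the penalty term.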
evaluating learned reward and policy
reward density
goal is to complement the sparse goal reward and cover many more situations
want the variance of the learned rewards to be substantially higher than that of the goal rewards
reflected in the results
policy evaluation
to evaluate how well the reward function rationalizes player behavior
the IRL-DK reward rationalizes the demonstrated behavior best among the compared methods
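One common way to score how well a learned reward rationalizes behavior is the mean log-likelihood of demonstrated actions under the soft-optimal policy it induces; the notes do not spell out the paper's exact metric, so this is an illustrative choice reusing `policy_from_reward` from the sketch above:

```python
import numpy as np

def demo_log_likelihood(F, theta, T, demos):
    pi = policy_from_reward(F, theta, T)    # pi[s, a] under the learned reward
    log_liks = [np.log(pi[s, a] + 1e-12) for traj in demos for (s, a) in traj]
    return float(np.mean(log_liks))         # higher = behavior better rationalized
```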
learning performance
with the MMD regularization, learning is substantially more stable and converges
player ranking
action impact values
impact of an action = the difference it makes to the team's value (Q); players are ranked by aggregating the impact of their actions
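A short sketch of impact-based ranking using a learned Q table; the event format (player, s, a, s2, a2) is an assumption for illustration:

```python
from collections import defaultdict

def rank_players(events, Q):
    impact = defaultdict(float)
    for (player, s, a, s2, a2) in events:
        impact[player] += Q[(s2, a2)] - Q[(s, a)]   # difference made by the action
    # players sorted by total impact, highest first
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)
```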