Inverse Reinforcement Learning for Team Sports - Valuing Actions and Players

A method that alternates single-agent IRL to learn a reward function for multiple agents.

Yudong Luo, Oliver Schulte, and Pascal Poupart

rl vs irl

reinforcement learning vs inverse reinforcement learning

abstract

sparse explicit reward

hockey & soccer

combines Q-function learning with inverse reinforcement learning (IRL)

alternates single-agent IRL → reward function for multiple agents

knowledge transfer → combine learned rewards & observed rewards

introduction

sparse explicit reward

  1. Q-values show little variance

    → inverse reinforcement learning w/ domain knowledge (IRL-DK)

  2. performance evaluation biased towards offensive players

IRL-DK

IRL → optimizing unobserved internal reward function

demonstrations & domain knowledge

alternating learning

single-agent IRL for multi-agent Markov games

procedure (as described across the paper):

  1. fix the other agent's policy $\pi_{-i}$
  2. reduce the game to a single-agent MDP for agent $i$ (the marginal MDP defined below)
  3. run single-agent IRL w/ domain knowledge to recover agent $i$'s reward & policy
  4. swap home & away agents and repeat until convergence

(sketched in code below)
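
A minimal sketch of the alternating loop, assuming placeholder helpers (`single_agent_irl_dk`, `solve_mdp`, and the `game` interface are made up, not the authors' code):

```python
# minimal sketch of the alternating loop; single_agent_irl_dk, solve_mdp,
# and the `game` interface are placeholders, not the authors' code
def alternating_irl(game, single_agent_irl_dk, solve_mdp, n_iters=10):
    policies = {'home': game.random_policy(), 'away': game.random_policy()}
    rewards = {}
    for _ in range(n_iters):
        for agent, other in (('home', 'away'), ('away', 'home')):
            # fix the other agent's policy and build the marginal MDP M(pi_{-i})
            mdp = game.marginal_mdp(agent, fixed_policy=policies[other])
            # single-agent IRL-DK recovers agent i's reward on the marginal MDP
            rewards[agent] = single_agent_irl_dk(mdp, game.demonstrations(agent))
            # re-solve for agent i's policy under the new reward
            policies[agent] = solve_mdp(mdp, rewards[agent])
    return rewards, policies
```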

contribution

  1. IRL for multi-agent dynamics
  2. transfer learning for combining
    1. sparse explicit rewards
    2. learned dense implicit reward
  3. alternating procedure

Markov Game Model for Ice Hockey

Markov Games and Decision Processes

games extend mdp

mdp is a single-agent markov game w/ k=1

policy $\pi_i : S \to PD(A_i)$

action probability is a function of game state

actions are independent given current game state

game value function $G^{\pi_i, \pi_{-i}}_i(s)$

home & away as two agents → same action space

Alternating Learning for Multi-Agent IRL

marginal MDP $M(\pi_{-i}) \coloneqq \langle S, A_i, r^\prime, \gamma, T^\prime \rangle$ where

$$
\begin{aligned}
r^\prime(s,a_i) &= \sum_{a_{-i}} r_i(s,a_i,a_{-i}) \cdot \pi_{-i}(a_{-i}|s) \\
T^\prime(s^\prime|a_i,s) &= \sum_{a_{-i}} T(s^\prime|a_i, a_{-i}, s) \cdot \pi_{-i}(a_{-i}|s)
\end{aligned}
$$
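
A small numpy sketch of the marginalisation above; the dense-array layout (shapes of the joint reward, joint transition, and opponent policy) is my assumption:

```python
import numpy as np

# sketch of the marginalisation over the other agent's actions; assumed shapes:
#   r_i:  (S, A_i, A_-i)      joint reward for agent i
#   T:    (S, A_i, A_-i, S')  joint transition probabilities
#   pi_o: (S, A_-i)           fixed policy of the other agent
def marginal_mdp(r_i, T, pi_o):
    # r'(s, a_i) = sum_{a_-i} r_i(s, a_i, a_-i) * pi_{-i}(a_-i | s)
    r_prime = np.einsum('sab,sb->sa', r_i, pi_o)
    # T'(s' | a_i, s) = sum_{a_-i} T(s' | a_i, a_-i, s) * pi_{-i}(a_-i | s)
    T_prime = np.einsum('sabt,sb->sat', T, pi_o)
    return r_prime, T_prime
```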

IRL with Domain Knowledge

maxent irl

maximum entropy IRL → interpretable linear model

reward for trajectory $\zeta$ is cumulative reward of visited states

$$
r(\zeta) = \sum_{s_j \in \zeta} \theta^T f_{s_j} = \theta^T f_\zeta
$$

with reward weights $\theta \in \mathbb{R}^k$ and $r_\theta(s) = \theta^T f_s$
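
A tiny numerical illustration of the linear reward model; the weights and state features below are made up:

```python
import numpy as np

# made-up weights and state features, just to show r(zeta) = theta^T f_zeta
theta = np.array([0.5, -0.2, 1.0])        # reward weights theta in R^k
f_states = np.array([[1.0, 0.0, 0.0],     # f_{s_1}
                     [0.0, 1.0, 1.0],     # f_{s_2}
                     [1.0, 1.0, 0.0]])    # f_{s_3}  (states visited by zeta)

r_states = f_states @ theta               # r_theta(s_j) = theta^T f_{s_j}
f_zeta = f_states.sum(axis=0)             # trajectory feature counts
r_zeta = theta @ f_zeta                   # r(zeta) = theta^T f_zeta
assert np.isclose(r_zeta, r_states.sum()) # equals the cumulative state reward
```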

probability of demonstrated trajectory $\zeta$ increases exponentially with higher rewards

if the trajectory distribution follows the maximum entropy principle

probability can be estimated by

$$
P(\zeta | \theta, T) = \frac{e^{r_\zeta}}{Z(\theta, T)} \prod_{s_{t+1}, a_t, s_t \in \zeta} P_T(s_{t+1}|a_t,s_t)
$$

where $Z(\theta, T)$ is the partition function and $T$ is the transition distribution
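
A toy illustration of the formula; in practice $Z(\theta, T)$ sums over all trajectories, here it is approximated by normalising over a small enumerated candidate set (my simplification):

```python
import numpy as np

# toy version of P(zeta | theta, T): normalise over an enumerated candidate set
# instead of the full partition function Z(theta, T) (a simplification)
def trajectory_probs(F, trans_prob, theta):
    # F:          (n_traj, k) trajectory feature counts f_zeta
    # trans_prob: (n_traj,)   prod_t P_T(s_{t+1} | a_t, s_t) for each trajectory
    weights = np.exp(F @ theta) * trans_prob   # e^{r_zeta} * transition product
    return weights / weights.sum()             # toy partition function
```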

maxent irl w/ dk

maxent irl fails to learn importance of goals due to sparsity

rule reward function assigns 1 for goal & 0 for others

want to

combine rule reward w/ maxent irl

$$
\hat{\theta} = \argmax_\theta L(\theta) + \lambda k(r_\theta, r_K)
$$

where $r_\theta = \theta^T \psi$, $r_K = \theta^T_K \psi$, and $\psi = [f_{s_1}, \dots, f_{s_n}] \in \mathbb{R}^{k \times n}$ is the state feature matrix

mmd - maximum mean discrepancy

a measure of the difference between two probability distributions

unbiased estimation of squared MMD

$$
\begin{aligned}
\hat{d}^2_{\mathcal{H}_k}(X,Y) &= \frac{1}{n^2_x} \sum^{n_x}_{i=1}\sum^{n_x}_{j=1}k(x_i,x_j) + \frac{1}{n^2_y}\sum^{n_y}_{i=1}\sum^{n_y}_{j=1}k(y_i,y_j) \\
&\quad - \frac{2}{n_x n_y}\sum^{n_x}_{i=1}\sum^{n_y}_{j=1}k(x_i,y_j)
\end{aligned}
$$

where $k(x, x^\prime)$ is a kernel function
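
A numpy sketch of the squared-MMD estimate above for 1-D samples; the Gaussian (RBF) kernel and its bandwidth are assumptions for illustration:

```python
import numpy as np

# sketch of the squared-MMD estimate above for 1-D samples;
# the RBF kernel and its bandwidth are assumptions for illustration
def rbf_kernel(x, y, sigma=1.0):
    d = x[:, None] - y[None, :]                  # pairwise differences
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # each .mean() is the corresponding (1/n^2) double sum in the formula
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())
```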

the optimal $\hat{\theta}$ is derived by

$$
\begin{aligned}
\hat{\theta} &= \argmax_\theta L(\theta) - \alpha\hat{d}^2_{\mathcal{H}_k}(R_\theta, R_K) \\
&= \argmax_\theta L(\theta) + 2\alpha k(r_\theta, r_K)
\end{aligned}
$$
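
The second line presumably follows because $R_\theta$ and $R_K$ each contribute a single reward vector and $k(x,x)$ is constant for a stationary kernel, so only the cross term depends on $\theta$. A sketch of that extra objective term; the RBF kernel, bandwidth, and $\alpha$ are illustrative choices, not the paper's exact setup:

```python
import numpy as np

# sketch of the extra objective term 2*alpha*k(r_theta, r_K); kernel choice,
# bandwidth, and alpha are illustrative, not the paper's exact setup
def knowledge_term(theta, theta_K, psi, alpha=1.0, sigma=1.0):
    r_theta = theta @ psi                  # model rewards over the n states
    r_K = theta_K @ psi                    # rule-based (domain-knowledge) rewards
    k = np.exp(-np.sum((r_theta - r_K) ** 2) / (2 * sigma ** 2))
    return 2 * alpha * k                   # added to the MaxEnt log-likelihood L(theta)
```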

evaluating learned reward and policy

reward density

goal is to complement sparse reward

to cover many situations

want

variance of learned rewards to be substantially higher than that of the sparse goal rewards

reflected in results

policy evaluation

to evaluate how well the reward function rationalizes player behavior

the IRL-DK reward yields the best-performing policy among the compared methods

learning performance

with regularization, learning is much more stable and converges

player ranking

action impact values

difference made by an action

$$
\text{impact}_{\{H,A\}}(s,a) \equiv Q^{\pi^{\hat{\theta}}_{\{H,A\}}}_{\{H,A\}}(s,a) - V^{\pi^{\hat{\theta}}_{\{H,A\}}}_{\{H,A\}}(s)
$$
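
A sketch of how impact could feed player ranking by summing a player's impacts over their actions; the event format and tabular Q/V arrays are assumptions for illustration:

```python
# sketch of impact-based player ranking; the event format (player, state, action)
# and the tabular Q/V arrays are assumptions for illustration
def impact(Q, V, s, a):
    return Q[s, a] - V[s]                  # impact(s, a) = Q(s, a) - V(s)

def rank_players(events, Q, V):
    totals = {}
    for player, s, a in events:            # accumulate each player's action impacts
        totals[player] = totals.get(player, 0.0) + impact(Q, V, s, a)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```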