Uncertainty-Aware Reinforcement Learning for Risk-Sensitive Player Evaluation in Sports Game

An uncertainty-aware Reinforcement Learning (RL) framework to learn a risk-sensitive player evaluation metric from stochastic game dynamics.

G Liu, Y Luo, O Schulte, P Poupart


previous methods

measure only impact of actions on desirable outcomes (e.g. goals)

w/out risk induced by stochastic game dynamics

this paper

risk sensitive player evaluation metric from stochastic game dynamics

  1. aleatoric uncertainty

    intrinsic stochasticity in a sports game

  2. epistemic uncertainty

    model’s insufficient knowledge regarding Out-of-Distribution (OoD) samples

Risk-sensitive Game Impact Metric (RiGIM)

measure performance by conditioning on a specific confidence level


cannot differentiate risk-seeking & risk-averse

want to:

model distributions of action values

⇒ use distributional Reinforcement Learning

to predict the supporting quantiles

previous dRL

deterministic transitions (Atari etc)

sport ⇒ stochastic game dynamics & complex context features

fixed dataset w/out exploration

main idea of framework to evaluate types of uncertainties

  1. Aleatoric uncertainty

    the intrinsic stochasticity of game dynamics

    caused by stochastic rewards, transition dynamics, and policies

    can be captured by

    distributional Bellman operator

    by implementing

    Temporal-Difference (TD) learning

  2. Epistemic uncertainty

    due to finite training samples and OoD state-action pairs

    can overcome if given

    sufficient exploration in the environment

    modelled with

    Feature-Space Conditional Normalizing Flow (FS-CNF)


  1. uncertainty-aware RL that enables post-hoc calibration

    according to aleatoric and epistemic uncertainties

  2. distributional Bellman operator captures

    aleatoric uncertainty with action-value distributions

  3. feature-space density estimator that estimates the epistemic uncertainty
  4. RiGIM is the first risk-sensitive metric

uncertainty-aware RL for player evaluation

Markov Game model

finite-horizon Markov game model

evaluate players by how much their actions influence goal scoring

games are broken down into goal-scoring episodes

  1. begins after a goal
  2. terminates when the next goal is scored

allows faster model convergence and more accurate evaluations
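
A minimal sketch of the episode split described above. The event format (a list of dicts with a boolean `goal` flag) is a hypothetical stand-in, not the dataset schema used in the paper.

```python
from typing import Dict, List


def split_into_goal_episodes(events: List[Dict]) -> List[List[Dict]]:
    """Split a chronological event stream into goal-scoring episodes.

    An episode starts right after a goal (or at the start of the game)
    and ends with the event in which the next goal is scored.
    """
    episodes, current = [], []
    for event in events:
        current.append(event)
        if event.get("goal", False):  # hypothetical flag marking a goal event
            episodes.append(current)
            current = []
    if current:  # trailing events with no goal before the game ends
        episodes.append(current)
    return episodes


# Made-up event stream: two goals, then play continues until the game ends.
game = [
    {"action": "pass"},
    {"action": "shot", "goal": True},
    {"action": "faceoff"},
    {"action": "shot", "goal": True},
    {"action": "pass"},
]
episodes = split_into_goal_episodes(game)
```

Here the last episode has no goal; whether such trailing events are kept or dropped is a design choice the notes do not specify.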

to alleviate partial-observability

includes game history

no guarantee that the training and testing distributions are the same

to deal with OoD

perform post-hoc calibration of predicted action values

by modelling epistemic uncertainty

since aleatoric uncertainty can be influenced by epistemic uncertainty

filter OoD samples

modelling the uncertainty of action values

distributional RL for aleatoric uncertainty

learns distribution of random variable $Z_k(s_t, a_t)$

of future goals when a player on team k performs action $a_t$ in state $s_t$

sum of discounted rewards $\sum^{T_H}_{\iota = t} \gamma^\iota R_{k,\iota} (S_\iota, A_\iota)$ where $A_\iota \sim \pi (\cdot |S_\iota)$
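
A small worked example of the discounted sum, for one sampled reward sequence. It keeps the exponent exactly as written in the notes ($\gamma^\iota$ with the absolute index $\iota$); the reward values and the function name are made up for illustration.

```python
def discounted_return(rewards, gamma, t=0):
    """Sum gamma**i * rewards[i] for i = t .. len(rewards) - 1.

    The exponent uses the absolute time index i, mirroring the notes.
    """
    return sum(gamma ** i * rewards[i] for i in range(t, len(rewards)))


# Made-up episode: the team scores on the third step.
rewards = [0.0, 0.0, 1.0]
g = discounted_return(rewards, gamma=0.9)  # 0.9**2 * 1.0
```

$Z_k(s_t, a_t)$ is the random variable whose samples are such sums, so one trajectory gives one draw from its distribution.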

following Quantile-Regression-DQN method

represent distribution of Z by a

uniform mixture of N support quantiles

$$ \hat{Z}_k(s_t,a_t) = \frac{1}{N}\sum^N_{i=1} \delta_{\theta_{k,i} (s_t,a_t)} $$

where $\theta_{k,i}$ estimates the quantile at quantile level $\hat{\tau}_i = \frac{\tau_{i-1} + \tau_i}{2}$ and $\delta_{\theta_{k,i}}$ denotes a Dirac distribution at $\theta_{k,i}$
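
The quantile representation can be sketched as follows, assuming evenly spaced levels $\tau_i = i/N$ as in QR-DQN. The atom values in `theta` are made up, and the helper names are hypothetical.

```python
import numpy as np


def quantile_midpoints(n):
    """Return tau_hat_i = (tau_{i-1} + tau_i) / 2, assuming tau_i = i / n."""
    taus = np.arange(n + 1) / n
    return (taus[:-1] + taus[1:]) / 2


def mixture_mean(theta):
    """Mean of the uniform mixture of Diracs located at the atoms theta."""
    return float(np.mean(theta))


tau_hat = quantile_midpoints(4)          # midpoints of [0, .25, .5, .75, 1]
theta = np.array([0.0, 0.1, 0.3, 0.8])   # made-up quantile values for one (s, a)
mean_value = mixture_mean(theta)         # expected action value under Z_hat
```

Averaging the atoms recovers the usual expected action value; the spread of the atoms is the extra information the distributional view keeps.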

distributional bellman operator

the stochastic process can be captured by a distributional Bellman operator $\mathcal{T}^\pi$

$$ \mathcal{T}^\pi Z_k(s_t,a_t) :\overset{D}{=} R_k(s_t,a_t) + \gamma Z_k(S_{t+1},A_{t+1}) $$

where $:\overset{D}{=}$ means both sides have the same distribution

estimate supporting quantiles of Z by minimizing quantile Huber loss

$$ \frac{1}{N} \sum^N_{i=1}\sum^N_{i^\prime = 1}\rho^\eta_{\hat{\tau}_i}\big(r + \gamma\theta_{k,i^\prime}(s_{t+1},a_{t+1}) - \theta_{k,i}(s_t,a_t)\big) $$
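
A NumPy sketch of this loss for a single transition. It assumes the QR-DQN form of the quantile Huber loss, $\rho^\eta_\tau(u) = |\tau - \mathbb{1}\{u < 0\}|\, L_\eta(u)$, and midpoint quantile levels. The function names and inputs are illustrative, and no neural network or gradient step is included.

```python
import numpy as np


def huber(u, eta=1.0):
    """Huber loss L_eta(u): quadratic near zero, linear in the tails."""
    abs_u = np.abs(u)
    return np.where(abs_u <= eta, 0.5 * u ** 2, eta * (abs_u - 0.5 * eta))


def quantile_huber_loss(theta, theta_next, r, gamma, eta=1.0):
    """Quantile Huber loss between current atoms and Bellman targets.

    theta:      atoms theta_i(s_t, a_t), shape (N,)
    theta_next: atoms theta_i'(s_{t+1}, a_{t+1}), shape (N,)
    """
    n = theta.shape[0]
    tau_hat = (np.arange(n) + 0.5) / n       # midpoint quantile levels
    target = r + gamma * theta_next          # one target per next-state atom i'
    u = target[None, :] - theta[:, None]     # u[i, i'] = target_i' - theta_i
    weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))
    return float(np.sum(weight * huber(u, eta)) / n)


theta = np.array([0.0, 0.5, 1.0])        # made-up current atoms
theta_next = np.array([0.2, 0.6, 1.1])   # made-up next-state atoms
loss = quantile_huber_loss(theta, theta_next, r=0.0, gamma=0.9)
```

The asymmetric weight is what pushes each atom toward its own quantile level instead of toward the mean of the targets.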

Proposition 1. Assume that Bellman consistency holds, i.e. $\hat{Z} :\triangleq R + \gamma P^\pi \hat{Z}$, where $\hat{Z}, R$ are vector-valued random variables and $P^\pi$ is the transition matrix of the stationary policy $\pi$, so that $P^\pi_{(s,a),(s^\prime,a^\prime)} = P(s^\prime | s,a)\pi(a^\prime|s^\prime)$. Then the uncertainty of the action-value distributions $\hat{Z}$ under an entropy measure $H(\cdot)$ is given by

$$ H(\hat{Z}) = H[R] - |\mathcal{A}||\mathcal{S}|\log(1-\gamma) + \log|\det(\text{d}^\pi)| $$

where $\text{d}^\pi = (1 - \gamma)(I - \gamma P^\pi)^{-1} \in [0,1]^{|\mathcal{S}||\mathcal{A}|\times|\mathcal{S}||\mathcal{A}|}$ is the induced matrix for distributions over state-action tuples by following policy $\pi$ and transition $P_\tau$.

proposition 1 disentangles the entropy of Z into

  1. entropy of reward variables that quantifies the uncertainty of current rewards
  2. the uncertainty induced by discount factor
  3. log-absolute determinant of induced distribution matrix
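
A sketch of where the three terms can come from. It assumes $I - \gamma P^\pi$ is invertible and that $H$ behaves like differential entropy under an invertible linear map, $H(AX) = H(X) + \log|\det A|$; both assumptions are added here and are not stated in the notes.

$$
\begin{aligned}
\hat{Z} = R + \gamma P^\pi \hat{Z} \;&\Rightarrow\; \hat{Z} = (I - \gamma P^\pi)^{-1} R \\
H(\hat{Z}) &= H[R] + \log\left|\det (I - \gamma P^\pi)^{-1}\right| \\
(I - \gamma P^\pi)^{-1} = \tfrac{1}{1-\gamma}\,\text{d}^\pi \;&\Rightarrow\; \det (I - \gamma P^\pi)^{-1} = (1-\gamma)^{-|\mathcal{S}||\mathcal{A}|} \det(\text{d}^\pi) \\
H(\hat{Z}) &= H[R] - |\mathcal{A}||\mathcal{S}|\log(1-\gamma) + \log|\det(\text{d}^\pi)|
\end{aligned}
$$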

proposition 1 demonstrates that

key components for representing aleatoric uncertainty can be captured by Z when Bellman consistency is reached by learning

cannot be generalized due to insufficient exploration

need to estimate epistemic uncertainty

density estimator for epistemic uncertainty

feature-space conditional normalizing flow (FS-CNF)

  1. feature extractor

    to avoid feature collapse (mapping OoD to iD)

    an ideal feature extractor is subject to a bi-Lipschitz constraint

    $$ \beta_1\|x_1-x_2\|_I \geq \|f_\theta(x_1) - f_\theta(x_2)\|_F \geq \beta_2\|x_1-x_2\|_I \quad \forall x_1, x_2 \in D $$

    sensitive to distance & smooth

  2. density estimator

    masked auto-regressive flow (MAF)
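
To make the bi-Lipschitz idea concrete, the sketch below measures the empirical range of the ratio $\|f(x_1) - f(x_2)\| / \|x_1 - x_2\|$ over sampled pairs. This is only a diagnostic on a finite sample, not a way of enforcing the constraint, and the two toy feature maps are hypothetical. A minimum ratio near zero signals feature collapse, where distinct inputs map to the same feature.

```python
import numpy as np


def lipschitz_ratio_range(f, xs):
    """Return (min, max) of ||f(x1) - f(x2)|| / ||x1 - x2|| over distinct pairs.

    The min is an empirical stand-in for the lower constant (beta_2) and the
    max for the upper constant (beta_1), on the sampled pairs only.
    """
    feats = np.array([f(x) for x in xs])
    ratios = []
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            d_in = np.linalg.norm(xs[i] - xs[j])
            if d_in > 0:
                ratios.append(np.linalg.norm(feats[i] - feats[j]) / d_in)
    return min(ratios), max(ratios)


rng = np.random.default_rng(0)
xs = rng.normal(size=(20, 3))  # made-up inputs

# Distance-preserving toy map: every ratio equals 2.
lo, hi = lipschitz_ratio_range(lambda x: 2.0 * x, xs)

# Collapsed toy map: every input lands on the same feature, so ratios are 0.
lo_c, hi_c = lipschitz_ratio_range(lambda x: np.zeros(2), xs)
```

With a collapsed extractor, a density estimator in feature space cannot tell OoD inputs from in-distribution ones, which is why the lower bound matters.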

player evaluation

risk-sensitive impact metric

high input density indicates low epistemic uncertainty

low input density indicates high epistemic uncertainty

propose RiGIM based on the sum of discounted rewards $\hat{Z}_k(s,a)$ and the input density $p(\cdot|z_E)$, where $c$ is the confidence level

$$ \phi_k(s_{t+1},a_{t+1},c) = [\hat{Z}^c_k(s_{t+1},a_{t+1})-\hat{Z}^c_k(s_t,a_t)]\,\mathbb{I}_{p(\cdot|z_E)\geq\epsilon} $$

$$ RiGIM_l(c) = \sum_{(s,a)\in D^\prime} n(s,a,l) \times \phi_k(s,a,c) $$
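
A hedged sketch of how the metric could be computed from predicted quantiles and densities. Several points are assumptions: $\hat{Z}^c_k$ is read as the $c$-quantile of the predicted distribution, the density is taken to be that of the evaluated state-action pair, and $n(s,a,l)$ is taken to be the number of times player $l$ performs $a$ in $s$, handled here by listing each occurrence as its own record. The record layout and all numbers are made up.

```python
import numpy as np


def quantile_value(theta, c):
    """Assumption: Z^c is the c-quantile of the predicted atoms theta."""
    return float(np.quantile(theta, c))


def impact(theta_t, theta_next, density, c, eps):
    """phi = [Z^c(s_{t+1}, a_{t+1}) - Z^c(s_t, a_t)] * 1[density >= eps]."""
    if density < eps:  # low density: treated as OoD and filtered out
        return 0.0
    return quantile_value(theta_next, c) - quantile_value(theta_t, c)


def rigim(records, player, c, eps):
    """Sum impacts over all recorded actions of one player."""
    return sum(
        impact(r["theta_t"], r["theta_next"], r["density"], c, eps)
        for r in records
        if r["player"] == player
    )


records = [
    {"player": "A", "theta_t": np.array([0.0, 0.1, 0.2]),
     "theta_next": np.array([0.1, 0.3, 0.5]), "density": 0.8},
    {"player": "A", "theta_t": np.array([0.0, 0.1, 0.2]),
     "theta_next": np.array([0.5, 0.6, 0.9]), "density": 0.01},  # filtered
]
score = rigim(records, "A", c=0.5, eps=0.05)
```

The second record has a large apparent impact but falls below the density threshold, so it contributes nothing; this is the post-hoc calibration step in miniature.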

  1. risk-averse with large c → better sensitivity to bad outcomes
  2. risk-seeking with small c → better sensitivity to positive outcomes

risk-seeking benefits offensive players while risk-averse benefits defensive players