Uncertainty-Aware Reinforcement Learning for Risk-Sensitive Player Evaluation in Sports Game
An uncertainty-aware Reinforcement Learning (RL) framework to learn a risk-sensitive player evaluation metric from stochastic game dynamics.
G Liu, Y Luo, O Schulte, P Poupart
abstract
previous methods
measure only the expected impact of actions on goal scoring
w/out the risk induced by stochastic game dynamics
this paper
learns a risk-sensitive player evaluation metric from stochastic game dynamics
-
aleatoric uncertainty
intrinsic stochasticity in a sports game
-
epistemic uncertainty
model’s insufficient knowledge regarding Out-of-Distribution (OoD) state-action pairs
Risk-sensitive Game Impact Metric (RiGIM)
measure performance by conditioning on a specific confidence level
introduction
previous metrics cannot differentiate risk-seeking & risk-averse plays
want to:
model distributions of action values
⇒ use distributional Reinforcement Learning
to predict the supporting quantiles
previous dRL
deterministic transitions (Atari etc)
sport ⇒ stochastic game dynamics & complex context features
⇒ fixed dataset w/out exploration
main idea of the framework: evaluate two types of uncertainty
-
Aleatoric uncertainty
the intrinsic stochasticity of game dynamics
caused by stochastic rewards, transition dynamics, and policies
can be captured by
distributional Bellman operator
by implementing
Temporal-Difference (TD) learning
-
Epistemic uncertainty
due to finite training samples and OoD state-action pairs
could be overcome given
sufficient exploration in the environment
modelled with
Feature-Space Conditional Normalizing Flow (FS-CNF)
contribution
-
uncertainty-aware RL that enables post-hoc calibration
according to aleatoric and epistemic uncertainties
-
distributional Bellman operator captures
aleatoric uncertainty with action-value distributions
- feature-space density estimator that estimates the epistemic uncertainty
- RiGIM is the first risk-sensitive metric for player evaluation
uncertainty-aware rl for player evaluation
Markov Game model
finite-horizon markov game model
evaluate players by how much their actions influence goal scoring
break games down into goal-scoring episodes
- begin after a goal
- terminates when next goal is scored
allows faster model convergence and more accurate evaluations
to alleviate partial observability
the state includes game history
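a minimal sketch (not the paper's data pipeline) of splitting a play-by-play event stream into goal-scoring episodes as described above; the dict-based event schema and the `is_goal` flag are illustrative assumptions

```python
from typing import Any


def split_goal_episodes(events: list[dict[str, Any]]) -> list[list[dict[str, Any]]]:
    """Split a play-by-play event stream into goal-scoring episodes.

    Each episode begins right after the previous goal (or at the first event)
    and terminates with the event that scores the next goal.
    """
    episodes: list[list[dict[str, Any]]] = []
    current: list[dict[str, Any]] = []
    for event in events:
        current.append(event)
        if event.get("is_goal", False):  # hypothetical flag marking a goal event
            episodes.append(current)
            current = []
    if current:  # trailing events with no closing goal (e.g. end of period/game)
        episodes.append(current)
    return episodes
```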
no guarantee that the training and testing distributions are the same
to deal with OoD
perform post-hoc calibration of predicted action values
by modelling epistemic uncertainty
since aleatoric uncertainty estimates can be corrupted by epistemic uncertainty
filter out OoD samples
modelling the uncertainty of action values
distributional RL for aleatoric uncertainty
learns the distribution of the random variable $Z^k(s_t, a_t)$
of future goals when a player on team $k$ performs action $a_t$ in state $s_t$
i.e. the sum of discounted rewards $Z^k(s_t, a_t) = \sum_{i=0}^{T-t} \gamma^i r^k_{t+i}$, where $r^k$ is team $k$'s goal-scoring reward and $\gamma$ the discount factor
following Quantile-Regression-DQN method
represent the distribution of $Z$ by a
uniform mixture of $N$ supporting quantiles $Z_\theta(s, a) = \frac{1}{N}\sum_{j=1}^{N} \delta_{\theta_j(s, a)}$
where $\theta_j(s, a)$ estimates the quantile at quantile level $\hat{\tau}_j = \frac{2j-1}{2N}$ and $\delta_{\theta_j(s, a)}$ denotes a Dirac distribution at $\theta_j(s, a)$
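a minimal PyTorch sketch of this QR-DQN-style representation: a head predicting $N$ supporting quantiles of $Z$ per action; layer sizes and input dimensions are illustrative assumptions, not the paper's architecture

```python
import torch
import torch.nn as nn


class QuantileHead(nn.Module):
    """Predicts N supporting quantiles of Z(s, a) for each action (QR-DQN style)."""

    def __init__(self, state_dim: int, num_actions: int, num_quantiles: int = 32):
        super().__init__()
        self.num_actions = num_actions
        self.num_quantiles = num_quantiles
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions * num_quantiles),
        )
        # fixed quantile levels tau_j = (2j - 1) / (2N), j = 1..N
        self.register_buffer(
            "tau", (2 * torch.arange(1, num_quantiles + 1) - 1) / (2.0 * num_quantiles)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # quantile estimates theta_j(s, a): (batch, num_actions, num_quantiles)
        return self.net(state).view(-1, self.num_actions, self.num_quantiles)
```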
distributional bellman operator
the stochastic process can be captured by a distributional Bellman operator $\mathcal{T}^\pi$: $\mathcal{T}^\pi Z^k(s_t, a_t) \stackrel{D}{=} r^k_t + \gamma Z^k(s_{t+1}, a_{t+1})$
where $\stackrel{D}{=}$ denotes equality in distribution
estimate supporting quantiles of Z by minimizing quantile Huber loss
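a sketch of that quantile Huber TD loss, reusing the `QuantileHead` sketch above; the reduction over quantile/target dimensions is simplified relative to QR-DQN

```python
import torch
import torch.nn.functional as F


def quantile_huber_loss(pred_quantiles: torch.Tensor,
                        target_samples: torch.Tensor,
                        tau: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """Quantile Huber loss between predicted quantiles and TD-target samples.

    pred_quantiles: (batch, N)  quantile estimates theta_j(s_t, a_t)
    target_samples: (batch, M)  samples of r_t + gamma * Z(s_{t+1}, a_{t+1})
    tau:            (N,)        quantile levels of the predictions
    """
    # pairwise TD errors u_ij = target_i - pred_j: (batch, M, N)
    u = target_samples.unsqueeze(-1) - pred_quantiles.unsqueeze(1)
    huber = F.huber_loss(pred_quantiles.unsqueeze(1).expand_as(u),
                         target_samples.unsqueeze(-1).expand_as(u),
                         reduction="none", delta=kappa)
    # asymmetric quantile weighting |tau_j - 1{u_ij < 0}|
    weight = torch.abs(tau.view(1, 1, -1) - (u.detach() < 0).float())
    return (weight * huber).mean()  # simplified reduction (mean over all dims)
```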
Proposition 1. Assume the Bellman consistency holds by $Z = R + \gamma P^\pi Z$, where $Z$ and $R$ are vector-valued random variables and $P^\pi$ is the transition matrix of the stationary policy $\pi$, so $Z = (I - \gamma P^\pi)^{-1} R$; the uncertainty of action-value distributions under an entropy measure can be given by $H(Z) = H(R) + \log\big|\det\big[(I - \gamma P^\pi)^{-1}\big]\big|$
where $(I - \gamma P^\pi)^{-1}$ is the matrix induced for distributions over state-action tuples by following policy $\pi$ and transition dynamics $P$.
proposition 1 disentangles the entropy of Z into
- entropy of reward variables that quantifies the uncertainty of current rewards
- the uncertainty induced by discount factor
- log-absolute determinant of induced distribution matrix
proposition 1 demonstrates that
key components for representing aleatoric uncertainty can be captured by $Z$ once Bellman consistency is reached by TD learning
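a small numerical sanity check (toy 2-state example with a made-up transition matrix, not data from the paper) of the linear-map entropy identity behind the reconstructed Proposition 1: for Gaussian reward variables, $H(Z) = H(R) + \log|\det[(I-\gamma P^\pi)^{-1}]|$

```python
import numpy as np

gamma = 0.9
# toy induced transition matrix P^pi over 2 state-action tuples (rows sum to 1)
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
A = np.linalg.inv(np.eye(2) - gamma * P_pi)  # Bellman consistency: Z = A R

# Gaussian reward vector R with covariance Sigma, so Z = A R has covariance A Sigma A^T
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])


def gaussian_entropy(cov: np.ndarray) -> float:
    n = cov.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** n) * np.linalg.det(cov))


H_R = gaussian_entropy(Sigma)
H_Z = gaussian_entropy(A @ Sigma @ A.T)
log_abs_det_A = np.log(abs(np.linalg.det(A)))

# entropy under an invertible linear map: H(Z) = H(R) + log|det A|
assert np.isclose(H_Z, H_R + log_abs_det_A)
print(H_Z, H_R + log_abs_det_A)
```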
but $Z$ cannot generalize to OoD samples since exploration is insufficient
need to estimate epistemic uncertainty
density estimator for epistemic uncertainty
feature-space conditional normalizing flow (FS-CNF)
-
feature extractor
to avoid feature collapse (mapping OoD inputs onto in-distribution (ID) features)
idea: subject the feature extractor to a bi-Lipschitz constraint
distance-sensitive (lower bound) & smooth (upper bound)
-
density estimator
masked auto-regressive flow (MAF)
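a minimal sketch of the FS-CNF idea under two stated assumptions: spectral-normalized residual layers as a practical surrogate for the bi-Lipschitz constraint (the paper may enforce it differently), and a single conditional affine layer standing in for the full masked auto-regressive flow; all sizes and names are illustrative

```python
import math

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class ResidualFeatureExtractor(nn.Module):
    """Residual MLP with spectral-normalized layers.

    Spectral norm upper-bounds the Lipschitz constant (smoothness), while the
    residual connections help keep the map distance-sensitive, a practical
    approximation of the bi-Lipschitz constraint noted above.
    """

    def __init__(self, in_dim: int, feat_dim: int = 64, depth: int = 3):
        super().__init__()
        self.proj = spectral_norm(nn.Linear(in_dim, feat_dim))
        self.blocks = nn.ModuleList(
            [spectral_norm(nn.Linear(feat_dim, feat_dim)) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)
        for block in self.blocks:
            h = h + torch.relu(block(h))  # residual step
        return h


class ConditionalAffineFlow(nn.Module):
    """Single conditional affine layer: a toy stand-in for a full MAF."""

    def __init__(self, feat_dim: int, context_dim: int):
        super().__init__()
        self.conditioner = nn.Sequential(
            nn.Linear(context_dim, 64), nn.ReLU(), nn.Linear(64, 2 * feat_dim)
        )

    def log_prob(self, feats: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        shift, log_scale = self.conditioner(context).chunk(2, dim=-1)
        z = (feats - shift) * torch.exp(-log_scale)  # inverse affine map
        base = -0.5 * (z ** 2).sum(-1) - 0.5 * feats.shape[-1] * math.log(2 * math.pi)
        return base - log_scale.sum(-1)  # + log|det dz/dx| from change of variables
```

the estimated log-density of an event's features then serves as the epistemic-uncertainty signal used by the metric in the next section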
player evaluation
risk-sensitive impact metric
high input density indicates low epistemic uncertainty
low input density indicates high epistemic uncertainty
propose RiGIM based on the sum of discounted rewards & the input density, conditioned on a confidence level c
- risk-averse with large c → better sensitivity to bad outcomes
- risk-seeking with small c → better sensitivity to positive outcomes
risk-seeking benefits offensive players while risk-averse benefits defensive players
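a heavily hedged sketch of how a risk-sensitive impact could be assembled from the pieces above: a value-at-risk-style pick among the supporting quantiles at confidence level c, plus the density-based OoD filter from the post-hoc calibration step; this illustrates the conditioning idea only, not the paper's exact RiGIM formula, and all names are hypothetical

```python
import torch


def risk_sensitive_impact(quantiles: torch.Tensor,
                          log_density: torch.Tensor,
                          c: float,
                          density_threshold: float) -> torch.Tensor:
    """Illustrative risk-sensitive impact per event (not the paper's exact formula).

    quantiles:    (batch, N) supporting quantiles of Z(s_t, a_t), ordered by level
    log_density:  (batch,)   FS-CNF log-density of the event's features
    c:            confidence level in (0, 1); large c reads a low quantile
                  (risk-averse), small c a high quantile (risk-seeking)
    """
    n = quantiles.shape[-1]
    idx = min(n - 1, int((1.0 - c) * n))  # value-at-risk style quantile pick
    risk_value = quantiles[:, idx]
    # post-hoc calibration: drop low-density (OoD, high epistemic uncertainty) events
    in_distribution = (log_density >= density_threshold).float()
    return risk_value * in_distribution


def rigim_per_player(event_impacts: torch.Tensor,
                     player_ids: torch.Tensor,
                     player: int) -> torch.Tensor:
    """Sum event-level impacts over all events where the given player acts."""
    return event_impacts[player_ids == player].sum()
```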