Uncertainty-Aware Reinforcement Learning for Risk-Sensitive Player Evaluation in Sports Game
An uncertainty-aware Reinforcement Learning (RL) framework to learn a risk-sensitive player evaluation metric from stochastic game dynamics.
G Liu, Y Luo, O Schulte, P Poupart
abstract
previous methods
measure only the expected impact of actions on goal scoring
w/out the risk induced by stochastic game dynamics
this paper
learns a risk-sensitive player evaluation metric from stochastic game dynamics
-
aleatoric uncertainty
intrinsic stochasticity in a sports game
-
epistemic uncertainty
model’s insufficient knowledge regarding Out-of-Distribution (OoD) state-action pairs
Risk-sensitive Game Impact Metric (RiGIM)
measure performance by conditioning on a specific confidence level
introduction
previous metrics cannot differentiate risk-seeking & risk-averse plays
want to:
model distributions of action values
⇒ use distributional Reinforcement Learning
to predict the supporting quantiles
previous dRL
deterministic transitions (Atari etc)
sport ⇒ stochastic game dynamics & complex context features
⇒ fixed dataset w/out exploration
main idea of the framework: evaluate two types of uncertainty
-
Aleatoric uncertainty
the intrinsic stochasticity of game dynamics
caused by stochastic rewards, transition dynamics, and policies
can be captured by
distributional Bellman operator
by implementing
Temporal-Difference (TD) learning
-
Epistemic uncertainty
due to finite training samples and OoD state-action pairs
could be overcome given
sufficient exploration in the environment
modelled with
Feature-Space Conditional Normalizing Flow (FS-CNF)
contribution
-
uncertainty-aware RL that enables post-hoc calibration
according to aleatoric and epistemic uncertainties
-
distributional Bellman operator captures
aleatoric uncertainty with action-value distributions
- feature-space density estimator that estimates the epistemic uncertainty
- RiGIM is the first risk-sensitive metric for player evaluation
uncertainty-aware rl for player evaluation
Markov Game model
finite-horizon markov game model
evaluate players by how much their actions influence goal scoring
break games down into goal-scoring episodes
- begin after a goal
- terminates when next goal is scored
allows faster model convergence and more accurate evaluations
to alleviate partial observability
the state includes game history
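a minimal sketch (not the paper's data pipeline) of splitting a play-by-play event stream into goal-scoring episodes as described above; the dict-based event schema and the `is_goal` flag are illustrative assumptions

```python
from typing import Any


def split_goal_episodes(events: list[dict[str, Any]]) -> list[list[dict[str, Any]]]:
    """Split a play-by-play event stream into goal-scoring episodes.

    Each episode begins right after the previous goal (or at the first event)
    and terminates with the event that scores the next goal.
    """
    episodes: list[list[dict[str, Any]]] = []
    current: list[dict[str, Any]] = []
    for event in events:
        current.append(event)
        if event.get("is_goal", False):  # hypothetical flag marking a goal event
            episodes.append(current)
            current = []
    if current:  # trailing events with no closing goal (e.g. end of period/game)
        episodes.append(current)
    return episodes
```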
no guarantee that the training and testing distributions are the same
to deal with OoD
perform post-hoc calibration of predicted action values
by modelling epistemic uncertainty
since aleatoric uncertainty estimates can be corrupted by epistemic uncertainty
filter out OoD samples
modelling the uncertainty of action values
distributional RL for aleatoric uncertainty
learns the distribution of the random variable $Z^k(s_t, a_t)$
of future goals when a player on team $k$ performs action $a_t$ in state $s_t$
i.e. the sum of discounted rewards $Z^k(s_t, a_t) = \sum_{i=0}^{T-t} \gamma^i r^k_{t+i}$, where $r^k$ is team $k$'s goal-scoring reward and $\gamma$ the discount factor
following Quantile-Regression-DQN method
represent the distribution of $Z$ by a
uniform mixture of $N$ supporting quantiles $Z_\theta(s, a) = \frac{1}{N}\sum_{j=1}^{N} \delta_{\theta_j(s, a)}$
where $\theta_j(s, a)$ estimates the quantile at quantile level $\hat{\tau}_j = \frac{2j-1}{2N}$ and $\delta_{\theta_j(s, a)}$ denotes a Dirac distribution at $\theta_j(s, a)$
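a minimal PyTorch sketch of this QR-DQN-style representation: a head predicting $N$ supporting quantiles of $Z$ per action; layer sizes and input dimensions are illustrative assumptions, not the paper's architecture

```python
import torch
import torch.nn as nn


class QuantileHead(nn.Module):
    """Predicts N supporting quantiles of Z(s, a) for each action (QR-DQN style)."""

    def __init__(self, state_dim: int, num_actions: int, num_quantiles: int = 32):
        super().__init__()
        self.num_actions = num_actions
        self.num_quantiles = num_quantiles
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions * num_quantiles),
        )
        # fixed quantile levels tau_j = (2j - 1) / (2N), j = 1..N
        self.register_buffer(
            "tau", (2 * torch.arange(1, num_quantiles + 1) - 1) / (2.0 * num_quantiles)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # quantile estimates theta_j(s, a): (batch, num_actions, num_quantiles)
        return self.net(state).view(-1, self.num_actions, self.num_quantiles)
```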
distributional bellman operator
the stochastic process can be captured by a distributional Bellman operator $\mathcal{T}^\pi$: $\mathcal{T}^\pi Z^k(s_t, a_t) \stackrel{D}{=} r^k_t + \gamma Z^k(s_{t+1}, a_{t+1})$
where $\stackrel{D}{=}$ denotes equality in distribution
estimate supporting quantiles of Z by minimizing quantile Huber loss
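a sketch of that quantile Huber TD loss, reusing the `QuantileHead` sketch above; the reduction over quantile/target dimensions is simplified relative to QR-DQN

```python
import torch
import torch.nn.functional as F


def quantile_huber_loss(pred_quantiles: torch.Tensor,
                        target_samples: torch.Tensor,
                        tau: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """Quantile Huber loss between predicted quantiles and TD-target samples.

    pred_quantiles: (batch, N)  quantile estimates theta_j(s_t, a_t)
    target_samples: (batch, M)  samples of r_t + gamma * Z(s_{t+1}, a_{t+1})
    tau:            (N,)        quantile levels of the predictions
    """
    # pairwise TD errors u_ij = target_i - pred_j: (batch, M, N)
    u = target_samples.unsqueeze(-1) - pred_quantiles.unsqueeze(1)
    huber = F.huber_loss(pred_quantiles.unsqueeze(1).expand_as(u),
                         target_samples.unsqueeze(-1).expand_as(u),
                         reduction="none", delta=kappa)
    # asymmetric quantile weighting |tau_j - 1{u_ij < 0}|
    weight = torch.abs(tau.view(1, 1, -1) - (u.detach() < 0).float())
    return (weight * huber).mean()  # simplified reduction (mean over all dims)
```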
Proposition 1. Assume the Bellman consistency holds by $Z = R + \gamma P^\pi Z$, where $Z$ and $R$ are vector-valued random variables and $P^\pi$ is the transition matrix of the stationary policy $\pi$, so $Z = (I - \gamma P^\pi)^{-1} R$; the uncertainty of action-value distributions under an entropy measure can be given by $H(Z) = H(R) + \log\big|\det\big[(I - \gamma P^\pi)^{-1}\big]\big|$
where $(I - \gamma P^\pi)^{-1}$ is the matrix induced for distributions over state-action tuples by following policy $\pi$ and transition dynamics $P$.
proposition 1 disentangles the entropy of Z into
- entropy of reward variables that quantifies the uncertainty of current rewards
- the uncertainty induced by discount factor
- log-absolute determinant of induced distribution matrix
proposition 1 demonstrates that
key components for representing aleatoric uncertainty can be captured by $Z$ once Bellman consistency is reached by TD learning
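a small numerical sanity check (toy 2-state example with a made-up transition matrix, not data from the paper) of the linear-map entropy identity behind the reconstructed Proposition 1: for Gaussian reward variables, $H(Z) = H(R) + \log|\det[(I-\gamma P^\pi)^{-1}]|$

```python
import numpy as np

gamma = 0.9
# toy induced transition matrix P^pi over 2 state-action tuples (rows sum to 1)
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
A = np.linalg.inv(np.eye(2) - gamma * P_pi)  # Bellman consistency: Z = A R

# Gaussian reward vector R with covariance Sigma, so Z = A R has covariance A Sigma A^T
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])


def gaussian_entropy(cov: np.ndarray) -> float:
    n = cov.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** n) * np.linalg.det(cov))


H_R = gaussian_entropy(Sigma)
H_Z = gaussian_entropy(A @ Sigma @ A.T)
log_abs_det_A = np.log(abs(np.linalg.det(A)))

# entropy under an invertible linear map: H(Z) = H(R) + log|det A|
assert np.isclose(H_Z, H_R + log_abs_det_A)
print(H_Z, H_R + log_abs_det_A)
```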
but $Z$ cannot generalize to OoD samples since exploration is insufficient
need to estimate epistemic uncertainty
density estimator for epistemic uncertainty
feature-space conditional normalizing flow (FS-CNF)
-
feature extractor
to avoid feature collapse (mapping OoD inputs onto in-distribution (ID) features)
idea: subject the feature extractor to a bi-Lipschitz constraint
distance-sensitive (lower bound) & smooth (upper bound)
-
density estimator
masked auto-regressive flow (MAF)
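a minimal sketch of the FS-CNF idea under two stated assumptions: spectral-normalized residual layers as a practical surrogate for the bi-Lipschitz constraint (the paper may enforce it differently), and a single conditional affine layer standing in for the full masked auto-regressive flow; all sizes and names are illustrative

```python
import math

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class ResidualFeatureExtractor(nn.Module):
    """Residual MLP with spectral-normalized layers.

    Spectral norm upper-bounds the Lipschitz constant (smoothness), while the
    residual connections help keep the map distance-sensitive, a practical
    approximation of the bi-Lipschitz constraint noted above.
    """

    def __init__(self, in_dim: int, feat_dim: int = 64, depth: int = 3):
        super().__init__()
        self.proj = spectral_norm(nn.Linear(in_dim, feat_dim))
        self.blocks = nn.ModuleList(
            [spectral_norm(nn.Linear(feat_dim, feat_dim)) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)
        for block in self.blocks:
            h = h + torch.relu(block(h))  # residual step
        return h


class ConditionalAffineFlow(nn.Module):
    """Single conditional affine layer: a toy stand-in for a full MAF."""

    def __init__(self, feat_dim: int, context_dim: int):
        super().__init__()
        self.conditioner = nn.Sequential(
            nn.Linear(context_dim, 64), nn.ReLU(), nn.Linear(64, 2 * feat_dim)
        )

    def log_prob(self, feats: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        shift, log_scale = self.conditioner(context).chunk(2, dim=-1)
        z = (feats - shift) * torch.exp(-log_scale)  # inverse affine map
        base = -0.5 * (z ** 2).sum(-1) - 0.5 * feats.shape[-1] * math.log(2 * math.pi)
        return base - log_scale.sum(-1)  # + log|det dz/dx| from change of variables
```

the estimated log-density of an event's features then serves as the epistemic-uncertainty signal used by the metric in the next section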
player evaluation
risk-sensitive impact metric
high input density indicates low epistemic uncertainty
low input density indicates high epistemic uncertainty
propose RiGIM based on the sum of discounted rewards & the input density, conditioned on a confidence level c
- risk-averse with large c → better sensitivity to bad outcomes
- risk-seeking with small c → better sensitivity to positive outcomes
risk-seeking benefits offensive players while risk-averse benefits defensive players
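a heavily hedged sketch of how a risk-sensitive impact could be assembled from the pieces above: a value-at-risk-style pick among the supporting quantiles at confidence level c, plus the density-based OoD filter from the post-hoc calibration step; this illustrates the conditioning idea only, not the paper's exact RiGIM formula, and all names are hypothetical

```python
import torch


def risk_sensitive_impact(quantiles: torch.Tensor,
                          log_density: torch.Tensor,
                          c: float,
                          density_threshold: float) -> torch.Tensor:
    """Illustrative risk-sensitive impact per event (not the paper's exact formula).

    quantiles:    (batch, N) supporting quantiles of Z(s_t, a_t), ordered by level
    log_density:  (batch,)   FS-CNF log-density of the event's features
    c:            confidence level in (0, 1); large c reads a low quantile
                  (risk-averse), small c a high quantile (risk-seeking)
    """
    n = quantiles.shape[-1]
    idx = min(n - 1, int((1.0 - c) * n))  # value-at-risk style quantile pick
    risk_value = quantiles[:, idx]
    # post-hoc calibration: drop low-density (OoD, high epistemic uncertainty) events
    in_distribution = (log_density >= density_threshold).float()
    return risk_value * in_distribution


def rigim_per_player(event_impacts: torch.Tensor,
                     player_ids: torch.Tensor,
                     player: int) -> torch.Tensor:
    """Sum event-level impacts over all events where the given player acts."""
    return event_impacts[player_ids == player].sum()
```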