Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling

2021 · Author(s): L. A. Prashanth, Nathaniel Korda, Rémi Munos

2021 · Author(s): Jalaj Bhandari, Daniel Russo, Raghav Singal

Temporal difference (TD) learning is a simple iterative algorithm widely used for policy evaluation in Markov reward processes. Bhandari et al. prove finite-time convergence rates for TD learning with linear function approximation. The analysis rests on a key insight that establishes a rigorous connection between TD updates and those of online gradient descent. In a model where observations are corrupted by i.i.d. noise, convergence results for TD follow by essentially mirroring the analysis for online gradient descent. Using an information-theoretic technique, the authors also provide results for the case where TD is applied to a single Markovian data stream, in which the algorithm's updates can be severely biased. The analysis extends seamlessly to TD learning with eligibility traces and to Q-learning for high-dimensional optimal stopping problems.
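To make the object of the analysis concrete, the sketch below shows the standard TD(0) update with linear function approximation: the value estimate is V(s) ≈ θᵀφ(s), and each step pairs a TD error with a gradient-like adjustment of θ, which is the connection to online gradient descent exploited in the paper. This is a minimal illustration, not the authors' code; the function name, the toy random-walk environment, and all parameters (step size, discount, feature map) are assumptions made for the example.

```python
import numpy as np

def td0_linear(feature_fn, transitions, alpha=0.05, gamma=0.95, dim=8):
    """Minimal TD(0) with linear function approximation (illustrative sketch).

    Value estimate: V(s) ~= theta^T phi(s).
    For each transition (s, r, s'), the TD error
        delta = r + gamma * V(s') - V(s)
    drives the gradient-like update theta <- theta + alpha * delta * phi(s).
    """
    theta = np.zeros(dim)
    for state, reward, next_state in transitions:
        phi, phi_next = feature_fn(state), feature_fn(next_state)
        delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
        theta += alpha * delta * phi                             # gradient-like step
    return theta

# Illustrative usage: a toy 1-D random walk with one-hot (tabular) features.
rng = np.random.default_rng(0)
n_states = 8
one_hot = lambda s: np.eye(n_states)[s]

transitions = []
s = n_states // 2
for _ in range(5000):
    s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    transitions.append((s, reward, s_next))
    # restart in the middle when either end is reached
    s = n_states // 2 if s_next in (0, n_states - 1) else s_next

theta = td0_linear(one_hot, transitions, dim=n_states)
print(theta)  # estimated state values, increasing toward the rewarding end
```

With one-hot features the same update reduces to tabular TD(0); the i.i.d.-noise and Markovian-sampling regimes discussed in the abstract differ only in how the transitions fed to such an update are generated.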

