Fast gradient-descent methods for temporal-difference learning with linear function approximation

Author(s): Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, ...

2021
Author(s): Jalaj Bhandari, Daniel Russo, Raghav Singal

Temporal-difference learning (TD) is a simple iterative algorithm widely used for policy evaluation in Markov reward processes. Bhandari et al. prove finite-time convergence rates for TD learning with linear function approximation. The analysis rests on a key insight that establishes a rigorous connection between TD updates and those of online gradient descent. In a model where observations are corrupted by i.i.d. noise, convergence results for TD follow by essentially mirroring the analysis for online gradient descent. Using an information-theoretic technique, the authors also provide results for the case in which TD is applied to a single Markovian data stream, where the algorithm's updates can be severely biased. Their analysis extends seamlessly to TD learning with eligibility traces and to Q-learning for high-dimensional optimal stopping problems.
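As a rough illustration of the algorithm the abstract describes, the sketch below runs TD(0) with linear function approximation on a synthetic Markov reward process. The feature map, transition matrix, rewards, step size, and discount factor are all placeholder assumptions for illustration, not taken from either paper; the update line shows why the iteration resembles online gradient descent.

```python
import numpy as np

# Minimal sketch of TD(0) with linear function approximation.
# Environment, features, and hyperparameters are illustrative assumptions.
rng = np.random.default_rng(0)
n_states, d = 5, 3
gamma, alpha = 0.9, 0.1

phi = rng.normal(size=(n_states, d))                  # fixed feature vector per state
P = rng.dirichlet(np.ones(n_states), size=n_states)   # Markov transition matrix
r = rng.normal(size=n_states)                         # expected reward per state

theta = np.zeros(d)                                   # linear value-function weights
s = 0
for t in range(10_000):
    s_next = rng.choice(n_states, p=P[s])
    # TD error, then a semi-gradient step that mirrors online gradient descent
    delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * delta * phi[s]
    s = s_next

print("learned weights:", theta)
```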

