Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms is comparable to, and may indeed be preferred over, existing least-squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
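
For orientation, the two-timescale update that the GTD family builds on can be sketched as below. This is a minimal sketch of the plain GTD2 update as it is commonly stated in the literature, in its on-policy form with linear features and no importance weights. It is not the accelerated GTD2-MP variant, whose details the abstract does not give. The function name `gtd2_step`, the helper `dot`, and the step sizes `alpha` and `beta` are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gtd2_step(theta, w, phi, reward, phi_next, gamma, alpha, beta):
    """One GTD2 update for a single transition (assumed on-policy, linear features).

    theta: primary weights, so the value estimate is dot(theta, phi)
    w:     auxiliary weights that track the expected TD error per feature
    """
    # TD error under the current value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Auxiliary prediction, computed before either weight vector changes.
    phi_w = dot(phi, w)
    # Primary update: step along (phi - gamma * phi_next), scaled by phi_w.
    new_theta = [t + alpha * (p - gamma * pn) * phi_w
                 for t, p, pn in zip(theta, phi, phi_next)]
    # Auxiliary update: least-mean-squares step toward the TD error.
    new_w = [wi + beta * (delta - phi_w) * p for wi, p in zip(w, phi)]
    return new_theta, new_w


# Hypothetical two-feature transition, applied twice.
theta, w = [0.0, 0.0], [0.0, 0.0]
for _ in range(2):
    theta, w = gtd2_step(theta, w, [1.0, 0.0], 1.0, [0.0, 1.0],
                         gamma=0.9, alpha=0.1, beta=0.5)
```

Each step costs time linear in the number of features, which is the "linear complexity" the abstract contrasts with least-squares TD methods.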

On the Asymptotic Equivalence Between Differential Hebbian and Temporal Difference Learning

Neural Computation ◽

10.1162/neco.2008.04-08-750 ◽

2009 ◽

Vol 21 (4) ◽

pp. 1173-1202 ◽

Cited By ~ 7

Author(s):

Christoph Kolodziejski ◽

Bernd Porr ◽

Florentin Wörgötter

Keyword(s):

Reinforcement Learning ◽

Hebbian Learning ◽

Mathematical Proof ◽

Temporal Difference ◽

Temporal Difference Learning ◽

Asymptotic Equivalence ◽

Theoretical Contribution ◽

Network Learning ◽

Learning Framework

In this theoretical contribution, we provide mathematical proof that two of the most important classes of network learning—correlation-based differential Hebbian learning and reward-based temporal difference learning—are asymptotically equivalent when timing the learning with a modulatory signal. This opens the opportunity to consistently reformulate most of the abstract reinforcement learning framework from a correlation-based perspective more closely related to the biophysics of neurons.

Choice-selective sequences dominate in cortical relative to thalamic inputs to nucleus accumbens, providing a potential substrate for credit assignment

10.1101/725382 ◽

2019 ◽

Cited By ~ 2

Author(s):

Nathan F. Parker ◽

Avinash Baidya ◽

Julia Cox ◽

Laura Haetzel ◽

Anna Zhukovskaya ◽

...

Keyword(s):

Reinforcement Learning ◽

Nucleus Accumbens ◽

Learning Task ◽

Temporal Difference ◽

Prelimbic Cortex ◽

Temporal Difference Learning ◽

Credit Assignment ◽

Cortical Inputs ◽

Selective Activity ◽

Potential Substrate

How are actions linked with subsequent outcomes to guide choices? The nucleus accumbens, which is implicated in this process, receives glutamatergic inputs from the prelimbic cortex and midline regions of the thalamus. However, little is known about what is represented in these input pathways. By comparing these inputs during a reinforcement learning task in mice, we discovered that prelimbic cortical inputs preferentially represent actions and choices, whereas midline thalamic inputs preferentially represent cues. Choice-selective activity in the prelimbic cortical inputs is organized in sequences that persist beyond the outcome. Through computational modeling, we demonstrate that these sequences can support the neural implementation of temporal difference learning, a powerful algorithm to connect actions and outcomes across time. Finally, we test and confirm predictions of our circuit model by direct manipulation of nucleus accumbens input neurons. Thus, we integrate experiment and modeling to suggest a neural solution for credit assignment.

Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5784 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3741-3748

Author(s):

Kristopher De Asis ◽

Alan Chan ◽

Silviu Pitis ◽

Richard Sutton ◽

Daniel Graves

Keyword(s):

Reinforcement Learning ◽

Function Approximation ◽

Value Function ◽

Temporal Difference ◽

Value Functions ◽

Difference Methods ◽

Td Methods ◽

The Stability ◽

The Value Function ◽

Temporal Difference Methods

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
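
The bootstrapping rule described in the abstract can be illustrated with a tabular sketch: the estimate for horizon h is moved toward the reward plus the discounted horizon h−1 estimate at the next state, and the horizon-0 estimate is fixed at zero. The function name `fixed_horizon_td_update`, the table layout `V[h][s]`, and the three-state cyclic toy problem are assumptions made for illustration, not taken from the paper.

```python
def fixed_horizon_td_update(V, s, reward, s_next, alpha, gamma):
    """Update every horizon's estimate at state s from one transition.

    V[h][s] estimates the discounted reward sum over the next h steps.
    V[0] is never written, so it stays at zero.
    """
    max_h = len(V) - 1
    # Go from the longest horizon down so each target reads the
    # horizon h-1 value as it was before this transition's updates.
    for h in range(max_h, 0, -1):
        target = reward + gamma * V[h - 1][s_next]
        V[h][s] += alpha * (target - V[h][s])


# Toy problem (assumed): three states in a cycle, reward 1 on every step.
n_states, max_h = 3, 3
V = [[0.0] * n_states for _ in range(max_h + 1)]
for _ in range(max_h):
    for s in range(n_states):
        fixed_horizon_td_update(V, s, 1.0, (s + 1) % n_states,
                                alpha=1.0, gamma=1.0)
```

With a reward of 1 per step and no discounting, the h-step value of every state is h, and after `max_h` sweeps the table matches that. No estimate ever appears in its own target, which is the property the abstract credits for stability.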

Derivatives of Logarithmic Stationary Distributions for Policy Gradient Reinforcement Learning

Neural Computation ◽

10.1162/neco.2009.12-08-922 ◽

2010 ◽

Vol 22 (2) ◽

pp. 342-376 ◽

Cited By ~ 2

Author(s):

Tetsuro Morimura ◽

Eiji Uchibe ◽

Junichiro Yoshimoto ◽

Jan Peters ◽

Kenji Doya

Keyword(s):

Reinforcement Learning ◽

Stationary State ◽

Temporal Difference ◽

Value Functions ◽

Average Reward ◽

Temporal Difference Learning ◽

State Distribution ◽

Learning Framework ◽

Policy Gradient ◽

Derivatives Of

Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the average reward gradient with respect to the policy parameter. That term involves the derivative of the stationary state distribution that corresponds to the sensitivity of its distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate γ for the value functions close to 1, these algorithms do not permit γ to be set exactly at γ = 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD) as a useful form of the derivative of the stationary state distribution through backward Markov chain formulation and a temporal difference learning framework. A new policy gradient (PG) framework with an LSD is also proposed, in which the average reward gradient can be estimated by setting γ = 0, so it becomes unnecessary to learn the value functions. We also test the performance of the proposed algorithms using simple benchmark tasks and show that these can improve the performances of existing PG methods.
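
The neglected term can be made explicit with a short derivation. The notation here is assumed for illustration: d_θ(s) is the stationary state distribution, π_θ(a|s) the policy, r(s,a) a reward that does not depend on θ, and η(θ) the average reward. Applying the product rule to the average reward gives two terms, the first of which is the log stationary distribution derivative that the abstract refers to:

```latex
\eta(\theta) = \sum_{s}\sum_{a} d_\theta(s)\,\pi_\theta(a \mid s)\,r(s,a)

\nabla_\theta \eta(\theta)
  = \sum_{s}\sum_{a} d_\theta(s)\,\pi_\theta(a \mid s)
    \bigl[\nabla_\theta \log d_\theta(s)
        + \nabla_\theta \log \pi_\theta(a \mid s)\bigr]\, r(s,a)
```

If an estimate of ∇ log d_θ(s) is available, both terms can be averaged directly over sampled state-action-reward triples, which is consistent with the abstract's remark that the value functions no longer need to be learned.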

The Successor Representation and Temporal Context

Neural Computation ◽

10.1162/neco_a_00282 ◽

2012 ◽

Vol 24 (6) ◽

pp. 1553-1568 ◽

Cited By ~ 43

Author(s):

Samuel J. Gershman ◽

Christopher D. Moore ◽

Michael T. Todd ◽

Kenneth A. Norman ◽

Per B. Sederberg

Keyword(s):

Reinforcement Learning ◽

Episodic Memory ◽

Learning Algorithm ◽

Learning Systems ◽

Temporal Context ◽

Temporal Difference ◽

Temporal Difference Learning ◽

Context Model ◽

Neural Processes ◽

Interesting Possibility

The successor representation was introduced into reinforcement learning by Dayan (1993) as a means of facilitating generalization between states with similar successors. Although reinforcement learning in general has been used extensively as a model of psychological and neural processes, the psychological validity of the successor representation has yet to be explored. An interesting possibility is that the successor representation can be used not only for reinforcement learning but for episodic learning as well. Our main contribution is to show that a variant of the temporal context model (TCM; Howard & Kahana, 2002), an influential model of episodic memory, can be understood as directly estimating the successor representation using the temporal difference learning algorithm (Sutton & Barto, 1998). This insight leads to a generalization of TCM and new experimental predictions. In addition to casting a new normative light on TCM, this equivalence suggests a previously unexplored point of contact between different learning systems.
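
As a rough illustration of estimating the successor representation with temporal difference learning, the tabular sketch below treats row s of a matrix M as the expected discounted future occupancy of every state when starting from s, counting the current visit. This is the standard tabular form and not the TCM variant analyzed in the paper. The function name `sr_td_update` and the two-state example are assumptions.

```python
def sr_td_update(M, s, s_next, alpha, gamma):
    """TD update of row s of the successor matrix after observing s -> s_next.

    M[s][j] estimates the expected discounted number of visits to state j
    when starting from state s (the current visit to s is counted).
    """
    for j in range(len(M)):
        indicator = 1.0 if j == s else 0.0
        # Target: occupancy of j right now plus discounted occupancy from s_next.
        target = indicator + gamma * M[s_next][j]
        M[s][j] += alpha * (target - M[s][j])


# Hypothetical two-state example: observe 0 -> 1, then 1 -> 0.
M = [[0.0, 0.0], [0.0, 0.0]]
sr_td_update(M, 0, 1, alpha=0.5, gamma=0.9)
sr_td_update(M, 1, 0, alpha=0.5, gamma=0.9)
```

The update has the same shape as a TD value update, with the one-hot state indicator in place of the scalar reward.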

Selective network discovery via deep reinforcement learning on embedded spaces

Applied Network Science ◽

10.1007/s41109-021-00365-8 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Peter Morales ◽

Rajmonda Sulo Caceres ◽

Tina Eliassi-Rad

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Sequential Decision ◽

Network Discovery ◽

Learning Tasks ◽

Partially Observed ◽

Decision Making Problem ◽

Resource Collection ◽

Improved Performance ◽

Discovery Algorithms

Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.
