Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Kristopher De Asis; Alan Chan; Silviu Pitis; Richard Sutton; Daniel Graves

doi:10.1609/aaai.v34i04.5784

Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol 34 (04), pp. 3741-3748

Keyword(s): Reinforcement Learning; Function Approximation; Value Function; Temporal Difference; Value Functions; Difference Methods; Td Methods; The Stability; The Value Function; Temporal Difference Methods

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
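The bootstrapping scheme described in this abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the function name `fixed_horizon_td`, the `(state, reward, next_state)` transition format, and the choice to update all horizons in parallel from each transition are assumptions made for illustration. The estimate for horizon h is updated toward the reward plus the estimate for horizon h−1 at the next state, and the horizon-0 estimate is fixed at zero, so no value function bootstraps from itself.

```python
import numpy as np

def fixed_horizon_td(transitions, n_states, horizon, alpha=0.1, gamma=1.0):
    """Tabular fixed-horizon TD prediction (illustrative sketch).

    V[h, s] estimates the expected (discounted) sum of the next h rewards
    starting from state s. V[0] stays at zero, and V[h] bootstraps only
    from V[h - 1].
    """
    V = np.zeros((horizon + 1, n_states))
    for s, r, s_next in transitions:
        # Targets for horizons 1..H use the current estimates for horizons 0..H-1.
        targets = r + gamma * V[:-1, s_next]
        # Update all horizons in parallel from the same transition.
        V[1:, s] += alpha * (targets - V[1:, s])
    return V
```

For example, on a hypothetical two-state loop where leaving state 0 yields reward 1 and leaving state 1 yields reward 0, calling `fixed_horizon_td([(0, 1.0, 1), (1, 0.0, 0)] * 10, n_states=2, horizon=3, alpha=1.0)` gives, in row h, the number of rewards collected over the next h steps from each state.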

Evolutionary Algorithms for Reinforcement Learning

Journal of Artificial Intelligence Research, 1999, Vol 11, pp. 241-276. doi:10.1613/jair.613. Cited By ~ 132

Author(s): D. E. Moriarty; A. C. Schultz; J. J. Grefenstette

Keyword(s): Reinforcement Learning; Evolutionary Algorithms; Value Function; Learning Problems; Temporal Difference; Genetic Operators; Credit Assignment; Learning Problem; Difference Methods; Temporal Difference Methods

There are two distinct approaches to solving reinforcement learning problems, namely, searching in value function space and searching in policy space. Temporal difference methods and evolutionary algorithms are well-known examples of these approaches. Kaelbling, Littman and Moore recently provided an informative survey of temporal difference methods. This article focuses on the application of evolutionary algorithms to the reinforcement learning problem, emphasizing alternative policy representations, credit assignment methods, and problem-specific genetic operators. Strengths and weaknesses of the evolutionary approach to reinforcement learning are presented, along with a survey of representative applications.

Improved Temporal Difference Methods with Linear Function Approximation

Handbook of Learning and Approximate Dynamic Programming, 2009. doi:10.1109/9780470544785.ch9

Keyword(s): Linear Function; Function Approximation; Temporal Difference; Linear Function Approximation; Difference Methods; Temporal Difference Methods

Comparing evolutionary and temporal difference methods in a reinforcement learning domain

Proceedings of the 8th annual conference on Genetic and evolutionary computation - GECCO '06, 2006. doi:10.1145/1143997.1144202. Cited By ~ 34

Author(s): Matthew E. Taylor; Shimon Whiteson; Peter Stone

Keyword(s): Reinforcement Learning; Temporal Difference; Difference Methods; Temporal Difference Methods

A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol 34 (04), pp. 3701-3708. doi:10.1609/aaai.v34i04.5779

Author(s): Gal Dalal; Balazs Szorenyi; Gugan Thoppe

Keyword(s): Reinforcement Learning; Convergence Rate; Policy Evaluation; Finite Time; High Probability; Temporal Difference; Time Analysis; Difference Methods; Temporal Difference Methods; Two Timescale Stochastic Approximation

Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, $\theta_n$ and $w_n$, which are updated using two distinct stepsize sequences, $\alpha_n$ and $\beta_n$, respectively. Assuming $\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ with $1 > \alpha > \beta > 0$, we show that, with high probability, the two iterates converge to their respective solutions $\theta^*$ and $w^*$ at rates given by $\|\theta_n - \theta^*\| = \tilde{O}(n^{-\alpha/2})$ and $\|w_n - w^*\| = \tilde{O}(n^{-\beta/2})$; here, $\tilde{O}$ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square summable ones.
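To make the two-iterate structure concrete, here is a minimal sketch of a single on-policy TDC update with linear features, using polynomially decaying stepsizes of the form given in the abstract. The function names `stepsizes` and `tdc_step`, the default exponents 0.75 and 0.5, and the discount 0.9 are illustrative assumptions, not values taken from the paper; the update rule is the standard TDC form, with importance-sampling corrections omitted.

```python
import numpy as np

def stepsizes(n, a=0.75, b=0.5):
    """Return (alpha_n, beta_n) = (n**-a, n**-b), with 1 > a > b > 0."""
    assert 1 > a > b > 0
    return float(n) ** -a, float(n) ** -b

def tdc_step(theta, w, phi, reward, phi_next, n, gamma=0.9, a=0.75, b=0.5):
    """One on-policy TDC update with linear function approximation.

    theta is the slow iterate (value weights, stepsize alpha_n);
    w is the fast auxiliary iterate (stepsize beta_n).
    """
    alpha_n, beta_n = stepsizes(n, a, b)
    # TD error under the current value weights.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # Auxiliary estimate of the expected TD error for the current features.
    correction = phi @ w
    # Both updates are computed from the pre-update theta and w.
    new_theta = theta + alpha_n * (delta * phi - gamma * correction * phi_next)
    new_w = w + beta_n * (delta - correction) * phi
    return new_theta, new_w
```

Because a > b, the stepsize for theta shrinks faster than the stepsize for w, which is what places the two iterates on different timescales.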

Integrating Temporal Difference Methods and Self-Organizing Neural Networks for Reinforcement Learning With Delayed Evaluative Feedback

IEEE Transactions on Neural Networks, 2008, Vol 19 (2), pp. 230-244. doi:10.1109/tnn.2007.905839. Cited By ~ 57

Author(s): Ah-Hwee Tan; Ning Lu; Dan Xiao

Keyword(s): Neural Networks; Reinforcement Learning; Temporal Difference; Evaluative Feedback; Difference Methods; Temporal Difference Methods; Self Organizing

Motor Cortex Encodes A Temporal Difference Reinforcement Learning Process

2018. doi:10.1101/257337. Cited By ~ 2

Author(s): Venkata S Aditya Tarigoppula; John S Choi; John P Hessburg; David B McNiel; Brandi T Marsh; ...

Keyword(s): Reinforcement Learning; Motor Cortex; Learning Process; Neural Activity; Prediction Error; Value Function; Temporal Difference; Reward Prediction Error; Reward Prediction; The Value Function

Abstract: Temporal difference reinforcement learning (TDRL) accurately models associative learning observed in animals, where they learn to associate outcome-predicting environmental states, termed conditioned stimuli (CS), with the value of outcomes, such as rewards, termed unconditioned stimuli (US). A component of TDRL is the value function, which captures the expected cumulative future reward from a given state. The value function can be modified by changes in the animal’s knowledge, such as by the predictability of its environment. Here we show that primary motor cortical (M1) neurodynamics reflect a TD learning process, encoding a state value function and reward prediction error in line with TDRL. M1 responds to the delivery of reward, and shifts its value-related response earlier in a trial, becoming predictive of an expected reward, when reward is predictable due to a CS. This is observed in tasks performed manually or observed passively, as well as in tasks without an explicit CS predicting reward, but simply with a predictable temporal structure, that is, a predictable environment. M1 also encodes the expected reward value associated with a set of CS in a multiple reward level CS-US task. Here we extend the Microstimulus TDRL model, reported to accurately capture RL-related dopaminergic activity, to account for M1 reward-related neural activity in a multitude of tasks.

Significance statement: There is a great deal of agreement between aspects of temporal difference reinforcement learning (TDRL) models and neural activity in dopaminergic brain centers. Dopamine is known to be necessary for sensorimotor learning-induced synaptic plasticity in the motor cortex (M1), and thus one might expect to see the hallmarks of TDRL in M1, which we show here in the form of a state value function and reward prediction error during. We see these hallmarks even when a conditioned stimulus is not available, but the environment is predictable, during manual tasks with agency, as well as observational tasks without agency. This information has implications for autonomously updating brain-machine interfaces, as we and others have proposed and published.

Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

Journal of Artificial Intelligence Research, 2000, Vol 13, pp. 227-303. doi:10.1613/jair.639. Cited By ~ 389

Author(s): T. G. Dietterich

Keyword(s): Reinforcement Learning; Optimal Policy; Value Function; Learning Algorithm; Value Functions; Procedural Semantics; Hierarchical Reinforcement Learning; Model Free; Function Decomposition; The Value Function

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics (as a subroutine hierarchy) and a declarative semantics (as a representation of the value function of a hierarchical policy). MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
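The additive decomposition mentioned in this abstract is commonly written as the recursion below. The notation is an assumption based on the usual presentation of MAXQ, not something quoted from this listing: $i$ is a subtask, $a$ is a child action or subtask invoked by $i$, $\pi_i$ is the policy of subtask $i$, and $C^{\pi}(i, s, a)$ is the "completion" value, the expected return for finishing $i$ after $a$ terminates.

```latex
\begin{align}
Q^{\pi}(i, s, a) &= V^{\pi}(a, s) + C^{\pi}(i, s, a), \\
V^{\pi}(i, s) &=
\begin{cases}
Q^{\pi}\bigl(i, s, \pi_i(s)\bigr) & \text{if } i \text{ is a composite subtask}, \\
\sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is a primitive action}.
\end{cases}
\end{align}
```

Unrolling the recursion from the root subtask down to a primitive action expresses the root value as a sum of one primitive expected reward plus one completion term per level of the hierarchy.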

Conditions on Features for Temporal Difference-Like Methods to Converge

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019. doi:10.24963/ijcai.2019/357

Author(s): Marcus Hutter; Samuel Yang-Zhao; Sultan Javed Majeed

Keyword(s): Reinforcement Learning; Function Approximation; Bellman Equation; Complete Characterization; Value Functions; Approximation Space; State Aggregation; Linear Function Approximation; The Value Function

The convergence of many reinforcement learning (RL) algorithms with linear function approximation has been investigated extensively but most proofs assume that these methods converge to a unique solution. In this paper, we provide a complete characterization of non-uniqueness issues for a large class of reinforcement learning algorithms, simultaneously unifying many counter-examples to convergence in a theoretical framework. We achieve this by proving a new condition on features that can determine whether the convergence assumptions are valid or non-uniqueness holds. We consider a general class of RL methods, which we call natural algorithms, whose solutions are characterized as the fixed point of a projected Bellman equation. Our main result proves that natural algorithms converge to the correct solution if and only if all the value functions in the approximation space satisfy a certain shape. This implies that natural algorithms are, in general, inherently prone to converge to the wrong solution for most feature choices even if the value function can be represented exactly. Given our results, we show that state aggregation-based features are a safe choice for natural algorithms and also provide a condition for finding convergent algorithms under other feature constructions.

Solving flow-shop scheduling problem with a reinforcement learning algorithm that generalizes the value function with neural network

Alexandria Engineering Journal, 2021, Vol 60 (3), pp. 2787-2800. doi:10.1016/j.aej.2021.01.030

Author(s): Jianfeng Ren; Chunming Ye; Feng Yang

Keyword(s): Neural Network; Reinforcement Learning; Value Function; Flow Shop; Learning Algorithm; Flow Shop Scheduling; Scheduling Problem; Shop Scheduling; The Value Function; Reinforcement Learning Algorithm

Glucose level control using Temporal Difference methods

2017 Iranian Conference on Electrical Engineering (ICEE), 2017. doi:10.1109/iraniancee.2017.7985166. Cited By ~ 1

Author(s): Amin Noori; Mohammad Ali Sadrnia; Mohammad bagher Naghibi Sistani

Keyword(s): Glucose Level; Temporal Difference; Level Control; Difference Methods; Temporal Difference Methods
