Truncating Temporal Differences: On the Efficient Implementation of TD(λ) for Reinforcement Learning

1995 ◽  
Vol 2 ◽  
pp. 287-318 ◽  
Author(s):  
P. Cichosz

Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-step prediction problems, parameterized by a recency factor λ. Currently the most important application of these methods is temporal credit assignment in reinforcement learning. Well-known reinforcement learning algorithms, such as AHC or Q-learning, may be viewed as instances of TD learning. This paper examines the issues of the efficient and general implementation of TD(λ) for arbitrary λ, for use with reinforcement learning algorithms that optimize the discounted sum of rewards. The traditional approach, based on eligibility traces, is argued to suffer from both inefficiency and lack of generality. The TTD (Truncated Temporal Differences) procedure is proposed as an alternative that only approximates TD(λ), but requires very little computation per action and can be used with arbitrary function representation methods. The idea from which it is derived is fairly simple and not new, but apparently unexplored so far. Encouraging experimental results are presented, suggesting that using λ > 0 with the TTD procedure allows one to obtain a significant learning speedup at essentially the same cost as standard TD(0) learning.
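
The core idea can be sketched in a few lines: instead of maintaining eligibility traces over all states, keep a short buffer of the last m transitions and update the state m steps in the past with a truncated λ-return. A minimal tabular sketch, not the paper's exact procedure (V is assumed to be a plain dict or array of state values):

```python
def ttd_update(V, buffer, gamma, lam, alpha):
    """Update the value of the state m steps back using a truncated lambda-return.

    buffer: the last m transitions as (state, reward, next_state) tuples,
    oldest first (e.g. a collections.deque(maxlen=m) kept by the agent loop).
    """
    # Start from a one-step target at the truncation horizon...
    s_m, r_m, s_next_m = buffer[-1]
    z = r_m + gamma * V[s_next_m]
    # ...and fold in the earlier transitions, mixing the one-step target and the
    # longer (truncated) return with weight lambda, as in the TD(lambda) recursion.
    for s, r, s_next in reversed(list(buffer)[:-1]):
        z = r + gamma * ((1.0 - lam) * V[s_next] + lam * z)
    s0 = buffer[0][0]
    V[s0] += alpha * (z - V[s0])
```

The loop costs O(m) per action regardless of the number of states, which is the efficiency argument sketched above; with λ = 0 it reduces to an ordinary TD(0) update applied m steps late.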

1999 ◽  
Vol 11 (8) ◽  
pp. 2017-2060 ◽  
Author(s):  
Csaba Szepesvári ◽  
Michael L. Littman

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
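
For concreteness, the two update styles the theorem relates can be written side by side. A hedged sketch with assumed numpy arrays (Q of shape states × actions; P and R are a known transition tensor and expected-reward table used only by the synchronous operator):

```python
import numpy as np

def synchronous_backup(Q, P, R, gamma):
    """Synchronous form: back up every (state, action) pair at once.

    P[s, a, t] are (assumed known) transition probabilities and R[s, a] expected
    rewards; this is the Bellman optimality operator applied to Q.
    """
    V = Q.max(axis=1)
    return R + gamma * np.einsum('sat,t->sa', P, V)

def asynchronous_q_update(Q, s, a, r, s_next, alpha, gamma):
    """Asynchronous form: classical Q-learning updates only the visited pair."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q
```

The point of the theorem, as stated above, is that convergence established for the simple synchronous operator can be carried over to the sampled, asynchronous update.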


2020 ◽  
Vol 8 (6) ◽  
pp. 4333-4338

This paper presents a thorough comparative analysis of various reinforcement learning algorithms used by autonomous mobile robots for optimal path finding, and we propose a new algorithm, called Iterative SARSA, for the same task. The main objective of the paper is to differentiate between Q-learning and SARSA, and to modify the latter. These algorithms use either the on-policy or the off-policy method of reinforcement learning. For the on-policy method we have used the SARSA algorithm, and for the off-policy method the Q-learning algorithm has been used. These algorithms also strongly affect whether the robot finds the shortest possible path. Based on the results obtained, we have concluded how our algorithm is better than the current standard reinforcement learning algorithms.
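
The on-policy/off-policy distinction the comparison rests on shows up only in the bootstrap target. A minimal tabular sketch (Q as a nested dict of state → action → value; the paper's Iterative SARSA modification is not reproduced here):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap from the greedy action in the next state.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action the behaviour policy actually takes next.
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
```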


2012 ◽  
Vol 22 ◽  
pp. 113-118 ◽  
Author(s):  
Víctor Ricardo Cruz-Álvarez ◽  
Enrique Hidalgo-Peña ◽  
Hector-Gabriel Acosta-Mesa

A common problem when working with mobile robots is that the programming phase can be a long, expensive, and laborious process for programmers. Reinforcement learning algorithms offer one of the most general frameworks for learning tasks. This work presents an approach that uses the Q-learning algorithm on a Lego robot so that it learns by itself how to follow a black line drawn on a white surface, using Matlab [5] as the programming environment.
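
The learning loop itself is the standard tabular one. The paper drives a real Lego robot from Matlab; the Python sketch below replaces the robot with a toy simulated line (state = discretized sensor offset from the line, actions = steering commands) purely to illustrate the update, so every name and number in it is a hypothetical stand-in:

```python
import random

# Hypothetical simulation of line following: the state is the discretized sensor
# offset from the line (-2..2); actions steer left (-1), straight (0), right (+1).
ACTIONS = {-1: "left", 0: "straight", 1: "right"}
Q = {(s, a): 0.0 for s in range(-2, 3) for a in ACTIONS}
alpha, gamma, epsilon = 0.2, 0.9, 0.1

def step(offset, action):
    drift = random.choice([-1, 0, 1])            # the line curves unpredictably
    new_offset = max(-2, min(2, offset + action + drift))
    reward = 1.0 if new_offset == 0 else -abs(new_offset)
    return new_offset, reward

s = 0
for _ in range(20_000):
    a = (random.choice(list(ACTIONS)) if random.random() < epsilon
         else max(ACTIONS, key=lambda x: Q[(s, x)]))
    s_next, r = step(s, a)
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    s = s_next
```

The greedy policy that emerges steers back toward offset 0, i.e. back onto the line.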


2019 ◽  
Vol 5 (1) ◽  
Author(s):  
Xiao-Ming Zhang ◽  
Zezhu Wei ◽  
Raza Asad ◽  
Xu-Chen Yang ◽  
Xin Wang

Abstract Reinforcement learning has been widely used in many problems, including quantum control of qubits. However, such problems can, at the same time, be solved by traditional, non-machine-learning methods, such as stochastic gradient descent and Krotov algorithms, and it remains unclear which one is most suitable when the control has specific constraints. In this work, we perform a comparative study on the efficacy of three reinforcement learning algorithms: tabular Q-learning, deep Q-learning, and policy gradient, as well as two non-machine-learning methods: stochastic gradient descent and Krotov algorithms, in the problem of preparing a desired quantum state. We found that overall, the deep Q-learning and policy gradient algorithms outperform the others when the problem is discretized, e.g. when only discrete control values are allowed, and when the problem scales up. The reinforcement learning algorithms can also adaptively reduce the complexity of the control sequences, shortening the operation time and improving the fidelity. Our comparison provides insights into the suitability of reinforcement learning in quantum control problems.
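
As a rough illustration of the discrete-control setting, the sketch below prepares a single-qubit target state with piecewise-constant pulses chosen from a discrete set, trained by an open-loop REINFORCE-style policy gradient. The Hamiltonian, pulse values, horizon, and reward here are assumptions made for the sketch, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
controls, dt, T = np.array([-4.0, 0.0, 4.0]), 0.1, 20   # assumed pulse set and horizon
psi0 = np.array([1.0, 0.0], dtype=complex)              # start in |0>
target = np.array([0.0, 1.0], dtype=complex)            # prepare |1>

def propagator(u):
    # Exact 2x2 matrix exponential exp(-i (sz + u*sx) dt) via eigendecomposition.
    w, V = np.linalg.eigh(sz + u * sx)
    return V @ np.diag(np.exp(-1j * w * dt)) @ V.conj().T

def rollout(theta):
    psi, grads = psi0, []
    for t in range(T):
        p = np.exp(theta[t] - theta[t].max()); p /= p.sum()   # softmax over pulses
        a = rng.choice(len(controls), p=p)
        grads.append((t, a, p))
        psi = propagator(controls[a]) @ psi
    return np.abs(np.vdot(target, psi)) ** 2, grads           # fidelity in [0, 1]

theta, lr, baseline = np.zeros((T, len(controls))), 0.5, 0.0
for episode in range(2000):
    R, grads = rollout(theta)
    baseline += 0.05 * (R - baseline)                          # running-average baseline
    for t, a, p in grads:
        g = -p; g[a] += 1.0                                    # grad of log softmax
        theta[t] += lr * (R - baseline) * g
```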


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Xiali Li ◽  
Zhengyu Lv ◽  
Licheng Wu ◽  
Yue Zhao ◽  
Xiaona Xu

In this study, hybrid state-action-reward-state-action (SARSA(λ)) and Q-learning algorithms are applied to different stages of an upper confidence bounds applied to trees (UCT) search for Tibetan Jiu chess. Q-learning is also used to update all the nodes on the search path when each game ends. A learning strategy is proposed that uses the SARSA(λ) and Q-learning algorithms combined with domain knowledge to build the feedback function for the layout and battle stages. An improved deep neural network based on ResNet18 is used for self-play training. Experimental results show that hybrid online and offline reinforcement learning with a deep neural network can improve the game program’s learning efficiency and its understanding of Tibetan Jiu chess.
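
A hedged, tabular reading of the two updates being hybridized (the chess state encoding, the UCT search itself, and the ResNet18 evaluation are omitted; the end-of-game backup is one plausible interpretation of the description above, not the paper's exact rule):

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """Online SARSA(lambda) step with accumulating eligibility traces."""
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    E[(s, a)] = E.get((s, a), 0.0) + 1.0
    for key in list(E):
        Q[key] = Q.get(key, 0.0) + alpha * delta * E[key]
        E[key] *= gamma * lam

def q_backup_search_path(Q, path, final_reward, alpha, gamma):
    """End-of-game backup of the final reward along the stored search path."""
    G = final_reward
    for (s, a) in reversed(path):
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (G - Q.get((s, a), 0.0))
        G *= gamma
```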


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Chunyuan Zhang ◽  
Qi Song ◽  
Zeng Meng

The deep Q-network (DQN) is one of the most successful reinforcement learning algorithms, but it has some drawbacks, such as slow convergence and instability. In contrast, traditional reinforcement learning algorithms with linear function approximation usually have faster convergence and better stability, although they easily suffer from the curse of dimensionality. In recent years, many improvements to DQN have been made, but they seldom make use of the advantages of the traditional algorithms to improve DQN. In this paper, we propose a novel Q-learning algorithm with linear function approximation, called minibatch recursive least squares Q-learning (MRLS-Q). Different from the traditional Q-learning algorithm with linear function approximation, the learning mechanism and model structure of MRLS-Q are more similar to those of a DQN with only one input layer and one linear output layer. It uses experience replay and minibatch training, and it uses the agent’s states rather than the agent’s state-action pairs as inputs. As a result, it can be used alone for low-dimensional problems and can also be seamlessly integrated into DQN as the last layer for high-dimensional problems. In addition, MRLS-Q uses our proposed average RLS optimization technique, so that it achieves better convergence performance whether it is used alone or integrated with DQN. At the end of this paper, we demonstrate the effectiveness of MRLS-Q on the CartPole problem and four Atari games and investigate the influence of its hyperparameters experimentally.
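
To make the model structure concrete, here is a sketch of the baseline that MRLS-Q refines: a linear Q-function over state features, trained with experience replay and minibatch semi-gradient updates. The paper's recursive-least-squares ("average RLS") update itself is not reproduced here:

```python
import numpy as np

class LinearQ:
    """Linear Q-function: Q(s, a) = W[a] . phi(s), one weight row per action."""

    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.99):
        self.W = np.zeros((n_actions, n_features))
        self.alpha, self.gamma = alpha, gamma

    def q_values(self, phi):
        return self.W @ phi

    def update(self, batch):
        """batch: list of (phi, a, r, phi_next, done) transitions from a replay buffer."""
        grad = np.zeros_like(self.W)
        for phi, a, r, phi_next, done in batch:
            target = r if done else r + self.gamma * (self.W @ phi_next).max()
            td_error = target - self.W[a] @ phi
            grad[a] += td_error * phi            # semi-gradient of 0.5 * td_error^2
        self.W += self.alpha * grad / len(batch)
```

In the DQN-integration case described above, phi would presumably be the activations of the network's last hidden layer rather than hand-crafted features.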


2012 ◽  
Vol 19 (Special) ◽  
pp. 31-36 ◽  
Author(s):  
Andrzej Rak ◽  
Witold Gierusz

ABSTRACT This paper presents the application of reinforcement learning algorithms to the task of autonomous determination of the ship trajectory during in-harbour and harbour-approach manoeuvres. The authors use the Markov decision process formalism to build up the background for the algorithm presentation. Two versions of RL algorithms were tested in simulations: a discrete form (Q-learning) and a continuous form (Least-Squares Policy Iteration). The results show that in both cases a ship trajectory can be found. However, the discrete Q-learning algorithm suffered from many limitations (mainly the curse of dimensionality) and is practically not applicable to the examined task. On the other hand, LSPI gave promising results. To be fully operational, the proposed solution should be extended to take ship heading and velocity into account and should be coupled with an advanced multi-variable controller.
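
The continuous-form algorithm mentioned above alternates policy evaluation and improvement, with the evaluation step solved in closed form. A minimal sketch of that LSTDQ step (feature construction for the ship-manoeuvring state is omitted; phi and policy are assumed callables):

```python
import numpy as np

def lstdq(samples, phi, policy, w_dim, gamma=0.99, reg=1e-6):
    """Solve for linear Q-weights given a fixed batch and the current greedy policy.

    samples: iterable of (s, a, r, s_next); phi(s, a) -> feature vector of size w_dim.
    """
    A = reg * np.eye(w_dim)
    b = np.zeros(w_dim)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))     # greedy action under the current weights
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)                 # w such that Q(s, a) = w . phi(s, a)
```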


2021 ◽  
Vol 2131 (3) ◽  
pp. 032103
Author(s):  
A P Badetskii ◽  
O A Medved

Abstract The article discusses the issues of choosing a route and a cargo-flow option in multimodal transport under modern conditions. Taking into account the active development of artificial intelligence and digital technologies in all types of production activity, it is proposed to use reinforcement learning algorithms to solve the problem. An analysis of the existing algorithms has been carried out, on the basis of which it was found that when choosing a route option for cargo in a multimodal connection it would be useful to have a qualitative assessment of terminal states. To obtain such an estimate, the Q-learning algorithm was applied, and it showed sufficient convergence and efficiency.
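
A toy sketch of how route choice can be cast as an episodic Q-learning problem: nodes are terminals, actions are (mode, next node) legs, and rewards combine negative leg costs with a bonus at the destination. The network, costs, and reward shaping below are invented for illustration and do not come from the article:

```python
import random

LEGS = {                                            # node -> available (mode, next, cost) legs
    "origin": [("rail", "port_A", -3.0), ("road", "port_B", -2.0)],
    "port_A": [("sea", "destination", -4.0)],
    "port_B": [("sea", "destination", -6.0)],
}
Q = {(n, i): 0.0 for n, legs in LEGS.items() for i in range(len(legs))}
alpha, gamma, epsilon = 0.1, 1.0, 0.2

for episode in range(5000):
    node = "origin"
    while node != "destination":
        legs = LEGS[node]
        i = (random.randrange(len(legs)) if random.random() < epsilon
             else max(range(len(legs)), key=lambda j: Q[(node, j)]))
        mode, nxt, cost = legs[i]
        bonus = 10.0 if nxt == "destination" else 0.0
        future = 0.0 if nxt == "destination" else max(
            Q[(nxt, j)] for j in range(len(LEGS[nxt])))
        Q[(node, i)] += alpha * (cost + bonus + gamma * future - Q[(node, i)])
        node = nxt
```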


Author(s):  
Alla Evseenko ◽  
Dmitrii Romannikov

Today, the branch of science known as "artificial intelligence" is booming worldwide. Systems built on artificial intelligence methods are able to perform functions that are traditionally considered the prerogative of humans. Artificial intelligence covers a wide range of research areas; one such area is machine learning. This article discusses algorithms from one approach to machine learning, reinforcement learning (RL), on which a great deal of research and development has been carried out over the past seven years. Development and research on this approach are mainly carried out on problems from Atari 2600 games or similar environments. In this article, reinforcement learning is applied to a dynamic object: an inverted pendulum. As a model of this object, we consider an inverted pendulum on a cart taken from the Gym library, which contains many environments used to test and analyze reinforcement learning algorithms. The article describes the implementation and study of two algorithms from this approach, Deep Q-learning and Double Deep Q-learning. As a result, training, testing, and training-time graphs for each algorithm are presented, on the basis of which it is concluded that the Double Deep Q-learning algorithm is preferable: its training time is approximately 2 minutes and it provides the best control of the inverted pendulum on a cart.
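
The difference between the two algorithms studied is confined to how the bootstrap target is built. A sketch in PyTorch (q_online and q_target are assumed network modules mapping a batch of CartPole states to per-action values; done is a 0/1 float tensor; nothing here reproduces the article's actual networks or hyperparameters):

```python
import torch

def dqn_target(q_target, r, s_next, done, gamma):
    # Deep Q-learning: both action selection and evaluation use the target network.
    with torch.no_grad():
        best = q_target(s_next).max(dim=1).values
    return r + gamma * (1.0 - done) * best

def double_dqn_target(q_online, q_target, r, s_next, done, gamma):
    # Double Deep Q-learning: select the action with the online network,
    # evaluate it with the target network, reducing overestimation bias.
    with torch.no_grad():
        a_star = q_online(s_next).argmax(dim=1, keepdim=True)
        best = q_target(s_next).gather(1, a_star).squeeze(1)
    return r + gamma * (1.0 - done) * best
```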


2021 ◽  
Author(s):  
Ricardo Salmon

It is shown that associative memory networks are capable of solving immediate and general reinforcement learning (RL) problems by combining techniques from associative neural networks and reinforcement learning, in particular Q-learning. The modified model is shown to outperform native RL techniques on a stochastic grid-world task by developing correct policies. In addition, we formulate an analogous method that adds feature extraction as dimensionality reduction, and eligibility traces as a further mechanism to help solve the credit assignment problem. Contrary to pure RL methods, the network is based on associative memory principles such as distribution of information, pattern completion, Hebbian learning, and noise tolerance (limit cycles, one-to-many associations, chaos, etc.). Because of this, it can be argued that the model possesses more cognitive explanatory power than other RL or hybrid models. It may be an effective tool for bridging the gap between biological memory models and computational memory models.

