Criticality-based Varying Step-number Algorithm for Reinforcement Learning

In the context of reinforcement learning we introduce the concept of criticality of a state, which indicates the extent to which the choice of action in that particular state influences the expected return. That is, a state in which the choice of action is more likely to influence the final outcome is considered as more critical than a state in which it is less likely to influence the final outcome. We formulate a criticality-based varying step number algorithm (CVS) — a flexible step number algorithm that utilizes the criticality function provided by a human, or learned directly from the environment. We test it in three different domains including the Atari Pong environment, Road-Tree environment, and Shooter environment. We demonstrate that CVS is able to outperform popular learning algorithms such as Deep Q-Learning and Monte Carlo.

Download Full-text

Deep Reinforcement Learning by Balancing Offline Monte Carlo and Online Temporal Difference Use Based on Environment Experiences

Symmetry ◽

10.3390/sym12101685 ◽

2020 ◽

Vol 12 (10) ◽

pp. 1685 ◽

Cited By ~ 1

Author(s):

Chayoung Kim

Keyword(s):

Monte Carlo ◽

Reinforcement Learning ◽

Real Time ◽

Temporal Difference ◽

Q Learning ◽

State Action ◽

Proposed Model ◽

Reward Functions ◽

And Performance ◽

The Internet Of Things

Owing to the complexity involved in training an agent in a real-time environment, e.g., using the Internet of Things (IoT), reinforcement learning (RL) using a deep neural network, i.e., deep reinforcement learning (DRL) has been widely adopted on an online basis without prior knowledge and complicated reward functions. DRL can handle a symmetrical balance between bias and variance—this indicates that the RL agents are competently trained in real-world applications. The approach of the proposed model considers the combinations of basic RL algorithms with online and offline use based on the empirical balances of bias–variance. Therefore, we exploited the balance between the offline Monte Carlo (MC) technique and online temporal difference (TD) with on-policy (state-action–reward-state-action, Sarsa) and an off-policy (Q-learning) in terms of a DRL. The proposed balance of MC (offline) and TD (online) use, which is simple and applicable without a well-designed reward, is suitable for real-time online learning. We demonstrated that, for a simple control task, the balance between online and offline use without an on- and off-policy shows satisfactory results. However, in complex tasks, the results clearly indicate the effectiveness of the combined method in improving the convergence speed and performance in a deep Q-network.

Download Full-text

A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms

Neural Computation ◽

10.1162/089976699300016070 ◽

1999 ◽

Vol 11 (8) ◽

pp. 2017-2060 ◽

Cited By ~ 70

Author(s):

Csaba Szepesvári ◽

Michael L. Littman

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Learning Algorithms ◽

Sequential Decision ◽

Q Learning ◽

Markov Games ◽

Optimal Behavior ◽

Risk Sensitive ◽

Optimal Value

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

Download Full-text

Truncating Temporal Differences: On the Efficient Implementation of TD(lambda) for Reinforcement Learning

Journal of Artificial Intelligence Research ◽

10.1613/jair.135 ◽

1995 ◽

Vol 2 ◽

pp. 287-318 ◽

Cited By ~ 22

Author(s):

P. Cichosz

Keyword(s):

Reinforcement Learning ◽

Traditional Approach ◽

Learning Algorithms ◽

Important Application ◽

Credit Assignment ◽

Function Representation ◽

Temporal Differences ◽

Q Learning ◽

Td Methods ◽

Prediction Problems

Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-step prediction problems, parameterized by a recency factor lambda. Currently the most important application of these methods is to temporal credit assignment in reinforcement learning. Well known reinforcement learning algorithms, such as AHC or Q-learning, may be viewed as instances of TD learning. This paper examines the issues of the efficient and general implementation of TD(lambda) for arbitrary lambda, for use with reinforcement learning algorithms optimizing the discounted sum of rewards. The traditional approach, based on eligibility traces, is argued to suffer from both inefficiency and lack of generality. The TTD (Truncated Temporal Differences) procedure is proposed as an alternative, that indeed only approximates TD(lambda), but requires very little computation per action and can be used with arbitrary function representation methods. The idea from which it is derived is fairly simple and not new, but probably unexplored so far. Encouraging experimental results are presented, suggesting that using lambda > 0 with the TTD procedure allows one to obtain a significant learning speedup at essentially the same cost as usual TD(0) learning.

Download Full-text

Iterative SARSA: The Modified SARSA Algorithm for Finding the Optimal Path

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f9429.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 4333-4338

Keyword(s):

Reinforcement Learning ◽

Comparative Analysis ◽

Mobile Robots ◽

Learning Algorithm ◽

Learning Algorithms ◽

Optimal Path ◽

Autonomous Mobile Robots ◽

Current Standard ◽

Q Learning ◽

Better Than

This paper presents a thorough comparative analysis of various reinforcement learning algorithms used by autonomous mobile robots for optimal path finding and, we propose a new algorithm called Iterative SARSA for the same. The main objective of the paper is to differentiate between the Q-learning and SARSA, and modify the latter. These algorithms use either the on-policy or off-policy methods of reinforcement learning. For the on-policy method, we have used the SARSA algorithm and for the off-policy method, the Q-learning algorithm has been used. These algorithms also have an impacting effect on finding the shortest path possible for the robot. Based on the results obtained, we have concluded how our algorithm is better than the current standard reinforcement learning algorithms

Download Full-text

A line follower robot implementation using Lego's Mindstorms Kit and Q-Learning

Acta Universitaria ◽

10.15174/au.2012.350 ◽

2012 ◽

Vol 22 ◽

pp. 113-118 ◽

Cited By ~ 1

Author(s):

Víctor Ricardo Cruz-Álvarez ◽

Enrique Hidalgo-Peña ◽

Hector-Gabriel Acosta-Mesa

Keyword(s):

Reinforcement Learning ◽

Mobile Robots ◽

Learning Algorithm ◽

Learning Algorithms ◽

Programming Environment ◽

Q Learning

A common problem working with mobile robots is that programming phase could be a long, expensive and heavy process for programmers. The reinforcement learning algorithms offer one of the most general frameworks in learning subjects. This work presents an approach using the Q-Learning algorithm on a Lego robot in order for it to learn by itself how to follow a blackline drawn down on a white surface, using Matlab [5] as programming environment.

Download Full-text

When does reinforcement learning stand out in quantum control? A comparative study on state preparation

npj Quantum Information ◽

10.1038/s41534-019-0201-8 ◽

2019 ◽

Vol 5 (1) ◽

Cited By ~ 9

Author(s):

Xiao-Ming Zhang ◽

Zezhu Wei ◽

Raza Asad ◽

Xu-Chen Yang ◽

Xin Wang

Keyword(s):

Machine Learning ◽

Reinforcement Learning ◽

Comparative Study ◽

Gradient Descent ◽

Quantum Control ◽

Learning Algorithms ◽

Stochastic Gradient Descent ◽

Q Learning ◽

Machine Learning Methods ◽

Policy Gradient

Abstract Reinforcement learning has been widely used in many problems, including quantum control of qubits. However, such problems can, at the same time, be solved by traditional, non-machine-learning methods, such as stochastic gradient descent and Krotov algorithms, and it remains unclear which one is most suitable when the control has specific constraints. In this work, we perform a comparative study on the efficacy of three reinforcement learning algorithms: tabular Q-learning, deep Q-learning, and policy gradient, as well as two non-machine-learning methods: stochastic gradient descent and Krotov algorithms, in the problem of preparing a desired quantum state. We found that overall, the deep Q-learning and policy gradient algorithms outperform others when the problem is discretized, e.g. allowing discrete values of control, and when the problem scales up. The reinforcement learning algorithms can also adaptively reduce the complexity of the control sequences, shortening the operation time and improving the fidelity. Our comparison provides insights into the suitability of reinforcement learning in quantum control problems.

Download Full-text

Hybrid Online and Offline Reinforcement Learning for Tibetan Jiu Chess

Complexity ◽

10.1155/2020/4708075 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Xiali Li ◽

Zhengyu Lv ◽

Licheng Wu ◽

Yue Zhao ◽

Xiaona Xu

Keyword(s):

Neural Network ◽

Reinforcement Learning ◽

Domain Knowledge ◽

Deep Neural Network ◽

Learning Strategy ◽

Learning Algorithms ◽

Hybrid State ◽

Q Learning ◽

State Action ◽

Upper Confidence Bound

In this study, hybrid state-action-reward-state-action (SARSAλ) and Q-learning algorithms are applied to different stages of an upper confidence bound applied to tree search for Tibetan Jiu chess. Q-learning is also used to update all the nodes on the search path when each game ends. A learning strategy that uses SARSAλ and Q-learning algorithms combining domain knowledge for a feedback function for layout and battle stages is proposed. An improved deep neural network based on ResNet18 is used for self-play training. Experimental results show that hybrid online and offline reinforcement learning with a deep neural network can improve the game program’s learning efficiency and understanding ability for Tibetan Jiu chess.

Download Full-text

Minibatch Recursive Least Squares Q-Learning

Computational Intelligence and Neuroscience ◽

10.1155/2021/5370281 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Chunyuan Zhang ◽

Qi Song ◽

Zeng Meng

Keyword(s):

Reinforcement Learning ◽

Least Squares ◽

Linear Function ◽

Function Approximation ◽

Learning Algorithm ◽

Learning Algorithms ◽

Optimization Technique ◽

Recursive Least Squares ◽

Q Learning ◽

Linear Function Approximation

The deep Q-network (DQN) is one of the most successful reinforcement learning algorithms, but it has some drawbacks such as slow convergence and instability. In contrast, the traditional reinforcement learning algorithms with linear function approximation usually have faster convergence and better stability, although they easily suffer from the curse of dimensionality. In recent years, many improvements to DQN have been made, but they seldom make use of the advantage of traditional algorithms to improve DQN. In this paper, we propose a novel Q-learning algorithm with linear function approximation, called the minibatch recursive least squares Q-learning (MRLS-Q). Different from the traditional Q-learning algorithm with linear function approximation, the learning mechanism and model structure of MRLS-Q are more similar to those of DQNs with only one input layer and one linear output layer. It uses the experience replay and the minibatch training mode and uses the agent’s states rather than the agent’s state-action pairs as the inputs. As a result, it can be used alone for low-dimensional problems and can be seamlessly integrated into DQN as the last layer for high-dimensional problems as well. In addition, MRLS-Q uses our proposed average RLS optimization technique, so that it can achieve better convergence performance whether it is used alone or integrated with DQN. At the end of this paper, we demonstrate the effectiveness of MRLS-Q on the CartPole problem and four Atari games and investigate the influences of its hyperparameters experimentally.

Download Full-text

Reinforcement learning in discrete and continuous domains applied to ship trajectory generation

Polish Maritime Research ◽

10.2478/v10012-012-0020-8 ◽

2012 ◽

Vol 19 (Special) ◽

pp. 31-36 ◽

Cited By ~ 2

Author(s):

Andrzej Rak ◽

Witold Gierusz

Keyword(s):

Reinforcement Learning ◽

Least Squares ◽

Learning Algorithm ◽

Learning Algorithms ◽

Trajectory Generation ◽

Decision Processes ◽

The Other ◽

Q Learning ◽

Continuous Domains

ABSTRACT This paper presents the application of the reinforcement learning algorithms to the task of autonomous determination of the ship trajectory during the in-harbour and harbour approaching manoeuvres. Authors used Markov decision processes formalism to build up the background of algorithm presentation. Two versions of RL algorithms were tested in the simulations: discrete (Q-learning) and continuous form (Least-Squares Policy Iteration). The results show that in both cases ship trajectory can be found. However discrete Q-learning algorithm suffered from many limitations (mainly curse of dimensionality) and practically is not applicable to the examined task. On the other hand, LSPI gave promising results. To be fully operational, proposed solution should be extended by taking into account ship heading and velocity and coupling with advanced multi-variable controller.

Download Full-text

Choice of cargo delivery option in multimodal connection based on reinforcement learning

Journal of Physics Conference Series ◽

10.1088/1742-6596/2131/3/032103 ◽

2021 ◽

Vol 2131 (3) ◽

pp. 032103

Author(s):

A P Badetskii ◽

O A Medved

Keyword(s):

Artificial Intelligence ◽

Reinforcement Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Digital Technologies ◽

Qualitative Assessment ◽

Q Learning ◽

Cargo Delivery ◽

Route Option

Abstract The article discusses the issues of choosing a route and an option of cargo flows in multimodal connection in modern conditions. Taking into account active development of artificial intelligence and digital technologies in all types of production activities, it is proposed to use reinforcement learning algorithms to solve the problem. An analysis of the existing algorithms has been carried out, on the basis of which it was found that when choosing a route option for cargo in a multimodal connection, it would be useful to have a qualitative assessment of terminal states. To obtain such an estimate, the Q-learning algorithm was applied in the article, which showed sufficient convergence and efficiency.

Download Full-text