Variational Bayesian Exploration-Based Active Sarsa Algorithm

Author(s):  
Qiming Fu ◽  
Zhengxia Yang ◽  
You Lu ◽  
Hongjie Wu ◽  
Fuyuan Hu ◽  
...  

We propose an improved variational Bayesian exploration-based active Sarsa (VBE-ASAR) algorithm, which balances the exploration–exploitation dilemma and speeds up convergence. First, during learning, a variational Bayesian method is adopted to measure the information gain, which is used as an exploration factor to construct an internal reward function for heuristic exploration. In addition, before learning, transfer learning is used to initialize the value function in order to improve exploration performance, where a bisimulation metric is introduced to measure the distance between states of the source MDP and the target MDP. Finally, we apply the proposed algorithm to the cliff-walking problem and compare it with the Sarsa, Q-Learning, VFT-Sarsa, and Bayesian Sarsa (BS) algorithms. Experimental results show that the VBE-ASAR algorithm learns faster.
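
A minimal sketch of the central mechanism described above: an exploration bonus added to the Sarsa target. The count-based `info_gain` surrogate and the `env.reset`/`env.step` interface are assumptions standing in for the paper's variational Bayesian information gain and environment, not the authors' exact formulation.

```python
import numpy as np

def sarsa_with_exploration_bonus(env, n_states, n_actions,
                                 episodes=500, alpha=0.1, gamma=0.99,
                                 epsilon=0.1, beta=1.0):
    """Tabular Sarsa where an exploration bonus is added to the reward.

    `info_gain` is a simple count-based surrogate for the variational
    Bayesian information gain (an assumption, not the paper's formula).
    """
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))

    def policy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    def info_gain(s, a):
        # Bonus shrinks as (s, a) becomes well explored.
        return 1.0 / np.sqrt(1.0 + visits[s, a])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            visits[s, a] += 1
            a2 = policy(s2)
            # Internal (shaped) reward = external reward + exploration bonus.
            r_int = r + beta * info_gain(s, a)
            target = r_int + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```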

Author(s):  
Arpita Chakraborty ◽  
Jyoti Sekhar Banerjee

The goal of this paper is to improve the performance of the well-known Q-learning algorithm, a robust machine learning technique, for path planning. Existing variants such as the Classical Q-learning (CQL) and Improved Q-learning (IQL) algorithms deal with obstacle-free environments, whereas in a real environment an agent faces obstacles very frequently. This paper therefore considers an environment with a number of obstacles and introduces a new parameter, the 'immediate penalty' incurred on collision with an obstacle. Further, the proposed technique replaces the scalar 'immediate reward' function with an 'effective immediate reward' function composed of two fuzzy parameters, 'immediate reward' and 'immediate penalty'. The fuzzification of these two parameters not only improves the learning technique, it also strikes a balance between exploration and exploitation, one of the most challenging problems in reinforcement learning. The proposed algorithm stores the Q value of the best possible action at each state and saves significant path-planning time by suggesting the best action to take at each state to move to the next. The agent thereby becomes more intelligent, as it can plan a collision-free path that avoids obstacles from a distance. The algorithm is validated through computer simulation in a maze-like environment and in real time on the KheperaII platform. The analysis reveals that the Q table obtained by the proposed Advanced Q-learning (AQL) algorithm, when used for mobile-robot path planning, outperforms classical and improved Q-learning.
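
A sketch of how an 'effective immediate reward' could fuzzify reward and penalty before the usual Q update. The triangular membership functions, the distance normalization, and the simple subtraction used to combine the two fuzzy terms are illustrative assumptions, not the paper's exact fuzzification.

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular fuzzy membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def effective_immediate_reward(distance_to_goal, distance_to_obstacle):
    """Combine a fuzzy 'immediate reward' (progress toward the goal) with a
    fuzzy 'immediate penalty' (proximity to an obstacle) into one scalar.

    Distances are assumed normalized to [0, 1]; functions and weighting are
    assumptions for illustration.
    """
    # Reward membership grows as the agent nears the goal.
    reward = triangular(1.0 - distance_to_goal, 0.0, 1.0, 2.0)
    # Penalty membership grows as the agent nears an obstacle.
    penalty = triangular(1.0 - distance_to_obstacle, 0.0, 1.0, 2.0)
    return reward - penalty

def q_update(Q, s, a, s_next, r_eff, alpha=0.2, gamma=0.9):
    """Standard tabular Q update driven by the effective immediate reward."""
    Q[s, a] += alpha * (r_eff + gamma * np.max(Q[s_next]) - Q[s, a])
```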


2010 ◽  
Vol 42 (1) ◽  
pp. 158-182 ◽  
Author(s):  
Kurt Helmes ◽  
Richard H. Stockbridge

A new approach to the solution of optimal stopping problems for one-dimensional diffusions is developed. It arises by imbedding the stochastic problem in a linear programming problem over a space of measures. Optimizing over a smaller class of stopping rules provides a lower bound on the value of the original problem. Then the weak duality of a restricted form of the dual linear program provides an upper bound on the value. An explicit formula for the reward earned using a two-point hitting time stopping rule allows us to prove strong duality between these problems and, therefore, allows us to either optimize over these simpler stopping rules or to solve the restricted dual program. Each optimization problem is parameterized by the initial value of the diffusion and, thus, we are able to construct the value function by solving the family of optimization problems. This methodology requires little regularity of the terminal reward function. When the reward function is smooth, the optimal stopping locations are shown to satisfy the smooth pasting principle. The procedure is illustrated using two examples.
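
A concrete illustration of the two-point hitting rule in the simplest setting: undiscounted standard Brownian motion, where the probability of exiting an interval (a, b) at b from a start point x is the classical (x - a)/(b - a). The reward g and the bounded grid of candidate thresholds below are assumptions for illustration; the paper's explicit formula covers general one-dimensional diffusions.

```python
import numpy as np

def two_point_value(g, x, grid):
    """Best expected reward over two-point hitting rules for standard
    Brownian motion started at x (no discounting).

    For a < x < b the process exits (a, b) at b with probability
    (x - a) / (b - a), so the rule's expected reward is
    g(a) * (b - x)/(b - a) + g(b) * (x - a)/(b - a).
    Stopping immediately (reward g(x)) is included as the degenerate rule.
    """
    best = g(x)
    lows = grid[grid < x]
    highs = grid[grid > x]
    for a in lows:
        for b in highs:
            p_b = (x - a) / (b - a)
            best = max(best, g(a) * (1.0 - p_b) + g(b) * p_b)
    return best

# Example: a non-smooth reward; the value function is traced out by solving
# the family of problems indexed by the starting point x.
g = lambda s: max(abs(s) - 1.0, 0.0)
grid = np.linspace(-3.0, 3.0, 121)
values = [two_point_value(g, x, grid) for x in np.linspace(-2.0, 2.0, 9)]
```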


1999 ◽  
Vol 11 (8) ◽  
pp. 2017-2060 ◽  
Author(s):  
Csaba Szepesvári ◽  
Michael L. Littman

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
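
A schematic contrast, for a tabular MDP, between the two kinds of processes the theorem relates: a synchronous sweep that updates every state–action pair from the same previous estimate, and an asynchronous algorithm that updates one sampled pair at a time. The `sample_step` simulator interface is hypothetical, and the code is only a sketch of the algorithm family, not of the theorem itself.

```python
import numpy as np

def synchronous_q_iteration(P, R, gamma=0.95, iters=200):
    """Synchronous Q-value iteration on a known tabular MDP.

    P has shape (S, A, S) with transition probabilities; R has shape (S, A).
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * P @ V  # (S, A, S) @ (S,) -> (S, A)
    return Q

def asynchronous_q_learning(sample_step, S, A, gamma=0.95, steps=50_000, alpha=0.1):
    """Asynchronous Q-learning: one sampled (s, a, r, s') update at a time.

    `sample_step(s, a) -> (r, s_next)` is a hypothetical simulator interface.
    """
    Q = np.zeros((S, A))
    s = 0
    for _ in range(steps):
        a = np.random.randint(A)  # exploratory behavior policy
        r, s_next = sample_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```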


2005 ◽  
Vol 24 ◽  
pp. 81-108 ◽  
Author(s):  
P. Geibel ◽  
F. Wysotzki

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
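
A hedged sketch of the weighted two-criterion idea: one table estimates return, a second estimates the probability of reaching an error state, actions are chosen greedily with respect to a weighted combination, and the weight is adapted toward the risk threshold. The `sample_step` interface and the concrete update and adaptation rules are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def risk_sensitive_q_learning(sample_step, S, A, omega,
                              gamma=0.95, alpha=0.1, episodes=2000,
                              xi=1.0, xi_step=0.01):
    """Model-free sketch of learning with two criteria: return and risk.

    Q holds action values; Rho estimates the probability of reaching an
    error state.  Actions are greedy w.r.t. xi * Q - Rho, and xi is adapted
    so the learned policy's risk stays below the threshold omega.
    `sample_step(s, a) -> (r, s_next, is_error, done)` is hypothetical.
    """
    Q = np.zeros((S, A))
    Rho = np.zeros((S, A))  # estimated risk of each (s, a)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = int(np.argmax(xi * Q[s] - Rho[s]))
            r, s_next, is_error, done = sample_step(s, a)
            # Return criterion: standard Q-learning target.
            q_target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (q_target - Q[s, a])
            # Risk criterion: probability of entering an error state
            # (next-state risk approximated by the least risky action).
            rho_target = 1.0 if is_error else (0.0 if done else Rho[s_next].min())
            Rho[s, a] += alpha * (rho_target - Rho[s, a])
            s = s_next
        # Adapt the weight: emphasize risk when start-state risk is too high.
        start_risk = Rho[0].min()
        xi = max(0.0, xi - xi_step) if start_risk > omega else xi + xi_step
    return Q, Rho, xi
```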


2012 ◽  
Vol 433-440 ◽  
pp. 6033-6037
Author(s):  
Xiao Ming Liu ◽  
Xiu Ying Wang

The movement characteristics of nearby traffic flow have an important influence on the main line. A control method for expressway off-ramps based on Q-learning and extension control is established by analyzing the parameters of the off-ramp and the auxiliary road. First, a basic description of the Q-learning algorithm and of extension control is given and analyzed. Then the reward function is obtained through extension control theory to judge the state of the traffic light. Simulation results show that, in terms of the queue lengths of the off-ramp and the auxiliary road, the control method based on Q-learning and extension control greatly reduces the off-ramp queue length, which demonstrates the feasibility of the control strategy.
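
A minimal sketch of the Q-learning half of the scheme, assuming a discretized (off-ramp queue, auxiliary-road queue) state and a reward built directly from the two queue lengths in place of the extension-control evaluation. The `observe` and `apply_phase` interfaces, the bin counts, and the reward weights are hypothetical.

```python
import numpy as np

def train_ramp_signal(observe, apply_phase, n_queue_bins=10, n_phases=2,
                      episodes=200, steps=360, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning for an off-ramp traffic light.

    State: discretized (off-ramp queue, auxiliary-road queue).
    Action: signal phase.  The reward penalizes the off-ramp queue more
    heavily (an assumed weighting).
    """
    Q = np.zeros((n_queue_bins, n_queue_bins, n_phases))

    def discretize(q_ramp, q_aux):
        b = n_queue_bins - 1
        return min(int(q_ramp), b), min(int(q_aux), b)

    for _ in range(episodes):
        s = discretize(*observe())
        for _ in range(steps):
            a = np.random.randint(n_phases) if np.random.rand() < eps else int(np.argmax(Q[s]))
            apply_phase(a)
            q_ramp, q_aux = observe()
            s2 = discretize(q_ramp, q_aux)
            reward = -(2.0 * q_ramp + 1.0 * q_aux)  # off-ramp queue weighted higher
            Q[s + (a,)] += alpha * (reward + gamma * Q[s2].max() - Q[s + (a,)])
            s = s2
    return Q
```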


2019 ◽  
Vol 16 (3) ◽  
pp. 172988141985318
Author(s):  
Zhenhai Gao ◽  
Tianjun Sun ◽  
Hongwei Xiao

In the development of autonomous driving, decision-making has become one of the key technical difficulties. Traditional rule-based decision-making methods lack adaptive capacity when dealing with unfamiliar and complex traffic conditions, whereas reinforcement learning shows potential for solving sequential decision problems. In this article, an independent decision-making method based on reinforcement Q-learning is proposed. First, a Markov decision process model is established by analyzing car-following behavior. Then, the state set and action set are designed by jointly considering driving-simulator experimental results and driving-risk principles. Furthermore, the reinforcement Q-learning algorithm is developed mainly around the reward function and the update function. Finally, feasibility is verified through random simulation tests, and the improvement over a traditional method is demonstrated by comparative analysis.
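
A tabular sketch of the car-following formulation, assuming a discretized (gap, relative speed) state, a small set of acceleration actions, and a hypothetical one-step vehicle model `simulate`. The discretization, action set, and reward are assumptions for illustration, not the paper's calibrated design.

```python
import numpy as np

def car_following_q_learning(simulate, n_gap_bins=8, n_relv_bins=8,
                             accels=(-2.0, -1.0, 0.0, 1.0, 2.0),
                             episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning sketch for car-following decision-making.

    `simulate(state, accel) -> (next_state, reward, done)` is a hypothetical
    one-step model; states are (gap bin, relative-speed bin) pairs and each
    action is a longitudinal acceleration.
    """
    n_actions = len(accels)
    Q = np.zeros((n_gap_bins, n_relv_bins, n_actions))
    for _ in range(episodes):
        state, done = (n_gap_bins // 2, n_relv_bins // 2), False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[state]))
            next_state, reward, done = simulate(state, accels[a])
            target = reward + (0.0 if done else gamma * Q[next_state].max())
            Q[state + (a,)] += alpha * (target - Q[state + (a,)])
            state = next_state
    return Q
```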


2020 ◽  
Vol 0 (0) ◽  
Author(s):  
Qiangang Zheng ◽  
Zhihua Xi ◽  
Chunping Hu ◽  
Haibo ZHANG ◽  
Zhongzhi Hu

For improving the response performance of the engine, a novel aero-engine control method based on Deep Q-Learning (DQL) is proposed, and an engine controller based on DQL has been designed. The model-free algorithm Q-learning, which can be performed online, is adopted to calculate the action value function. To improve the learning capacity of DQL, the deep learning algorithm On Line Sliding Window Deep Neural Network (OL-SW-DNN) is adopted to estimate the action value function. To reduce sensitivity to noise in the training data, OL-SW-DNN selects the nearest data points of a certain window length as training data. Finally, engine acceleration simulations of DQL and of Proportion Integration Differentiation (PID) control, which is the algorithm most commonly used for engine controllers in industry, are both conducted to verify the validity of the proposed method. The results show that the acceleration time of the proposed method decreased by 1.475 seconds compared with the traditional controller while satisfying all engine limits.
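
A sketch of the sliding-window idea behind OL-SW-DNN: only the most recent transitions are retained and refit, which is what limits the sensitivity to noisy samples. A linear value model stands in for the deep network here, and the class name, window size, and learning rule are assumptions, not the paper's architecture.

```python
from collections import deque
import numpy as np

class SlidingWindowValueEstimator:
    """Online value estimator trained only on a sliding window of data.

    The linear model below is a stand-in for the OL-SW-DNN described in the
    paper; the interface and hyperparameters are illustrative assumptions.
    """

    def __init__(self, n_features, window=256, lr=1e-3, gamma=0.99):
        self.buffer = deque(maxlen=window)  # keeps only the newest `window` samples
        self.w = np.zeros(n_features)
        self.lr = lr
        self.gamma = gamma

    def value(self, features):
        return float(self.w @ features)

    def add(self, features, reward, next_features, done):
        self.buffer.append((features, reward, next_features, done))

    def train(self, epochs=1):
        # One pass of temporal-difference gradient steps over the window.
        for _ in range(epochs):
            for f, r, f2, done in self.buffer:
                target = r + (0.0 if done else self.gamma * self.value(f2))
                error = target - self.value(f)
                self.w += self.lr * error * f
```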


2020 ◽  
Vol 30 (09) ◽  
pp. 2050048
Author(s):  
Bo-Wei Chen ◽  
Shih-Hung Yang ◽  
Yu-Chun Lo ◽  
Ching-Fu Wang ◽  
Han-Lin Wang ◽  
...  

Hippocampal place cells and interneurons in mammals have stable place fields and theta phase precession profiles that encode spatial environmental information. Hippocampal CA1 neurons can represent the animal’s location and prospective information about the goal location. Reinforcement learning (RL) algorithms such as Q-learning have been used to build navigation models. However, traditional Q-learning (tQ-learning) limits the reward function once the animal arrives at the goal location, leading to unsatisfactory location accuracy and convergence rates. Therefore, we proposed a revised version of the Q-learning algorithm, dynamical Q-learning (dQ-learning), which assigns the reward function adaptively to improve the decoding performance. Firing rate was the input of the neural network of dQ-learning and was used to predict the movement direction. Phase precession, in turn, was the input of the reward function used to update the weights of dQ-learning. Trajectory predictions using dQ- and tQ-learning were compared by the root mean squared error (RMSE) between the actual and predicted rat trajectories. Using dQ-learning, significantly higher prediction accuracy and a faster convergence rate were obtained compared with tQ-learning in all cell types. Moreover, combining place cells and interneurons with theta phase precession improved the convergence rate and prediction accuracy. The proposed dQ-learning algorithm is a quick and more accurate method to perform trajectory reconstruction and prediction.
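
A sketch contrasting a goal-only reward with an adaptively assigned, per-step reward graded by decoding error, which stands in for the phase-precession-driven reward assignment described above. The function forms, scales, and the toy trajectory are illustrative assumptions, not the paper's decoder.

```python
import numpy as np

def traditional_reward(position, goal, radius=0.05):
    """tQ-learning-style reward: nonzero only when the goal is reached."""
    return 1.0 if np.linalg.norm(position - goal) < radius else 0.0

def dynamical_reward(predicted, actual, scale=1.0):
    """Adaptively assigned reward in the spirit of dQ-learning.

    Every step is graded by how well the predicted movement matches the
    actual trajectory; in the paper this grading is driven by theta phase
    precession, which the decoding-error term below merely stands in for.
    """
    error = np.linalg.norm(predicted - actual)
    return np.exp(-scale * error)  # high reward for accurate predictions

# Example: per-step rewards along a short synthetic trajectory.
actual = np.cumsum(np.random.randn(20, 2) * 0.01, axis=0)
predicted = actual + np.random.randn(20, 2) * 0.02
rewards = [dynamical_reward(p, a) for p, a in zip(predicted, actual)]
```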

