Variational Bayesian Exploration-Based Active Sarsa Algorithm

Author(s):  
Qiming Fu ◽  
Zhengxia Yang ◽  
You Lu ◽  
Hongjie Wu ◽  
Fuyuan Hu ◽  
...  

We propose an improved variational Bayesian exploration-based active Sarsa (VBE-ASAR) algorithm, which balances the exploration–exploitation dilemma and speeds up convergence. First, during learning, a variational Bayesian method is adopted to measure the information gain, which is used as an exploration factor to construct an internal reward function for heuristic exploration. In addition, before learning, transfer learning is used to initialize the value function in order to improve exploration performance, where a bisimulation metric is introduced to measure the distance between states of the source MDP and the target MDP. Finally, we apply the proposed algorithm to the cliff-walking problem and compare it with the Sarsa, Q-Learning, VFT-Sarsa, and Bayesian Sarsa (BS) algorithms. Experimental results show that the VBE-ASAR algorithm learns faster.
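
A minimal sketch of the central mechanism described above: an exploration bonus added to the Sarsa target. The count-based `info_gain` surrogate and the `env.reset`/`env.step` interface are assumptions standing in for the paper's variational Bayesian information gain and environment, not the authors' exact formulation.

```python
import numpy as np

def sarsa_with_exploration_bonus(env, n_states, n_actions,
                                 episodes=500, alpha=0.1, gamma=0.99,
                                 epsilon=0.1, beta=1.0):
    """Tabular Sarsa where an exploration bonus is added to the reward.

    `info_gain` is a simple count-based surrogate for the variational
    Bayesian information gain (an assumption, not the paper's formula).
    """
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))

    def policy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    def info_gain(s, a):
        # Bonus shrinks as (s, a) becomes well explored.
        return 1.0 / np.sqrt(1.0 + visits[s, a])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            visits[s, a] += 1
            a2 = policy(s2)
            # Internal (shaped) reward = external reward + exploration bonus.
            r_int = r + beta * info_gain(s, a)
            target = r_int + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```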

Author(s):  
Arpita Chakraborty ◽  
Jyoti Sekhar Banerjee

The goal of this paper is to improve the performance of the well-known Q-learning algorithm, a robust machine learning technique, for path planning. Existing variants such as the Classical Q-learning (CQL) and Improved Q-learning (IQL) algorithms deal with obstacle-free environments, whereas in a real environment an agent faces obstacles very frequently. This paper therefore considers an environment with a number of obstacles and introduces a new parameter, the 'immediate penalty' incurred on collision with an obstacle. Further, the proposed technique replaces the scalar 'immediate reward' function with an 'effective immediate reward' function composed of two fuzzy parameters, 'immediate reward' and 'immediate penalty'. The fuzzification of these two parameters not only improves the learning technique, it also strikes a balance between exploration and exploitation, one of the most challenging problems in reinforcement learning. The proposed algorithm stores the Q value of the best possible action at each state and saves significant path-planning time by suggesting the best action to take at each state to move to the next. The agent thereby becomes more intelligent, as it can plan a collision-free path that avoids obstacles from a distance. The algorithm is validated through computer simulation in a maze-like environment and in real time on the KheperaII platform. The analysis reveals that the Q table obtained by the proposed Advanced Q-learning (AQL) algorithm, when used for mobile-robot path planning, outperforms classical and improved Q-learning.
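
A sketch of how an 'effective immediate reward' could fuzzify reward and penalty before the usual Q update. The triangular membership functions, the distance normalization, and the simple subtraction used to combine the two fuzzy terms are illustrative assumptions, not the paper's exact fuzzification.

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular fuzzy membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def effective_immediate_reward(distance_to_goal, distance_to_obstacle):
    """Combine a fuzzy 'immediate reward' (progress toward the goal) with a
    fuzzy 'immediate penalty' (proximity to an obstacle) into one scalar.

    Distances are assumed normalized to [0, 1]; functions and weighting are
    assumptions for illustration.
    """
    # Reward membership grows as the agent nears the goal.
    reward = triangular(1.0 - distance_to_goal, 0.0, 1.0, 2.0)
    # Penalty membership grows as the agent nears an obstacle.
    penalty = triangular(1.0 - distance_to_obstacle, 0.0, 1.0, 2.0)
    return reward - penalty

def q_update(Q, s, a, s_next, r_eff, alpha=0.2, gamma=0.9):
    """Standard tabular Q update driven by the effective immediate reward."""
    Q[s, a] += alpha * (r_eff + gamma * np.max(Q[s_next]) - Q[s, a])
```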


2010 ◽  
Vol 42 (1) ◽  
pp. 158-182 ◽  
Author(s):  
Kurt Helmes ◽  
Richard H. Stockbridge

A new approach to the solution of optimal stopping problems for one-dimensional diffusions is developed. It arises by imbedding the stochastic problem in a linear programming problem over a space of measures. Optimizing over a smaller class of stopping rules provides a lower bound on the value of the original problem. Then the weak duality of a restricted form of the dual linear program provides an upper bound on the value. An explicit formula for the reward earned using a two-point hitting time stopping rule allows us to prove strong duality between these problems and, therefore, allows us to either optimize over these simpler stopping rules or to solve the restricted dual program. Each optimization problem is parameterized by the initial value of the diffusion and, thus, we are able to construct the value function by solving the family of optimization problems. This methodology requires little regularity of the terminal reward function. When the reward function is smooth, the optimal stopping locations are shown to satisfy the smooth pasting principle. The procedure is illustrated using two examples.
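
A concrete illustration of the two-point hitting rule in the simplest setting: undiscounted standard Brownian motion, where the probability of exiting an interval (a, b) at b from a start point x is the classical (x - a)/(b - a). The reward g and the bounded grid of candidate thresholds below are assumptions for illustration; the paper's explicit formula covers general one-dimensional diffusions.

```python
import numpy as np

def two_point_value(g, x, grid):
    """Best expected reward over two-point hitting rules for standard
    Brownian motion started at x (no discounting).

    For a < x < b the process exits (a, b) at b with probability
    (x - a) / (b - a), so the rule's expected reward is
    g(a) * (b - x)/(b - a) + g(b) * (x - a)/(b - a).
    Stopping immediately (reward g(x)) is included as the degenerate rule.
    """
    best = g(x)
    lows = grid[grid < x]
    highs = grid[grid > x]
    for a in lows:
        for b in highs:
            p_b = (x - a) / (b - a)
            best = max(best, g(a) * (1.0 - p_b) + g(b) * p_b)
    return best

# Example: a non-smooth reward; the value function is traced out by solving
# the family of problems indexed by the starting point x.
g = lambda s: max(abs(s) - 1.0, 0.0)
grid = np.linspace(-3.0, 3.0, 121)
values = [two_point_value(g, x, grid) for x in np.linspace(-2.0, 2.0, 9)]
```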


1999 ◽  
Vol 11 (8) ◽  
pp. 2017-2060 ◽  
Author(s):  
Csaba Szepesvári ◽  
Michael L. Littman

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
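
A schematic contrast, for a tabular MDP, between the two kinds of processes the theorem relates: a synchronous sweep that updates every state–action pair from the same previous estimate, and an asynchronous algorithm that updates one sampled pair at a time. The `sample_step` simulator interface is hypothetical, and the code is only a sketch of the algorithm family, not of the theorem itself.

```python
import numpy as np

def synchronous_q_iteration(P, R, gamma=0.95, iters=200):
    """Synchronous Q-value iteration on a known tabular MDP.

    P has shape (S, A, S) with transition probabilities; R has shape (S, A).
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * P @ V  # (S, A, S) @ (S,) -> (S, A)
    return Q

def asynchronous_q_learning(sample_step, S, A, gamma=0.95, steps=50_000, alpha=0.1):
    """Asynchronous Q-learning: one sampled (s, a, r, s') update at a time.

    `sample_step(s, a) -> (r, s_next)` is a hypothetical simulator interface.
    """
    Q = np.zeros((S, A))
    s = 0
    for _ in range(steps):
        a = np.random.randint(A)  # exploratory behavior policy
        r, s_next = sample_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```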


2005 ◽  
Vol 24 ◽  
pp. 81-108 ◽  
Author(s):  
P. Geibel ◽  
F. Wysotzki

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
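
A hedged sketch of the weighted two-criterion idea: one table estimates return, a second estimates the probability of reaching an error state, actions are chosen greedily with respect to a weighted combination, and the weight is adapted toward the risk threshold. The `sample_step` interface and the concrete update and adaptation rules are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def risk_sensitive_q_learning(sample_step, S, A, omega,
                              gamma=0.95, alpha=0.1, episodes=2000,
                              xi=1.0, xi_step=0.01):
    """Model-free sketch of learning with two criteria: return and risk.

    Q holds action values; Rho estimates the probability of reaching an
    error state.  Actions are greedy w.r.t. xi * Q - Rho, and xi is adapted
    so the learned policy's risk stays below the threshold omega.
    `sample_step(s, a) -> (r, s_next, is_error, done)` is hypothetical.
    """
    Q = np.zeros((S, A))
    Rho = np.zeros((S, A))  # estimated risk of each (s, a)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = int(np.argmax(xi * Q[s] - Rho[s]))
            r, s_next, is_error, done = sample_step(s, a)
            # Return criterion: standard Q-learning target.
            q_target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (q_target - Q[s, a])
            # Risk criterion: probability of entering an error state
            # (next-state risk approximated by the least risky action).
            rho_target = 1.0 if is_error else (0.0 if done else Rho[s_next].min())
            Rho[s, a] += alpha * (rho_target - Rho[s, a])
            s = s_next
        # Adapt the weight: emphasize risk when start-state risk is too high.
        start_risk = Rho[0].min()
        xi = max(0.0, xi - xi_step) if start_risk > omega else xi + xi_step
    return Q, Rho, xi
```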


2012 ◽  
Vol 433-440 ◽  
pp. 6033-6037
Author(s):  
Xiao Ming Liu ◽  
Xiu Ying Wang

The movement characteristics of nearby traffic flow have an important influence on the main line. A control method for expressway off-ramps based on Q-learning and extension control is established by analyzing the parameters of the off-ramp and the auxiliary road. First, a basic description of the Q-learning algorithm and of extension control is given and analyzed. Then the reward function is obtained through extension control theory to judge the state of the traffic light. Simulation results show that, in terms of the queue lengths of the off-ramp and the auxiliary road, the control method based on Q-learning and extension control greatly reduces the off-ramp queue length, which demonstrates the feasibility of the control strategy.
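
A minimal sketch of the Q-learning half of the scheme, assuming a discretized (off-ramp queue, auxiliary-road queue) state and a reward built directly from the two queue lengths in place of the extension-control evaluation. The `observe` and `apply_phase` interfaces, the bin counts, and the reward weights are hypothetical.

```python
import numpy as np

def train_ramp_signal(observe, apply_phase, n_queue_bins=10, n_phases=2,
                      episodes=200, steps=360, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning for an off-ramp traffic light.

    State: discretized (off-ramp queue, auxiliary-road queue).
    Action: signal phase.  The reward penalizes the off-ramp queue more
    heavily (an assumed weighting).
    """
    Q = np.zeros((n_queue_bins, n_queue_bins, n_phases))

    def discretize(q_ramp, q_aux):
        b = n_queue_bins - 1
        return min(int(q_ramp), b), min(int(q_aux), b)

    for _ in range(episodes):
        s = discretize(*observe())
        for _ in range(steps):
            a = np.random.randint(n_phases) if np.random.rand() < eps else int(np.argmax(Q[s]))
            apply_phase(a)
            q_ramp, q_aux = observe()
            s2 = discretize(q_ramp, q_aux)
            reward = -(2.0 * q_ramp + 1.0 * q_aux)  # off-ramp queue weighted higher
            Q[s + (a,)] += alpha * (reward + gamma * Q[s2].max() - Q[s + (a,)])
            s = s2
    return Q
```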


2019 ◽  
Vol 16 (3) ◽  
pp. 172988141985318
Author(s):  
Zhenhai Gao ◽  
Tianjun Sun ◽  
Hongwei Xiao

In the development of autonomous driving, decision-making has become one of the key technical difficulties. Traditional rule-based decision-making methods lack adaptive capacity when dealing with unfamiliar and complex traffic conditions, whereas reinforcement learning shows potential for solving sequential decision problems. In this article, an independent decision-making method based on reinforcement Q-learning is proposed. First, a Markov decision process model is established by analyzing car-following behavior. Then, the state set and action set are designed by jointly considering driving-simulator experimental results and driving-risk principles. Furthermore, the reinforcement Q-learning algorithm is developed mainly around the reward function and the update function. Finally, feasibility is verified through random simulation tests, and the improvement over a traditional method is demonstrated by comparative analysis.
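
A tabular sketch of the car-following formulation, assuming a discretized (gap, relative speed) state, a small set of acceleration actions, and a hypothetical one-step vehicle model `simulate`. The discretization, action set, and reward are assumptions for illustration, not the paper's calibrated design.

```python
import numpy as np

def car_following_q_learning(simulate, n_gap_bins=8, n_relv_bins=8,
                             accels=(-2.0, -1.0, 0.0, 1.0, 2.0),
                             episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning sketch for car-following decision-making.

    `simulate(state, accel) -> (next_state, reward, done)` is a hypothetical
    one-step model; states are (gap bin, relative-speed bin) pairs and each
    action is a longitudinal acceleration.
    """
    n_actions = len(accels)
    Q = np.zeros((n_gap_bins, n_relv_bins, n_actions))
    for _ in range(episodes):
        state, done = (n_gap_bins // 2, n_relv_bins // 2), False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[state]))
            next_state, reward, done = simulate(state, accels[a])
            target = reward + (0.0 if done else gamma * Q[next_state].max())
            Q[state + (a,)] += alpha * (target - Q[state + (a,)])
            state = next_state
    return Q
```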


2020 ◽  
Vol 0 (0) ◽  
Author(s):  
Qiangang Zheng ◽  
Zhihua Xi ◽  
Chunping Hu ◽  
Haibo ZHANG ◽  
Zhongzhi Hu

For improving the response performance of the engine, a novel aero-engine control method based on Deep Q-Learning (DQL) is proposed, and an engine controller based on DQL has been designed. The model-free algorithm Q-learning, which can be performed online, is adopted to calculate the action value function. To improve the learning capacity of DQL, the deep learning algorithm On Line Sliding Window Deep Neural Network (OL-SW-DNN) is adopted to estimate the action value function. To reduce sensitivity to noise in the training data, OL-SW-DNN selects the nearest data points of a certain window length as training data. Finally, engine acceleration simulations of DQL and of Proportion Integration Differentiation (PID) control, which is the algorithm most commonly used for engine controllers in industry, are both conducted to verify the validity of the proposed method. The results show that the acceleration time of the proposed method decreased by 1.475 seconds compared with the traditional controller while satisfying all engine limits.
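
A sketch of the sliding-window idea behind OL-SW-DNN: only the most recent transitions are retained and refit, which is what limits the sensitivity to noisy samples. A linear value model stands in for the deep network here, and the class name, window size, and learning rule are assumptions, not the paper's architecture.

```python
from collections import deque
import numpy as np

class SlidingWindowValueEstimator:
    """Online value estimator trained only on a sliding window of data.

    The linear model below is a stand-in for the OL-SW-DNN described in the
    paper; the interface and hyperparameters are illustrative assumptions.
    """

    def __init__(self, n_features, window=256, lr=1e-3, gamma=0.99):
        self.buffer = deque(maxlen=window)  # keeps only the newest `window` samples
        self.w = np.zeros(n_features)
        self.lr = lr
        self.gamma = gamma

    def value(self, features):
        return float(self.w @ features)

    def add(self, features, reward, next_features, done):
        self.buffer.append((features, reward, next_features, done))

    def train(self, epochs=1):
        # One pass of temporal-difference gradient steps over the window.
        for _ in range(epochs):
            for f, r, f2, done in self.buffer:
                target = r + (0.0 if done else self.gamma * self.value(f2))
                error = target - self.value(f)
                self.w += self.lr * error * f
```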


2020 ◽  
Vol 30 (09) ◽  
pp. 2050048
Author(s):  
Bo-Wei Chen ◽  
Shih-Hung Yang ◽  
Yu-Chun Lo ◽  
Ching-Fu Wang ◽  
Han-Lin Wang ◽  
...  

Hippocampal place cells and interneurons in mammals have stable place fields and theta phase precession profiles that encode spatial environmental information. Hippocampal CA1 neurons can represent the animal’s location and prospective information about the goal location. Reinforcement learning (RL) algorithms such as Q-learning have been used to build navigation models. However, traditional Q-learning (tQ-learning) limits the reward function once the animal arrives at the goal location, leading to unsatisfactory location accuracy and convergence rates. Therefore, we proposed a revised version of the Q-learning algorithm, dynamical Q-learning (dQ-learning), which assigns the reward function adaptively to improve the decoding performance. Firing rate was the input of the neural network of dQ-learning and was used to predict the movement direction. Phase precession, in turn, was the input of the reward function used to update the weights of dQ-learning. Trajectory predictions using dQ- and tQ-learning were compared by the root mean squared error (RMSE) between the actual and predicted rat trajectories. Using dQ-learning, significantly higher prediction accuracy and a faster convergence rate were obtained compared with tQ-learning in all cell types. Moreover, combining place cells and interneurons with theta phase precession improved the convergence rate and prediction accuracy. The proposed dQ-learning algorithm is a quick and more accurate method to perform trajectory reconstruction and prediction.
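
A sketch contrasting a goal-only reward with an adaptively assigned, per-step reward graded by decoding error, which stands in for the phase-precession-driven reward assignment described above. The function forms, scales, and the toy trajectory are illustrative assumptions, not the paper's decoder.

```python
import numpy as np

def traditional_reward(position, goal, radius=0.05):
    """tQ-learning-style reward: nonzero only when the goal is reached."""
    return 1.0 if np.linalg.norm(position - goal) < radius else 0.0

def dynamical_reward(predicted, actual, scale=1.0):
    """Adaptively assigned reward in the spirit of dQ-learning.

    Every step is graded by how well the predicted movement matches the
    actual trajectory; in the paper this grading is driven by theta phase
    precession, which the decoding-error term below merely stands in for.
    """
    error = np.linalg.norm(predicted - actual)
    return np.exp(-scale * error)  # high reward for accurate predictions

# Example: per-step rewards along a short synthetic trajectory.
actual = np.cumsum(np.random.randn(20, 2) * 0.01, axis=0)
predicted = actual + np.random.randn(20, 2) * 0.02
rewards = [dynamical_reward(p, a) for p, a in zip(predicted, actual)]
```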

