Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

Author(s):  
Thiago D. Simão ◽  
Matthijs T. J. Spaan

We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.
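The core bootstrapping idea can be illustrated with a minimal tabular sketch: wherever a state-action pair has too few samples to trust the batch estimates, the new policy simply copies the baseline, and the freed probability mass is placed greedily only on well-estimated actions. The names q_hat, pi_baseline, counts, and n_wedge below are illustrative assumptions, and the factored extension the paper contributes (counting per parameter of the factored representation rather than per state) is not shown.

```python
import numpy as np

def bootstrapped_policy(q_hat, pi_baseline, counts, n_wedge):
    """Sketch of baseline-bootstrapped policy improvement (tabular, non-factored).

    q_hat       : (S, A) action-value estimates from the batch of past experience
    pi_baseline : (S, A) probabilities of the policy already running
    counts      : (S, A) number of samples observed per state-action pair
    n_wedge     : count threshold below which the baseline is kept unchanged
    """
    S, A = q_hat.shape
    pi_new = np.zeros((S, A))
    for s in range(S):
        uncertain = counts[s] < n_wedge
        pi_new[s, uncertain] = pi_baseline[s, uncertain]   # keep baseline where data is scarce
        free_mass = 1.0 - pi_new[s].sum()                  # mass we are allowed to reallocate
        certain = np.flatnonzero(~uncertain)
        if certain.size > 0 and free_mass > 0:
            best = certain[np.argmax(q_hat[s, certain])]
            pi_new[s, best] += free_mass                   # act greedily only where q_hat is trusted
    return pi_new
```

With a factored model, the counts that trigger bootstrapping can be aggregated over the parameters of the factored representation instead of over individual states, which is what allows the confidence bound to shrink with far fewer samples.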

2002 ◽  
Vol 16 ◽  
pp. 59-104 ◽  
Author(s):  
C. Drummond

This paper discusses a system that accelerates reinforcement learning by using transfer from related tasks. Without such transfer, even if two tasks are very similar at some abstract level, an extensive re-learning effort is required. The system achieves much of its power by transferring parts of previously learned solutions rather than a single complete solution. The system exploits strong features in the multi-dimensional function produced by reinforcement learning in solving a particular task. These features are stable and easy to recognize early in the learning process. They generate a partitioning of the state space and thus the function. The partition is represented as a graph. This is used to index and compose functions stored in a case base to form a close approximation to the solution of the new task. Experiments demonstrate that function composition often produces more than an order of magnitude increase in learning rate compared to a basic reinforcement learning algorithm.
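As a purely illustrative reading of the composition step, the partition can be thought of as an index into a case base of previously learned value functions, which are stitched together to initialize the new task. The helper names below (partition, case_base) are hypothetical and the paper's graph-matching machinery is far richer than this sketch.

```python
# Illustrative only: build an initial value function for a new task by
# composing stored pieces, indexed by a partition of the state space.
def compose_value_function(partition, case_base):
    """partition : maps a state to a region label (hypothetical helper)
       case_base : maps a region label to a stored value function V(s)"""
    def v_init(state):
        region = partition(state)
        piece = case_base.get(region)
        return piece(state) if piece is not None else 0.0  # default when no case matches
    return v_init
```

The composed function then serves as a starting point that ordinary reinforcement learning refines, which is where the reported order-of-magnitude speed-up comes from.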


2012 ◽  
Vol 19 (Special) ◽  
pp. 31-36 ◽  
Author(s):  
Andrzej Rak ◽  
Witold Gierusz

ABSTRACT This paper presents the application of reinforcement learning algorithms to the task of autonomous determination of the ship trajectory during in-harbour and harbour-approach manoeuvres. The authors use the Markov decision process formalism as the background for presenting the algorithms. Two versions of RL algorithms were tested in simulation: a discrete form (Q-learning) and a continuous form (Least-Squares Policy Iteration). The results show that in both cases a ship trajectory can be found. However, the discrete Q-learning algorithm suffers from several limitations (mainly the curse of dimensionality) and is practically not applicable to the examined task. LSPI, on the other hand, gave promising results. To be fully operational, the proposed solution should be extended to take ship heading and velocity into account and should be coupled with an advanced multi-variable controller.
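For reference, the discrete variant mentioned above reduces to the standard tabular Q-learning loop below. The environment interface (reset/step), the state discretisation (e.g. binned cross-track and heading errors), and the hyperparameters are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Minimal tabular Q-learning with epsilon-greedy exploration."""
    q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the discretised rudder commands
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(q[s].argmax())
            s_next, r, done = env.step(a)          # assumed environment interface
            target = r + (0.0 if done else gamma * q[s_next].max())
            q[s, a] += alpha * (target - q[s, a])  # temporal-difference update
            s = s_next
    return q
```

The table grows with the product of the discretisation resolutions, which is exactly the curse of dimensionality the abstract points to; LSPI avoids it by working with a linear function approximation instead of a table.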


Author(s):  
Zongzhang Zhang ◽  
Zhiyuan Pan ◽  
Mykel J. Kochenderfer

Q-learning is a popular reinforcement learning algorithm, but it can perform poorly in stochastic environments due to overestimating action values. Overestimation is due to the use of a single estimator that uses the maximum action value as an approximation for the maximum expected action value. To avoid overestimation in Q-learning, the double Q-learning algorithm was recently proposed, which uses the double estimator method. It uses two estimators from independent sets of experiences, with one estimator determining the maximizing action and the other providing the estimate of its value. Double Q-learning sometimes underestimates the action values. This paper introduces a weighted double Q-learning algorithm, which is based on the construction of the weighted double estimator, with the goal of balancing between the overestimation in the single estimator and the underestimation in the double estimator. Empirically, the new algorithm is shown to perform well on several MDP problems.
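A single update step of the weighted double estimator can be sketched as follows. The two tables and the random choice of which one to update follow the double Q-learning scheme; the interpolation weight beta is kept constant here purely to keep the sketch short, whereas the paper constructs it adaptively.

```python
import numpy as np

def weighted_double_q_update(qa, qb, s, a, r, s_next, done,
                             alpha=0.1, gamma=0.95, beta=0.5, rng=None):
    """One illustrative update of a weighted double estimator.

    qa, qb : two (S, A) tables learned from independent experience
    beta   : weight between the single estimator (beta = 1, plain Q-learning,
             tends to overestimate) and the double estimator (beta = 0,
             double Q-learning, tends to underestimate)
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:            # randomly pick which table to update
        primary, other = qa, qb
    else:
        primary, other = qb, qa
    a_star = int(primary[s_next].argmax())     # maximising action from the primary table
    single = primary[s_next, a_star]           # single-estimator value of that action
    double = other[s_next, a_star]             # double-estimator value of the same action
    backup = 0.0 if done else gamma * (beta * single + (1.0 - beta) * double)
    primary[s, a] += alpha * (r + backup - primary[s, a])
```

Setting beta between 0 and 1 is what lets the weighted estimator trade off the two biases rather than committing to either extreme.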


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 290 ◽  
Author(s):  
SeungYoon Choi ◽  
Tuyen Le ◽  
Quang Nguyen ◽  
Md Layek ◽  
SeungGwan Lee ◽  
...  

In this paper, we propose a controller for a bicycle using the DDPG (Deep Deterministic Policy Gradient) algorithm, a state-of-the-art deep reinforcement learning algorithm. We use a reward function and a deep neural network to build the controller. With the proposed controller, a bicycle can not only be stably balanced but also travel to any specified location. We confirm that the controller with DDPG shows better performance than other baselines such as Normalized Advantage Function (NAF) and Proximal Policy Optimization (PPO). For the performance evaluation, we tested the proposed algorithm in various settings, such as fixed and random speeds, start locations, and destination locations.
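The core DDPG update used by such a controller can be sketched as below. The network sizes, the seven-dimensional bicycle state, the two-dimensional action, and the hyperparameters are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 7, 2   # assumed: roll, roll rate, steering, position, goal offset, ...

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out))

actor, actor_target = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done, gamma=0.99, tau=0.005):
    """One minibatch update; s, a, s_next are (B, dim) tensors, r and done are (B, 1)."""
    # critic: regress Q(s, a) onto the bootstrapped target from the target networks
    with torch.no_grad():
        a_next = torch.tanh(actor_target(s_next))              # deterministic, bounded target action
        q_next = critic_target(torch.cat([s_next, a_next], dim=1))
        y = r + gamma * (1.0 - done) * q_next
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: ascend the critic's estimate of Q(s, actor(s))
    actor_loss = -critic(torch.cat([s, torch.tanh(actor(s))], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging keeps the target networks slowly tracking the learned ones
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```

The deterministic actor and the learned critic are what make DDPG suitable for the continuous steering and balancing actions the task requires.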


Robotica ◽  
2004 ◽  
Vol 22 (1) ◽  
pp. 29-39 ◽  
Author(s):  
Chee-Meng Chew ◽  
Gill A. Pratt

This paper presents two frontal plane algorithms for 3D dynamic bipedal walking. One is based on the notion of symmetry, and the other uses a reinforcement learning algorithm to learn lateral foot placement. The algorithms are combined with a sagittal plane algorithm and successfully applied to a simulated 3D bipedal robot to achieve level-ground walking. The simulation results show that the choice of the local control law for the stance-ankle roll joint can significantly affect the performance of the frontal plane algorithms.


1994 ◽  
Vol 59 (6) ◽  
pp. 1439-1450 ◽  
Author(s):  
Miroslava Žertová ◽  
Jiřina Slaninová ◽  
Zdenko Procházka

An analysis was made of the uterotonic potencies of all analogs having substituted L- or D-tyrosine or -phenylalanine in position 2 and L-arginine, D-arginine or D-homoarginine in position 8. The series of previously published analogs was completed by the solid-phase synthesis of ten new analogs having L- or D-Phe, L- or D-Phe(2-Et), L- or D-Phe(2,4,6-triMe) or D-Tyr(Me) in position 2 and either L- or D-arginine in position 8. All newly synthesized analogs were found to be uterotonic inhibitors. Deamination increases both the agonistic and the antagonistic potency. In the case of the phenylalanine analogs, the change of configuration from L to D in position 2 enhances the uterotonic inhibition by more than one order of magnitude, whereas the L-to-D change in position 8 enhances the inhibitory potency only negligibly. Prolongation of the side chain of the D-basic amino acid in position 8 seems to decrease the inhibitory potency slightly when there is an L-substituted amino acid in position 2. On the other hand, there is a tendency toward increased inhibitory potency when there is a D-substituted amino acid in position 2.


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 471 ◽  
Author(s):  
Jai Hoon Park ◽  
Kang Hoon Lee

Designing novel robots that can cope with a specific task is a challenging problem because of the enormous design space, which involves both morphological structures and control mechanisms. To this end, we present a computational method for automating the design of modular robots. Our method employs a genetic algorithm to evolve robotic structures as an outer optimization, and it applies a reinforcement learning algorithm to each candidate structure to train its behavior and evaluate its potential learning ability as an inner optimization. Compared to evolving both the structure and the behavior simultaneously, the size of the design space is reduced significantly by evolving only the robotic structure and optimizing its behavior with a separate training algorithm. Mutual dependence between evolution and learning is achieved by regarding the mean cumulative reward of a candidate structure under reinforcement learning as its fitness in the genetic algorithm. Our method therefore searches for prospective robotic structures that can potentially lead to near-optimal behaviors if trained sufficiently. We demonstrate the usefulness of our method through several effective design results that were automatically generated in the process of experimenting with an actual modular robotics kit.
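The evolve-then-train loop described above can be sketched as follows. The encoding of a candidate structure and the mutate, crossover, and train_with_rl helpers are placeholders for illustration, not the paper's actual representation; train_with_rl is assumed to return the mean cumulative reward obtained after training the candidate.

```python
import random

def evolve_structures(population, generations, mutate, crossover, train_with_rl):
    """Outer genetic loop over robot structures with an RL-based fitness (illustrative)."""
    for _ in range(generations):
        # inner optimisation: train each candidate's behaviour with RL and use its
        # mean cumulative reward as the fitness for evolution
        scored = [(train_with_rl(structure), structure) for structure in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        elite = [structure for _, structure in scored[:len(scored) // 2]]
        # outer optimisation: recombine and mutate the better half to refill the population
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(len(population) - len(elite))]
        population = elite + children
    return population
```

Because fitness is the reward reached after training rather than the reward of an untrained controller, the search favours structures that are easy to learn with, not merely structures that happen to perform well initially.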

