Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

Author(s):  
Thiago D. Simão ◽  
Matthijs T. J. Spaan

We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.
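The core bootstrapping idea can be illustrated with a minimal tabular sketch: wherever a state-action pair has too few samples to trust the batch estimates, the new policy simply copies the baseline, and the freed probability mass is placed greedily only on well-estimated actions. The names q_hat, pi_baseline, counts, and n_wedge below are illustrative assumptions, and the factored extension the paper contributes (counting per parameter of the factored representation rather than per state) is not shown.

```python
import numpy as np

def bootstrapped_policy(q_hat, pi_baseline, counts, n_wedge):
    """Sketch of baseline-bootstrapped policy improvement (tabular, non-factored).

    q_hat       : (S, A) action-value estimates from the batch of past experience
    pi_baseline : (S, A) probabilities of the policy already running
    counts      : (S, A) number of samples observed per state-action pair
    n_wedge     : count threshold below which the baseline is kept unchanged
    """
    S, A = q_hat.shape
    pi_new = np.zeros((S, A))
    for s in range(S):
        uncertain = counts[s] < n_wedge
        pi_new[s, uncertain] = pi_baseline[s, uncertain]   # keep baseline where data is scarce
        free_mass = 1.0 - pi_new[s].sum()                  # mass we are allowed to reallocate
        certain = np.flatnonzero(~uncertain)
        if certain.size > 0 and free_mass > 0:
            best = certain[np.argmax(q_hat[s, certain])]
            pi_new[s, best] += free_mass                   # act greedily only where q_hat is trusted
    return pi_new
```

With a factored model, the counts that trigger bootstrapping can be aggregated over the parameters of the factored representation instead of over individual states, which is what allows the confidence bound to shrink with far fewer samples.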

2002 ◽  
Vol 16 ◽  
pp. 59-104 ◽  
Author(s):  
C. Drummond

This paper discusses a system that accelerates reinforcement learning by using transfer from related tasks. Without such transfer, even if two tasks are very similar at some abstract level, an extensive re-learning effort is required. The system achieves much of its power by transferring parts of previously learned solutions rather than a single complete solution. The system exploits strong features in the multi-dimensional function produced by reinforcement learning in solving a particular task. These features are stable and easy to recognize early in the learning process. They generate a partitioning of the state space and thus the function. The partition is represented as a graph. This is used to index and compose functions stored in a case base to form a close approximation to the solution of the new task. Experiments demonstrate that function composition often produces more than an order of magnitude increase in learning rate compared to a basic reinforcement learning algorithm.
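As a purely illustrative reading of the composition step, the partition can be thought of as an index into a case base of previously learned value functions, which are stitched together to initialize the new task. The helper names below (partition, case_base) are hypothetical and the paper's graph-matching machinery is far richer than this sketch.

```python
# Illustrative only: build an initial value function for a new task by
# composing stored pieces, indexed by a partition of the state space.
def compose_value_function(partition, case_base):
    """partition : maps a state to a region label (hypothetical helper)
       case_base : maps a region label to a stored value function V(s)"""
    def v_init(state):
        region = partition(state)
        piece = case_base.get(region)
        return piece(state) if piece is not None else 0.0  # default when no case matches
    return v_init
```

The composed function then serves as a starting point that ordinary reinforcement learning refines, which is where the reported order-of-magnitude speed-up comes from.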


2012 ◽  
Vol 19 (Special) ◽  
pp. 31-36 ◽  
Author(s):  
Andrzej Rak ◽  
Witold Gierusz

ABSTRACT This paper presents the application of reinforcement learning algorithms to the task of autonomous determination of the ship trajectory during in-harbour and harbour-approach manoeuvres. The authors use the Markov decision process formalism as the background for presenting the algorithms. Two versions of RL algorithms were tested in simulation: a discrete form (Q-learning) and a continuous form (Least-Squares Policy Iteration). The results show that in both cases a ship trajectory can be found. However, the discrete Q-learning algorithm suffers from several limitations (mainly the curse of dimensionality) and is practically not applicable to the examined task. LSPI, on the other hand, gave promising results. To be fully operational, the proposed solution should be extended to take ship heading and velocity into account and should be coupled with an advanced multi-variable controller.
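For reference, the discrete variant mentioned above reduces to the standard tabular Q-learning loop below. The environment interface (reset/step), the state discretisation (e.g. binned cross-track and heading errors), and the hyperparameters are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Minimal tabular Q-learning with epsilon-greedy exploration."""
    q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the discretised rudder commands
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(q[s].argmax())
            s_next, r, done = env.step(a)          # assumed environment interface
            target = r + (0.0 if done else gamma * q[s_next].max())
            q[s, a] += alpha * (target - q[s, a])  # temporal-difference update
            s = s_next
    return q
```

The table grows with the product of the discretisation resolutions, which is exactly the curse of dimensionality the abstract points to; LSPI avoids it by working with a linear function approximation instead of a table.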


Author(s):  
Zongzhang Zhang ◽  
Zhiyuan Pan ◽  
Mykel J. Kochenderfer

Q-learning is a popular reinforcement learning algorithm, but it can perform poorly in stochastic environments due to overestimating action values. Overestimation is due to the use of a single estimator that uses the maximum action value as an approximation for the maximum expected action value. To avoid overestimation in Q-learning, the double Q-learning algorithm was recently proposed, which uses the double estimator method. It uses two estimators from independent sets of experiences, with one estimator determining the maximizing action and the other providing the estimate of its value. Double Q-learning sometimes underestimates the action values. This paper introduces a weighted double Q-learning algorithm, which is based on the construction of the weighted double estimator, with the goal of balancing between the overestimation in the single estimator and the underestimation in the double estimator. Empirically, the new algorithm is shown to perform well on several MDP problems.
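A single update step of the weighted double estimator can be sketched as follows. The two tables and the random choice of which one to update follow the double Q-learning scheme; the interpolation weight beta is kept constant here purely to keep the sketch short, whereas the paper constructs it adaptively.

```python
import numpy as np

def weighted_double_q_update(qa, qb, s, a, r, s_next, done,
                             alpha=0.1, gamma=0.95, beta=0.5, rng=None):
    """One illustrative update of a weighted double estimator.

    qa, qb : two (S, A) tables learned from independent experience
    beta   : weight between the single estimator (beta = 1, plain Q-learning,
             tends to overestimate) and the double estimator (beta = 0,
             double Q-learning, tends to underestimate)
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:            # randomly pick which table to update
        primary, other = qa, qb
    else:
        primary, other = qb, qa
    a_star = int(primary[s_next].argmax())     # maximising action from the primary table
    single = primary[s_next, a_star]           # single-estimator value of that action
    double = other[s_next, a_star]             # double-estimator value of the same action
    backup = 0.0 if done else gamma * (beta * single + (1.0 - beta) * double)
    primary[s, a] += alpha * (r + backup - primary[s, a])
```

Setting beta between 0 and 1 is what lets the weighted estimator trade off the two biases rather than committing to either extreme.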


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 290 ◽  
Author(s):  
SeungYoon Choi ◽  
Tuyen Le ◽  
Quang Nguyen ◽  
Md Layek ◽  
SeungGwan Lee ◽  
...  

In this paper, we propose a controller for a bicycle using the DDPG (Deep Deterministic Policy Gradient) algorithm, a state-of-the-art deep reinforcement learning algorithm. We use a reward function and a deep neural network to build the controller. With the proposed controller, a bicycle can not only be stably balanced but also travel to any specified location. We confirm that the controller with DDPG shows better performance than other baselines such as Normalized Advantage Function (NAF) and Proximal Policy Optimization (PPO). For the performance evaluation, we tested the proposed algorithm in various settings, such as fixed and random speeds, start locations, and destination locations.
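The core DDPG update used by such a controller can be sketched as below. The network sizes, the seven-dimensional bicycle state, the two-dimensional action, and the hyperparameters are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 7, 2   # assumed: roll, roll rate, steering, position, goal offset, ...

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out))

actor, actor_target = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done, gamma=0.99, tau=0.005):
    """One minibatch update; s, a, s_next are (B, dim) tensors, r and done are (B, 1)."""
    # critic: regress Q(s, a) onto the bootstrapped target from the target networks
    with torch.no_grad():
        a_next = torch.tanh(actor_target(s_next))              # deterministic, bounded target action
        q_next = critic_target(torch.cat([s_next, a_next], dim=1))
        y = r + gamma * (1.0 - done) * q_next
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: ascend the critic's estimate of Q(s, actor(s))
    actor_loss = -critic(torch.cat([s, torch.tanh(actor(s))], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging keeps the target networks slowly tracking the learned ones
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```

The deterministic actor and the learned critic are what make DDPG suitable for the continuous steering and balancing actions the task requires.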


Robotica ◽  
2004 ◽  
Vol 22 (1) ◽  
pp. 29-39 ◽  
Author(s):  
Chee-Meng Chew ◽  
Gill A. Pratt

This paper presents two frontal plane algorithms for 3D dynamic bipedal walking. One is based on the notion of symmetry, and the other uses a reinforcement learning algorithm to learn lateral foot placement. The algorithms are combined with a sagittal plane algorithm and successfully applied to a simulated 3D bipedal robot to achieve level-ground walking. The simulation results show that the choice of the local control law for the stance-ankle roll joint can significantly affect the performance of the frontal plane algorithms.


1994 ◽  
Vol 59 (6) ◽  
pp. 1439-1450 ◽  
Author(s):  
Miroslava Žertová ◽  
Jiřina Slaninová ◽  
Zdenko Procházka

An analysis was made of the uterotonic potencies of all analogs having substituted L- or D-tyrosine or -phenylalanine in position 2 and L-arginine, D-arginine or D-homoarginine in position 8. The series of previously published analogs was completed by the solid-phase synthesis of ten new analogs having L- or D-Phe, L- or D-Phe(2-Et), L- or D-Phe(2,4,6-triMe) or D-Tyr(Me) in position 2 and either L- or D-arginine in position 8. All newly synthesized analogs were found to be uterotonic inhibitors. Deamination increases both the agonistic and the antagonistic potency. In the case of the phenylalanine analogs, the change of configuration from L to D in position 2 enhances the uterotonic inhibition by more than one order of magnitude, whereas the L-to-D change in position 8 enhances the inhibitory potency only negligibly. Prolongation of the side chain of the D-basic amino acid in position 8 seems to decrease the inhibitory potency slightly when there is an L-substituted amino acid in position 2. On the other hand, there is a tendency toward increased inhibitory potency when there is a D-substituted amino acid in position 2.


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 471 ◽  
Author(s):  
Jai Hoon Park ◽  
Kang Hoon Lee

Designing novel robots that can cope with a specific task is a challenging problem because of the enormous design space, which involves both morphological structures and control mechanisms. To this end, we present a computational method for automating the design of modular robots. Our method employs a genetic algorithm to evolve robotic structures as an outer optimization, and it applies a reinforcement learning algorithm to each candidate structure to train its behavior and evaluate its potential learning ability as an inner optimization. Compared to evolving both the structure and the behavior simultaneously, the size of the design space is reduced significantly by evolving only the robotic structure and optimizing its behavior with a separate training algorithm. Mutual dependence between evolution and learning is achieved by regarding the mean cumulative reward of a candidate structure under reinforcement learning as its fitness in the genetic algorithm. Our method therefore searches for prospective robotic structures that can potentially lead to near-optimal behaviors if trained sufficiently. We demonstrate the usefulness of our method through several effective design results that were automatically generated in the process of experimenting with an actual modular robotics kit.
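The evolve-then-train loop described above can be sketched as follows. The encoding of a candidate structure and the mutate, crossover, and train_with_rl helpers are placeholders for illustration, not the paper's actual representation; train_with_rl is assumed to return the mean cumulative reward obtained after training the candidate.

```python
import random

def evolve_structures(population, generations, mutate, crossover, train_with_rl):
    """Outer genetic loop over robot structures with an RL-based fitness (illustrative)."""
    for _ in range(generations):
        # inner optimisation: train each candidate's behaviour with RL and use its
        # mean cumulative reward as the fitness for evolution
        scored = [(train_with_rl(structure), structure) for structure in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        elite = [structure for _, structure in scored[:len(scored) // 2]]
        # outer optimisation: recombine and mutate the better half to refill the population
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(len(population) - len(elite))]
        population = elite + children
    return population
```

Because fitness is the reward reached after training rather than the reward of an untrained controller, the search favours structures that are easy to learn with, not merely structures that happen to perform well initially.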

