Critic PI2: Master Continuous Planning via Policy Improvement with Path Integrals and Deep Actor-Critic Reinforcement Learning

We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the Safe Policy Improvement with Baseline Bootstrapping (SPIBB) algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.

An introduction to stochastic control theory, path integrals and reinforcement learning

10.1063/1.2709596 ◽ 2007 ◽ Cited By ~ 33

Author(s): Hilbert J. Kappen

Keyword(s): Reinforcement Learning ◽ Control Theory ◽ Stochastic Control ◽ Path Integrals ◽ Stochastic Control Theory

A Model-Free Reinforcement Learning Approach Using Monte Carlo Method for Production Ramp-Up Policy Improvement - A Copy Exactly Test Case

IFAC Proceedings Volumes ◽ 10.3182/20120523-3-ro-2023.00288 ◽ 2012 ◽ Vol 45 (6) ◽ pp. 1628-1634 ◽ Cited By ~ 5

Author(s): Stefanos C. Doltsinis ◽ Niels Lohse

Keyword(s): Monte Carlo ◽ Reinforcement Learning ◽ Monte Carlo Method ◽ Test Case ◽ Learning Approach ◽ Policy Improvement ◽ Model Free

Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2019/919 ◽ 2019

Author(s): Thiago D. Simão

Keyword(s): Reinforcement Learning ◽ Markov Decision Process ◽ Decision Process ◽ Learning Algorithms ◽ Transition Function ◽ High Confidence ◽ Policy Improvement ◽ Markov Decision ◽ Improvement Method ◽ Better Than

Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy π is already in execution and the experiences with the environment were recorded in a batch D, an RL algorithm can use D to compute a new policy π'. However, the policy computed by traditional RL algorithms might have worse performance compared to π. Our goal is to develop safe RL algorithms, where the agent has a high confidence that the performance of π' is better than the performance of π given D. To develop sample-efficient and safe RL algorithms we combine ideas from exploration strategies in RL with a safe policy improvement method.
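
As an illustration of what a safe policy improvement step over a batch D can look like, here is a minimal tabular sketch in the spirit of baseline bootstrapping. It is not taken from the paper: the function name `bootstrapped_greedy_step`, the threshold `n_min`, and the array layout are all assumptions made for this example. State-action pairs seen fewer than `n_min` times in D keep the baseline policy's probability, and only the remaining probability mass is moved to the best well-observed action.

```python
import numpy as np

def bootstrapped_greedy_step(q, pi_b, counts, n_min):
    """One constrained greedy improvement step (illustrative sketch).

    q, pi_b, counts: arrays of shape (n_states, n_actions) holding the
    estimated action values, the baseline policy, and the number of times
    each state-action pair appears in the batch D.
    n_min: hypothetical count threshold below which a pair is "uncertain".
    """
    n_states, _ = q.shape
    pi_new = np.zeros_like(pi_b, dtype=float)
    uncertain = counts < n_min
    for s in range(n_states):
        # Uncertain actions: copy the baseline probability unchanged.
        pi_new[s, uncertain[s]] = pi_b[s, uncertain[s]]
        trusted = np.flatnonzero(~uncertain[s])
        if trusted.size > 0:
            # Mass the baseline put on well-observed actions is reassigned
            # to the well-observed action with the highest estimated value.
            free_mass = pi_b[s, trusted].sum()
            best = trusted[np.argmax(q[s, trusted])]
            pi_new[s, best] += free_mass
    return pi_new
```

In a state where every action is uncertain, the returned policy is identical to the baseline, which is the conservative behavior the safety argument relies on. A full algorithm would alternate this step with policy evaluation on a model estimated from D.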
