Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Preconditioning and Regularization Enable Faster Reinforcement Learning

Natural policy gradient (NPG) methods, in conjunction with entropy regularization to encourage exploration, are among the most popular policy optimization algorithms in contemporary reinforcement learning. Despite the empirical success, the theoretical underpinnings for NPG methods remain severely limited. In “Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization”, Cen, Cheng, Chen, Wei, and Chi develop nonasymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on tabular discounted Markov decision processes. Assuming access to exact policy evaluation, the authors demonstrate that the algorithm converges linearly at an astonishing rate that is independent of the dimension of the state-action space. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Accommodating a wide range of learning rates, this convergence result highlights the role of preconditioning and regularization in enabling fast convergence.
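To make the setting in this abstract concrete, the following is a minimal sketch of entropy-regularized NPG on a tabular discounted MDP with softmax policies. It assumes the closed-form multiplicative update commonly stated for this setting, pi_new(a|s) ∝ pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta*Q_tau(s,a)/(1-gamma)), with learning rate 0 < eta <= (1-gamma)/tau; that formula, the toy MDP (`P`, `r`), and all function names are assumptions for illustration and are not taken from the abstract. Policy evaluation is done by fixed-point sweeps rather than an exact linear solve.

```python
import math

def soft_q(P, r, pi, gamma, tau, sweeps=300):
    """Soft (entropy-regularized) Q-function of policy pi via fixed-point sweeps."""
    S, A = len(r), len(r[0])
    V = [0.0] * S
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(sweeps):
        # Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        Q = [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
        # V(s) = E_{a ~ pi}[Q(s,a) - tau * log pi(a|s)]
        V = [sum(pi[s][a] * (Q[s][a] - tau * math.log(pi[s][a])) for a in range(A))
             for s in range(S)]
    return Q

def npg_step(pi, Q, eta, gamma, tau):
    """One assumed entropy-regularized NPG update, computed in log space."""
    keep = 1.0 - eta * tau / (1.0 - gamma)  # exponent on the old policy
    new_pi = []
    for probs, qs in zip(pi, Q):
        logits = [keep * math.log(p) + eta * q / (1.0 - gamma)
                  for p, q in zip(probs, qs)]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        new_pi.append([w / total for w in weights])
    return new_pi

def run_npg(P, r, gamma, tau, eta, iters):
    """Start from the uniform policy and apply npg_step repeatedly."""
    S, A = len(r), len(r[0])
    pi = [[1.0 / A] * A for _ in range(S)]
    for _ in range(iters):
        pi = npg_step(pi, soft_q(P, r, pi, gamma, tau), eta, gamma, tau)
    return pi

# Hypothetical 2-state, 2-action MDP: P[s][a][s'] and r[s][a].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.05, 0.95]]]
r = [[1.0, 0.0],
     [0.0, 0.5]]
gamma, tau = 0.8, 0.5
pi = run_npg(P, r, gamma, tau, eta=(1.0 - gamma) / tau, iters=100)
```

With the largest admissible learning rate used here, the exponent on the old policy vanishes and each step reduces to soft policy iteration, so the returned policy should satisfy pi(a|s) ∝ exp(Q_tau(s,a)/tau) at convergence.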

Reinforcement Learning in Linear Quadratic Deep Structured Teams: Global Convergence of Policy Gradient Methods

2020 59th IEEE Conference on Decision and Control (CDC) ◽

10.1109/cdc42340.2020.9304234 ◽

2020 ◽

Author(s):

Vida Fathi ◽

Jalal Arabneydi ◽

Amir G. Aghdam

Keyword(s):

Reinforcement Learning ◽

Global Convergence ◽

Gradient Methods ◽

Linear Quadratic ◽

Policy Gradient

Visual Navigation with Asynchronous Proximal Policy Optimization in Artificial Agents

Journal of Robotics ◽

10.1155/2020/8702962 ◽

2020 ◽

Vol 2020 ◽

pp. 1-7

Author(s):

Fanyu Zeng ◽

Chen Wang

Keyword(s):

Reinforcement Learning ◽

Gradient Descent ◽

Gradient Methods ◽

Visual Navigation ◽

Experimental Results ◽

Artificial Agents ◽

Policy Gradient ◽

Policy Optimization ◽

Navigation Method ◽

Better Than

Vanilla policy gradient methods suffer from high variance, which leads to unstable training in which the policy’s performance fluctuates drastically between iterations. To address this issue, we analyze the policy optimization process of a navigation method based on deep reinforcement learning (DRL) that uses asynchronous gradient descent for optimization. We present a variant, asynchronous proximal policy optimization navigation (appoNav), that can guarantee monotonic policy improvement during policy optimization. Our experiments are conducted in DeepMind Lab, and the results show that artificial agents with appoNav perform better than the compared algorithm.
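For readers unfamiliar with proximal policy optimization, the sketch below shows the clipped surrogate objective that is commonly associated with PPO. The abstract does not say which PPO variant appoNav builds on, so this is an illustration of the general idea only; the function name and the default `eps=0.2` are assumptions.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Average clipped surrogate: mean of min(rho * A, clip(rho, 1-eps, 1+eps) * A).

    ratios[i] is pi_new(a_i|s_i) / pi_old(a_i|s_i); advantages[i] is an
    advantage estimate for the same sample.
    """
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - eps, 1 + eps] in the direction that raises the objective.
        total += min(rho * adv, clipped * adv)
    return total / len(ratios)
```

The clipping keeps each update close to the previous policy, which is the mechanism usually credited for the more stable training that PPO-style methods aim for.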

Global convergence result for conjugate gradient methods

Journal of Optimization Theory and Applications ◽

10.1007/bf00939927 ◽

1991 ◽

Vol 71 (2) ◽

pp. 399-405 ◽

Cited By ~ 80

Author(s):

Y. F. Hu ◽

C. Storey

Keyword(s):

Global Convergence ◽

Conjugate Gradient ◽

Gradient Methods ◽

Convergence Result ◽

Conjugate Gradient Methods

Energy-Efficient Slithering Gait Exploration for a Snake-Like Robot Based on Reinforcement Learning

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/785 ◽

2019 ◽

Cited By ~ 4

Author(s):

Zhenshan Bing ◽

Christian Lemke ◽

Zhuangyi Jiang ◽

Kai Huang ◽

Alois Knoll

Keyword(s):

Reinforcement Learning ◽

Energy Efficient ◽

Degrees Of Freedom ◽

Bayesian Optimization ◽

Control Task ◽

Flexible Bodies ◽

Model Free ◽

Novel Approach ◽

Wide Range ◽

Policy Optimization

Similar to their counterparts in nature, the flexible bodies of snake-like robots enhance their movement capability and adaptability in diverse environments. However, this flexibility corresponds to a complex control task involving highly redundant degrees of freedom, where traditional model-based methods usually fail to propel the robots energy-efficiently. In this work, we present a novel approach for designing an energy-efficient slithering gait for a snake-like robot using a model-free reinforcement learning (RL) algorithm. Specifically, we present an RL-based controller for generating locomotion gaits at a wide range of velocities, which is trained using the proximal policy optimization (PPO) algorithm. Meanwhile, a traditional parameterized gait controller is presented and the parameter sets are optimized using the grid search and Bayesian optimization algorithms for the purposes of reasonable comparisons. Based on the analysis of the simulation results, we demonstrate that this RL-based controller exhibits very natural and adaptive movements, which are also substantially more energy-efficient than the gaits generated by the parameterized controller. Videos are shown at https://videoviewsite.wixsite.com/rlsnake .

Diagnostic Evaluation of Policy-Gradient-Based Ranking

Electronics ◽

10.3390/electronics11010037 ◽

2021 ◽

Vol 11 (1) ◽

pp. 37

Author(s):

Hai-Tao Yu ◽

Degen Huang ◽

Fuji Ren ◽

Lishuang Li

Keyword(s):

Reinforcement Learning ◽

Learning To Rank ◽

Careful Examination ◽

Ranking Methods ◽

Adversarial Learning ◽

Wide Range ◽

Depth Analysis ◽

Policy Gradient ◽

Gradient Based ◽

The Impact

Learning-to-rank has been intensively studied and has shown steadily increasing value in a wide range of domains, such as web search, recommender systems, dialogue systems, machine translation, and even computational biology, to name a few. In light of recent advances in neural networks, there has been a strong and continuing interest in exploring how to deploy popular techniques, such as reinforcement learning and adversarial learning, to solve ranking problems. However, armed with these popular techniques, most studies tend to show only how effective a new method is. A comprehensive comparison between techniques and an in-depth analysis of their deficiencies are often overlooked. This paper is motivated by the observation that recent ranking methods based on either reinforcement learning or adversarial learning boil down to policy-gradient-based optimization. Based on the widely used benchmark collections with complete information (where relevance labels are known for all items), such as MSLRWEB30K and Yahoo-Set1, we thoroughly investigate the extent to which policy-gradient-based ranking methods are effective. On one hand, we analytically identify the pitfalls of policy-gradient-based ranking. On the other hand, we experimentally compare a wide range of representative methods. The experimental results echo our analysis and show that policy-gradient-based ranking methods are, by a large margin, inferior to many conventional ranking methods. Regardless of whether we use reinforcement learning or adversarial learning, the failures are largely attributable to the gradient estimation based on sampled rankings, which significantly diverge from ideal rankings. In particular, the larger the number of documents per query and the more fine-grained the ground-truth labels, the more policy-gradient-based ranking suffers. Careful examination of this weakness is highly recommended for developing enhanced methods based on policy gradient.

Quadrotor Motion Control Using Deep Reinforcement Learning

Journal of Unmanned Vehicle Systems ◽

10.1139/juvs-2021-0010 ◽

2021 ◽

Author(s):

Zifei Jiang ◽

Alan F. Lynch

Keyword(s):

Reinforcement Learning ◽

Neural Nets ◽

Neural Net ◽

Reward Function ◽

Model Free ◽

Policy Gradient ◽

Aerial Vehicle ◽

Stochastic Controller ◽

Policy Optimization ◽

Gradient Approach

We present a deep neural net-based controller trained by a model-free reinforcement learning (RL) algorithm to achieve hover stabilization for a quadrotor unmanned aerial vehicle (UAV). With RL, two neural nets are trained. One neural net is used as a stochastic controller which gives the distribution of control inputs. The other maps the UAV state to a scalar which estimates the reward of the controller. A proximal policy optimization (PPO) method, which is an actor-critic policy gradient approach, is used to train the neural nets. Simulation results show that the trained controller achieves a level of performance comparable to that of a manually tuned PID controller, despite not depending on any model information. The paper considers different choices of reward function and their influence on controller performance.

Reinforcement Learning Enables Field-Development Policy Optimization

Journal of Petroleum Technology ◽

10.2118/0921-0046-jpt ◽

2021 ◽

Vol 73 (09) ◽

pp. 46-47

Author(s):

Chris Carpenter

Keyword(s):

Dynamic Programming ◽

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Development Policy ◽

The State ◽

Field Development ◽

Markov Decision ◽

Policy Optimization ◽

Sequential Nature

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 201254, “Reinforcement Learning for Field-Development Policy Optimization,” by Giorgio De Paola, SPE, and Cristina Ibanez-Llano, Repsol, and Jesus Rios, IBM, et al., prepared for the 2020 SPE Annual Technical Conference and Exhibition, originally scheduled to be held in Denver, Colorado, 5–7 October. The paper has not been peer reviewed.

A field-development plan consists of a sequence of decisions. Each action taken affects the reservoir and conditions any future decision. The presence of uncertainty associated with this process, however, is undeniable. The novelty of the approach proposed by the authors in the complete paper is the consideration of the sequential nature of the decisions through the framework of dynamic programming (DP) and reinforcement learning (RL). This methodology allows moving the focus from a static field-development plan optimization to a more-dynamic framework that the authors call field-development policy optimization. This synopsis focuses on the methodology, while the complete paper also contains a real-field case of application of the methodology.

Methodology

Deep RL (DRL). RL is considered an important learning paradigm in artificial intelligence (AI) but differs from supervised or unsupervised learning, the most commonly known types currently studied in the field of machine learning. During the last decade, RL has attracted greater attention because of success obtained in applications related to games and self-driving cars resulting from its combination with deep-learning architectures, known as DRL, which has allowed RL to scale to previously unsolvable problems and, therefore, solve much larger sequential decision problems. RL, also referred to as stochastic approximate dynamic programming, is a goal-directed sequential-learning-from-interaction paradigm. The learner or agent is not told what to do but instead has to learn which actions or decisions yield a maximum reward through interaction with an uncertain environment without losing too much reward along the way. This way of learning from interaction to achieve a goal requires balancing the exploration and exploitation of possible actions. Another key characteristic of this type of problem is its sequential nature, where the actions taken by the agent affect the environment itself and, therefore, the subsequent data it receives and the subsequent actions to be taken. Mathematically, such problems are formulated in the framework of the Markov decision process (MDP) that primarily arises in the field of optimal control.

An RL problem consists of two principal parts: the agent, or decision-making engine, and the environment, the interactive world for an agent (in this case, the reservoir). Sequentially, at each timestep, the agent takes an action (e.g., changing control rates or deciding a well location) that makes the environment (reservoir) transition from one state to another. Next, the agent receives a reward (e.g., a cash flow) and an observation of the state of the environment (partial or total) before taking the next action. All relevant information informing the agent of the state of the system is assumed to be included in the last state observed by the agent (Markov property). If the agent observes the full environment state once it has acted, the MDP is said to be fully observable; otherwise, a partially observable Markov decision process (POMDP) results. The agent’s objective is to learn a policy mapping from states (MDPs) or histories (POMDPs) to actions such that the agent’s cumulated (discounted) reward in the long run is maximized.
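The agent-environment loop and the discounted cumulative reward described above can be sketched as follows for the fully observable case. The two-state dynamics in `toy_step` are hypothetical and stand in for the reservoir simulator; they are not taken from the paper, and the function names are illustrative.

```python
def rollout(policy, step, state, horizon, gamma):
    """Run the agent-environment loop and accumulate discounted reward."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]               # policy maps state -> action
        state, reward = step(state, action)  # environment transitions, emits reward
        total += discount * reward
        discount *= gamma
    return total

def toy_step(state, action):
    """Hypothetical dynamics: action 1 flips the state; reward equals the new state."""
    next_state = state ^ action
    return next_state, float(next_state)

# A policy that moves to state 1 and stays there.
go_to_one = {0: 1, 1: 0}
value = rollout(go_to_one, toy_step, state=0, horizon=3, gamma=0.5)  # 1 + 0.5 + 0.25
```

In a POMDP the policy would take the observation history instead of the last state, which this sketch does not cover.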

Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail

PLoS Computational Biology ◽

10.1371/journal.pcbi.1000586 ◽

2009 ◽

Vol 5 (12) ◽

pp. e1000586 ◽

Cited By ~ 55

Author(s):

Eleni Vasilaki ◽

Nicolas Frémaux ◽

Robert Urbanczik ◽

Walter Senn ◽

Wulfram Gerstner

Keyword(s):

Reinforcement Learning ◽

Gradient Methods ◽

Action Space ◽

Continuous State ◽

Policy Gradient

Multi-objective Reinforcement Learning through Continuous Pareto Manifold Approximation

Journal of Artificial Intelligence Research ◽

10.1613/jair.4961 ◽

2016 ◽

Vol 57 ◽

pp. 187-227 ◽

Cited By ~ 7

Author(s):

Simone Parisi ◽

Matteo Pirotta ◽

Marcello Restelli

Keyword(s):

Reinforcement Learning ◽

Pareto Frontier ◽

Water Reservoir ◽

Continuous Approximation ◽

Linear Quadratic ◽

Multi Objective ◽

Gradient Algorithms ◽

Conflicting Objectives ◽

Policy Gradient ◽

Markov Decision

Many real-world control applications, from economics to robotics, are characterized by the presence of multiple conflicting objectives. In these problems, the standard concept of optimality is replaced by Pareto-optimality and the goal is to find the Pareto frontier, a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, achieving an accurate representation of the Pareto frontier is still an important challenge. In this paper, we propose a reinforcement learning policy gradient approach to learn a continuous approximation of the Pareto frontier in multi-objective Markov Decision Problems (MOMDPs). Unlike previous policy gradient algorithms, where n optimization routines are executed to obtain n solutions, our approach performs a single gradient ascent run, generating at each step an improved continuous approximation of the Pareto frontier. The idea is to optimize the parameters of a function defining a manifold in the policy parameter space, so that the corresponding image in the objective space gets as close as possible to the true Pareto frontier. Besides deriving how to compute and estimate such a gradient, we also discuss the non-trivial issue of defining a metric to assess the quality of the candidate Pareto frontiers. Finally, the properties of the proposed approach are empirically evaluated on two problems, a linear-quadratic Gaussian regulator and a water reservoir control task.
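The notion of Pareto-optimality used in this abstract can be illustrated with a small dominance filter over a finite set of candidate solutions. This is only a sketch of the definition, assuming every objective is to be maximized; it is not the continuous manifold method the paper proposes, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Each tuple is the return of one candidate policy on two objectives.
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(candidates)  # [(1, 5), (2, 4), (3, 3)]
```

Each point on the resulting front is a different compromise between the two objectives; none can be improved on one objective without losing on the other.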

Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/821 ◽

2019 ◽

Author(s):

Muhammad Masood ◽

Finale Doshi-Velez

Keyword(s):

Reinforcement Learning ◽

Optimization Technique ◽

Gradient Methods ◽

Domain Expert ◽

Learning Methods ◽

Maximum Mean Discrepancy ◽

Optimal Policies ◽

Policy Gradient ◽

Gradient Based ◽

The Difference

Standard reinforcement learning methods aim to master one way of solving a task, whereas there may exist multiple near-optimal policies. Being able to identify this collection of near-optimal policies can allow a domain expert to efficiently explore the space of reasonable solutions. Unfortunately, existing approaches that quantify uncertainty over policies are not ultimately relevant to finding policies with qualitatively distinct behaviors. In this work, we formalize the difference between policies as a difference between the distributions of trajectories induced by each policy, which encourages diversity with respect to both state visitation and action choices. We derive a gradient-based optimization technique that can be combined with existing policy gradient methods to identify diverse collections of well-performing policies. We demonstrate our approach on benchmarks and a healthcare task.
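The maximum mean discrepancy named in the title can be estimated from samples. The sketch below computes the standard biased estimate of squared MMD with a Gaussian kernel between two sets of fixed-length feature vectors. How trajectories are turned into such vectors, the kernel choice, and the bandwidth are all assumptions here; the abstract does not specify them.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_k(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Hypothetical trajectory features from two policies.
policy_a = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]
policy_b = [(3.0, 4.0), (3.1, 3.9), (2.9, 4.1)]
gap = mmd_squared(policy_a, policy_b)
```

A value near zero indicates that the two sample sets look alike under the kernel, and larger values indicate more distinct behavior, which is the quantity a diversity-seeking objective would try to increase.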
