Hindsight Trust Region Policy Optimization

Author(s):  
Hanbo Zhang ◽  
Site Bai ◽  
Xuguang Lan ◽  
David Hsu ◽  
Nanning Zheng

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to those of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
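
As a rough illustration of the hindsight idea described above, the sketch below relabels a failed goal-conditioned trajectory with a goal that was actually achieved and recomputes its sparse rewards. The transition layout and the reward function are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn):
    """Relabel a goal-conditioned trajectory with an achieved goal (hindsight-style).

    trajectory: list of (state, action, achieved_goal, desired_goal) tuples.
    reward_fn(achieved_goal, goal) -> sparse reward (e.g. 0.0 on success, -1.0 otherwise).
    Returns (state, action, goal, reward) tuples whose goal is the goal actually
    reached at the end of the episode, so the episode becomes a "success" in hindsight.
    """
    final_achieved = trajectory[-1][2]          # substitute goal: what was actually reached
    relabeled = []
    for state, action, achieved, _desired in trajectory:
        reward = reward_fn(achieved, final_achieved)
        relabeled.append((state, action, final_achieved, reward))
    return relabeled

# Example sparse reward: success if the achieved goal lies within a small tolerance of the goal.
sparse_reward = lambda ag, g: 0.0 if np.linalg.norm(np.asarray(ag) - np.asarray(g)) < 0.05 else -1.0
```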

Author(s):  
Jincheng Mei ◽  
Chenjun Xiao ◽  
Ruitong Huang ◽  
Dale Schuurmans ◽  
Martin Müller

In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while assuring monotonic progress in a principled objective. ECPO conducts maximum entropy exploration within a mirror descent framework, but updates policies using a reversed KL projection. This formulation bypasses undesirable mode-seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods on a set of benchmark tasks.
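
The direction of the KL projection matters because KL divergence is asymmetric. The small numpy example below (an illustration of that asymmetry, not the ECPO update itself) compares the two directions for a bimodal target p and a unimodal approximation q that covers only one mode.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Bimodal target p and a unimodal approximation q concentrated on one of p's modes.
p = np.array([0.49, 0.01, 0.01, 0.49])
q = np.array([0.88, 0.10, 0.01, 0.01])

print("KL(q || p) =", kl(q, p))   # moderate: q only needs to sit inside one mode of p
print("KL(p || q) =", kl(p, q))   # large: p places mass where q is nearly zero
```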


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 290 ◽  
Author(s):  
SeungYoon Choi ◽  
Tuyen Le ◽  
Quang Nguyen ◽  
Md Layek ◽  
SeungGwan Lee ◽  
...  

In this paper, we propose a controller for a bicycle using the DDPG (Deep Deterministic Policy Gradient) algorithm, a state-of-the-art deep reinforcement learning algorithm. We use a reward function and a deep neural network to build the controller. With the proposed controller, a bicycle can not only be stably balanced but also travel to any specified location. We confirm that the DDPG-based controller performs better than other baselines such as Normalized Advantage Function (NAF) and Proximal Policy Optimization (PPO). For the performance evaluation, we evaluated the proposed algorithm in various settings, such as fixed and random speeds, start locations, and destination locations.
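
For readers unfamiliar with DDPG, the PyTorch sketch below shows one generic DDPG update step: a critic regression toward a bootstrapped target, a deterministic policy gradient through the critic, and a soft target-network update. Network sizes, hyperparameters, and the replay-buffer interface are assumptions for illustration, not the authors' bicycle controller.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim, gamma, tau = 8, 2, 0.99, 0.005

actor,  actor_target  = mlp(obs_dim, act_dim), mlp(obs_dim, act_dim)
critic, critic_target = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    # batch: tensors of shape (batch_size, dim); rew and done are (batch_size, 1).
    obs, act, rew, next_obs, done = batch

    # Critic: regress Q(s, a) toward the TD target built from the target networks.
    with torch.no_grad():
        next_q = critic_target(torch.cat([next_obs, actor_target(next_obs)], dim=-1))
        target = rew + gamma * (1.0 - done) * next_q
    critic_loss = nn.functional.mse_loss(critic(torch.cat([obs, act], dim=-1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q(s, mu(s)).
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak (soft) update of both target networks.
    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```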


2022 ◽  
pp. 1-12
Author(s):  
Shuailong Li ◽  
Wei Zhang ◽  
Huiwen Zhang ◽  
Xin Zhang ◽  
Yuquan Leng

Model-free reinforcement learning methods have been applied successfully to practical decision-making problems such as Atari games. However, these methods have inherent shortcomings, such as high variance and low sample efficiency. To improve the policy performance and sample efficiency of model-free reinforcement learning, we propose proximal policy optimization with model-based methods (PPOMM), a fusion of model-based and model-free reinforcement learning. PPOMM considers not only the information of past experience but also predictive information about the future state. PPOMM adds information about the next state to the objective function of the proximal policy optimization (PPO) algorithm through a model-based method. The method uses two components to optimize the policy: the PPO error and the error of model-based reinforcement learning. We use the latter to optimize a latent transition model and predict information about the next state. When evaluated across 49 Atari games in the Arcade Learning Environment (ALE), PPOMM outperforms the state-of-the-art PPO algorithm in most games, performing as well as or better than the original algorithm in 33 games.
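
The sketch below shows one plausible way to combine a clipped PPO surrogate with an auxiliary next-state prediction loss from a learned transition model, in the spirit of the combined objective described above. The weighting coefficient, the latent representation, and the model interface are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

clip_eps, model_coef = 0.2, 0.5   # assumed hyperparameters for illustration

def ppo_with_model_loss(new_logp, old_logp, adv,
                        transition_model, latent, action, next_latent):
    """Combined objective: clipped PPO surrogate plus a transition-model error term."""
    # Standard PPO clipped surrogate (to be maximized, so it is negated as a loss).
    ratio = torch.exp(new_logp - old_logp)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    policy_loss = -surrogate.mean()

    # Model-based term: predict the next latent state and penalize the prediction error.
    pred_next = transition_model(torch.cat([latent, action], dim=-1))
    model_loss = nn.functional.mse_loss(pred_next, next_latent)

    return policy_loss + model_coef * model_loss
```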


2020 ◽  
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper, we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which different rollout steps of the analytical gradients computed through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms the other baselines.
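
To illustrate the value-gradient idea, the sketch below backpropagates the discounted return of a short rollout of a learned, differentiable dynamics model into the policy parameters. The model architecture, reward function, and rollout length are illustrative assumptions and do not reproduce the paper's infinite-horizon estimators.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, horizon, gamma = 4, 2, 5, 0.99

policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
dynamics = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(),
                         nn.Linear(32, obs_dim))            # learned transition model (assumed trained)
reward_fn = lambda s, a: -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))   # assumed differentiable reward

def deterministic_value_gradient(s0):
    """Backpropagate the discounted return of a model rollout into the policy parameters."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = policy(s)                                   # deterministic action
        total = total + (gamma ** t) * reward_fn(s, a)  # differentiable reward along the rollout
        s = dynamics(torch.cat([s, a], dim=-1))         # differentiable model step
    value = total.mean()
    grads = torch.autograd.grad(value, list(policy.parameters()))
    return value, grads

value, grads = deterministic_value_gradient(torch.randn(8, obs_dim))
```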


Author(s):  
Céline Hocquette ◽  
Stephen H. Muggleton

Predicate invention in Meta-Interpretive Learning (MIL) is generally based on a top-down approach, in which the search for a consistent hypothesis starts from the positive examples as goals. We consider augmenting top-down MIL systems with a bottom-up step during which the background knowledge is generalised with an extension of the immediate consequence operator for second-order logic programs. This new method provides a way to perform extensive predicate invention useful for feature discovery. We demonstrate that this method is complete with respect to a fragment of dyadic datalog. We theoretically prove that this method reduces the number of clauses to be learned by the top-down learner, which in turn can reduce the sample complexity. We formalise an equivalence relation on predicates, which is used to eliminate redundant predicates. Our experimental results suggest that pairing the state-of-the-art MIL system Metagol with an initial bottom-up step can significantly improve learning performance.
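
As a toy illustration of bottom-up evaluation with an immediate consequence operator, the Python sketch below derives new dyadic facts from chain rules over ground facts and iterates to a fixpoint. The grandparent example and the restriction to one rule shape are assumptions; the paper's setting involves second-order programs and is not reproduced here.

```python
def apply_chain_rule(head_pred, body_pred1, body_pred2, facts):
    """Derive head(X, Z) :- body1(X, Y), body2(Y, Z) over a set of dyadic ground facts."""
    derived = set()
    for (p1, x, y1) in facts:
        for (p2, y2, z) in facts:
            if p1 == body_pred1 and p2 == body_pred2 and y1 == y2:
                derived.add((head_pred, x, z))
    return derived

def fixpoint(rules, facts):
    """Iterate the immediate consequence operator until no new facts are produced."""
    facts = set(facts)
    while True:
        new = set()
        for head, b1, b2 in rules:
            new |= apply_chain_rule(head, b1, b2, facts)
        if new <= facts:
            return facts
        facts |= new

facts = {("parent", "ann", "bob"), ("parent", "bob", "carl")}
rules = [("grandparent", "parent", "parent")]   # grandparent(X,Z) :- parent(X,Y), parent(Y,Z)
print(fixpoint(rules, facts))
```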


2020 ◽  
Vol 2020 ◽  
pp. 1-14
Author(s):  
Zhan Wang ◽  
Pengyuan Li ◽  
Xiangrong Li ◽  
Hongtruong Pham

Conjugate gradient methods are well-known methods that are widely applied in many practical fields. The CD conjugate gradient method is one of the classical variants. In this paper, a modified three-term CD conjugate gradient algorithm is proposed. Its main features are as follows: (i) a modified three-term CD conjugate gradient formula is presented; (ii) the algorithm possesses the sufficient descent property and the trust region property; (iii) the algorithm achieves global convergence for general functions under the modified weak Wolfe–Powell (MWWP) line search technique together with a projection technique. The new algorithm performs well in numerical experiments, which show that the modified three-term CD conjugate gradient method is more competitive than the classical CD conjugate gradient method.
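
For reference, the numpy sketch below implements the classical (two-term) CD conjugate gradient iteration with a backtracking Armijo line search and a steepest-descent safeguard. The paper's modified three-term formula and MWWP line search are not reproduced; the test function and constants are assumptions.

```python
import numpy as np

def cd_conjugate_gradient(f, grad, x0, iters=200, tol=1e-6):
    """Classical CD method: d_{k+1} = -g_{k+1} + beta_k * d_k,
    with beta_k = ||g_{k+1}||^2 / (-d_k^T g_k)."""
    x = np.asarray(x0, float)
    g = grad(x)
    d = -g
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        # Backtracking Armijo line search along the descent direction d.
        alpha, c = 1.0, 1e-4
        while f(x + alpha * d) > f(x) + c * alpha * g.dot(d) and alpha > 1e-12:
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = g_new.dot(g_new) / (-d.dot(g))   # CD formula
        d = -g_new + beta * d
        if d.dot(g_new) >= 0:                   # safeguard: restart with steepest descent
            d = -g_new
        x, g = x_new, g_new
    return x

quad = lambda x: 0.5 * x.dot(np.diag([1.0, 10.0])).dot(x)
quad_grad = lambda x: np.diag([1.0, 10.0]).dot(x)
print(cd_conjugate_gradient(quad, quad_grad, np.array([3.0, -2.0])))
```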


Author(s):  
Ran Ji ◽  
Miguel A. Lejeune

We investigate a class of fractional distributionally robust optimization problems with uncertain probabilities. They consist of maximizing ambiguous fractional functions representing reward-risk ratios, and they admit a semi-infinite programming epigraphic formulation. We derive a new, fully parameterized closed-form expression to compute a new bound on the size of the Wasserstein ambiguity ball. We design a data-driven reformulation and solution framework. The reformulation phase involves the derivation of the support function of the ambiguity set and the concave conjugate of the ratio function. We design modular bisection algorithms that enjoy the finite convergence property. This class of problems has wide applicability in finance, and we specify new ambiguous portfolio optimization models for the Sharpe and Omega ratios. The computational study shows the applicability and scalability of the framework, which quickly solves large, industry-relevant instances that cannot be solved within one day by state-of-the-art mixed-integer nonlinear programming (MINLP) solvers.
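
The bisection idea can be illustrated on a much simpler, non-robust version of the problem: maximizing a reward-risk ratio over a finite candidate set by bisecting on the ratio level λ, using the equivalence "ratio ≥ λ is attainable iff max reward − λ·risk ≥ 0" (valid when risk is positive). The synthetic data and candidate set below are assumptions; the Wasserstein-ambiguous formulation is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(4), size=200)              # candidate portfolio weights (assumed)
scenario_returns = rng.normal(0.05, 0.15, size=(500, 4))      # synthetic return scenarios (assumed)

def reward(w):   # expected portfolio return
    return float(scenario_returns.dot(w).mean())

def risk(w):     # portfolio return standard deviation (strictly positive here)
    return float(scenario_returns.dot(w).std() + 1e-12)

def max_ratio_by_bisection(lo=0.0, hi=10.0, tol=1e-6):
    """Bisect on lambda: some w attains reward/risk >= lambda iff max_w reward(w) - lambda*risk(w) >= 0."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        best = max(reward(w) - lam * risk(w) for w in candidates)
        if best >= 0:
            lo = lam    # level lambda is attainable, search higher
        else:
            hi = lam    # not attainable, search lower
    return lo

print("approximate maximal Sharpe-like ratio:", max_ratio_by_bisection())
```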


Author(s):  
Gonglin Yuan ◽  
Tingting Li ◽  
Wujie Hu

To solve large-scale unconstrained optimization problems, a modified PRP conjugate gradient algorithm is proposed. It is of interest because it combines the steepest descent algorithm with the conjugate gradient method and fully exploits their excellent properties. For smooth functions, the algorithm uses information from the gradient and the previous direction to determine the next search direction. For nonsmooth functions, a Moreau–Yosida regularization is introduced into the proposed algorithm, which simplifies the treatment of complex problems. The proposed algorithm has the following characteristics: (i) a sufficient descent property as well as a trust region property; (ii) global convergence; (iii) numerical results on large-scale smooth and nonsmooth functions show that the proposed algorithm outperforms similar optimization methods; (iv) experiments on image restoration problems demonstrate that the given algorithm is successful.
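
A minimal sketch of the kind of hybrid described above is the standard PRP direction with a steepest-descent fallback (the common PRP+ safeguard): when the conjugacy term is unhelpful, the update degenerates to steepest descent. The paper's specific modified formula and its Moreau–Yosida treatment of nonsmooth problems are not reproduced; the example data are assumptions.

```python
import numpy as np

def prp_plus_direction(g_new, g_old, d_old):
    """Next search direction with the PRP+ rule:
    beta = max(0, g_new^T (g_new - g_old) / ||g_old||^2); beta = 0 recovers steepest descent."""
    beta = max(0.0, g_new.dot(g_new - g_old) / g_old.dot(g_old))
    return -g_new + beta * d_old

# Example: one direction update on f(x) = 0.5 * x^T diag(1, 10) x.
A = np.diag([1.0, 10.0])
x_old, x_new = np.array([3.0, -2.0]), np.array([2.0, -0.5])
g_old, g_new = A.dot(x_old), A.dot(x_new)
print(prp_plus_direction(g_new, g_old, -g_old))
```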


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Haishan Feng ◽  
Tingting Li

Combining the three-term conjugate gradient method of Yuan and Zhang, the acceleration step length of Andrei, and the hyperplane projection method of Solodov and Svaiter, we propose an accelerated conjugate gradient algorithm for solving nonlinear monotone equations in this paper. The presented algorithm has the following properties: (i) all search directions generated by the algorithm satisfy the sufficient descent and trust region properties, independently of the line search technique; (ii) a derivative-free search technique along the direction is proposed to obtain the step length αk; (iii) if φk = −αk(hk − h(wk))^T dk > 0, then an acceleration scheme is used to modify the step length in a multiplicative manner and generate a trial point; (iv) if this point satisfies the given condition, it is taken as the next iterate; otherwise, the hyperplane projection technique is used to obtain the next iterate; (v) the global convergence of the proposed algorithm is established under suitable conditions. Numerical comparisons with other conjugate gradient algorithms show that the accelerated computing scheme is more competitive. In addition, the presented algorithm can also be applied to image restoration.
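
A compact numpy sketch of the Solodov–Svaiter hyperplane projection framework referenced above is given below, using a plain residual direction d = −h(x) rather than the paper's three-term conjugate gradient direction; the test mapping and line search constants are assumptions.

```python
import numpy as np

def solve_monotone(h, x0, sigma=1e-4, beta=0.5, iters=500, tol=1e-8):
    """Hyperplane projection method (Solodov–Svaiter style) for a monotone equation h(x) = 0."""
    x = np.asarray(x0, float)
    for _ in range(iters):
        hx = h(x)
        if np.linalg.norm(hx) < tol:
            break
        d = -hx
        # Derivative-free line search: shrink alpha until -h(w)^T d >= sigma * alpha * ||d||^2.
        alpha = 1.0
        while True:
            w = x + alpha * d
            if -h(w).dot(d) >= sigma * alpha * d.dot(d) or alpha < 1e-12:
                break
            alpha *= beta
        hw = h(w)
        if np.linalg.norm(hw) < tol:     # w already solves the equation
            return w
        # Project x onto the separating hyperplane {z : h(w)^T (z - w) = 0}.
        x = x - (hw.dot(x - w) / hw.dot(hw)) * hw
    return x

# Example: a monotone mapping h(x) = A x + x^3 (componentwise) with A positive definite.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
h = lambda x: A.dot(x) + x**3
print(solve_monotone(h, np.array([1.5, -2.0])))
```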

