Hindsight Trust Region Policy Optimization

Author(s):  
Hanbo Zhang ◽  
Site Bai ◽  
Xuguang Lan ◽  
David Hsu ◽  
Nanning Zheng

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to those of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
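
As a rough illustration of the hindsight idea described above, the sketch below relabels a failed goal-conditioned trajectory with a goal that was actually achieved and recomputes its sparse rewards. The transition layout and the reward function are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn):
    """Relabel a goal-conditioned trajectory with an achieved goal (hindsight-style).

    trajectory: list of (state, action, achieved_goal, desired_goal) tuples.
    reward_fn(achieved_goal, goal) -> sparse reward (e.g. 0.0 on success, -1.0 otherwise).
    Returns (state, action, goal, reward) tuples whose goal is the goal actually
    reached at the end of the episode, so the episode becomes a "success" in hindsight.
    """
    final_achieved = trajectory[-1][2]          # substitute goal: what was actually reached
    relabeled = []
    for state, action, achieved, _desired in trajectory:
        reward = reward_fn(achieved, final_achieved)
        relabeled.append((state, action, final_achieved, reward))
    return relabeled

# Example sparse reward: success if the achieved goal lies within a small tolerance of the goal.
sparse_reward = lambda ag, g: 0.0 if np.linalg.norm(np.asarray(ag) - np.asarray(g)) < 0.05 else -1.0
```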

Author(s):  
Jincheng Mei ◽  
Chenjun Xiao ◽  
Ruitong Huang ◽  
Dale Schuurmans ◽  
Martin Müller

In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while assuring monotonic progress in a principled objective. ECPO conducts maximum entropy exploration within a mirror descent framework, but updates policies using a reversed KL projection. This formulation bypasses undesirable mode-seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods on a set of benchmark tasks.
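
The direction of the KL projection matters because KL divergence is asymmetric. The small numpy example below (an illustration of that asymmetry, not the ECPO update itself) compares the two directions for a bimodal target p and a unimodal approximation q that covers only one mode.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Bimodal target p and a unimodal approximation q concentrated on one of p's modes.
p = np.array([0.49, 0.01, 0.01, 0.49])
q = np.array([0.88, 0.10, 0.01, 0.01])

print("KL(q || p) =", kl(q, p))   # moderate: q only needs to sit inside one mode of p
print("KL(p || q) =", kl(p, q))   # large: p places mass where q is nearly zero
```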


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 290 ◽  
Author(s):  
SeungYoon Choi ◽  
Tuyen Le ◽  
Quang Nguyen ◽  
Md Layek ◽  
SeungGwan Lee ◽  
...  

In this paper, we propose a controller for a bicycle using the DDPG (Deep Deterministic Policy Gradient) algorithm, a state-of-the-art deep reinforcement learning algorithm. We use a reward function and a deep neural network to build the controller. With the proposed controller, a bicycle can not only be stably balanced but also travel to any specified location. We confirm that the DDPG-based controller performs better than other baselines such as Normalized Advantage Function (NAF) and Proximal Policy Optimization (PPO). For the performance evaluation, we evaluated the proposed algorithm in various settings, such as fixed and random speeds, start locations, and destination locations.
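
For readers unfamiliar with DDPG, the PyTorch sketch below shows one generic DDPG update step: a critic regression toward a bootstrapped target, a deterministic policy gradient through the critic, and a soft target-network update. Network sizes, hyperparameters, and the replay-buffer interface are assumptions for illustration, not the authors' bicycle controller.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim, gamma, tau = 8, 2, 0.99, 0.005

actor,  actor_target  = mlp(obs_dim, act_dim), mlp(obs_dim, act_dim)
critic, critic_target = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    # batch: tensors of shape (batch_size, dim); rew and done are (batch_size, 1).
    obs, act, rew, next_obs, done = batch

    # Critic: regress Q(s, a) toward the TD target built from the target networks.
    with torch.no_grad():
        next_q = critic_target(torch.cat([next_obs, actor_target(next_obs)], dim=-1))
        target = rew + gamma * (1.0 - done) * next_q
    critic_loss = nn.functional.mse_loss(critic(torch.cat([obs, act], dim=-1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q(s, mu(s)).
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak (soft) update of both target networks.
    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```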


2022 ◽  
pp. 1-12
Author(s):  
Shuailong Li ◽  
Wei Zhang ◽  
Huiwen Zhang ◽  
Xin Zhang ◽  
Yuquan Leng

Model-free reinforcement learning methods have been applied successfully to practical decision-making problems such as Atari games. However, these methods have inherent shortcomings, such as high variance and low sample efficiency. To improve the policy performance and sample efficiency of model-free reinforcement learning, we propose proximal policy optimization with model-based methods (PPOMM), a fusion of model-based and model-free reinforcement learning. PPOMM considers not only the information of past experience but also predictive information about the future state. PPOMM adds information about the next state to the objective function of the proximal policy optimization (PPO) algorithm through a model-based method. The method uses two components to optimize the policy: the PPO error and the error of model-based reinforcement learning. We use the latter to optimize a latent transition model and predict information about the next state. When evaluated across 49 Atari games in the Arcade Learning Environment (ALE), PPOMM outperforms the state-of-the-art PPO algorithm in most games, performing as well as or better than the original algorithm in 33 games.
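
The sketch below shows one plausible way to combine a clipped PPO surrogate with an auxiliary next-state prediction loss from a learned transition model, in the spirit of the combined objective described above. The weighting coefficient, the latent representation, and the model interface are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

clip_eps, model_coef = 0.2, 0.5   # assumed hyperparameters for illustration

def ppo_with_model_loss(new_logp, old_logp, adv,
                        transition_model, latent, action, next_latent):
    """Combined objective: clipped PPO surrogate plus a transition-model error term."""
    # Standard PPO clipped surrogate (to be maximized, so it is negated as a loss).
    ratio = torch.exp(new_logp - old_logp)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    policy_loss = -surrogate.mean()

    # Model-based term: predict the next latent state and penalize the prediction error.
    pred_next = transition_model(torch.cat([latent, action], dim=-1))
    model_loss = nn.functional.mse_loss(pred_next, next_latent)

    return policy_loss + model_coef * model_loss
```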


2020 ◽  
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper, we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which different rollout steps of the analytical gradients computed through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms the other baselines.
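
To illustrate the value-gradient idea, the sketch below backpropagates the discounted return of a short rollout of a learned, differentiable dynamics model into the policy parameters. The model architecture, reward function, and rollout length are illustrative assumptions and do not reproduce the paper's infinite-horizon estimators.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, horizon, gamma = 4, 2, 5, 0.99

policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
dynamics = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(),
                         nn.Linear(32, obs_dim))            # learned transition model (assumed trained)
reward_fn = lambda s, a: -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))   # assumed differentiable reward

def deterministic_value_gradient(s0):
    """Backpropagate the discounted return of a model rollout into the policy parameters."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = policy(s)                                   # deterministic action
        total = total + (gamma ** t) * reward_fn(s, a)  # differentiable reward along the rollout
        s = dynamics(torch.cat([s, a], dim=-1))         # differentiable model step
    value = total.mean()
    grads = torch.autograd.grad(value, list(policy.parameters()))
    return value, grads

value, grads = deterministic_value_gradient(torch.randn(8, obs_dim))
```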


Author(s):  
Céline Hocquette ◽  
Stephen H. Muggleton

Predicate invention in Meta-Interpretive Learning (MIL) is generally based on a top-down approach, in which the search for a consistent hypothesis starts from the positive examples as goals. We consider augmenting top-down MIL systems with a bottom-up step during which the background knowledge is generalised with an extension of the immediate consequence operator for second-order logic programs. This new method provides a way to perform extensive predicate invention useful for feature discovery. We demonstrate that this method is complete with respect to a fragment of dyadic datalog. We theoretically prove that this method reduces the number of clauses to be learned by the top-down learner, which in turn can reduce the sample complexity. We formalise an equivalence relation on predicates, which is used to eliminate redundant predicates. Our experimental results suggest that pairing the state-of-the-art MIL system Metagol with an initial bottom-up step can significantly improve learning performance.
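
As a toy illustration of bottom-up evaluation with an immediate consequence operator, the Python sketch below derives new dyadic facts from chain rules over ground facts and iterates to a fixpoint. The grandparent example and the restriction to one rule shape are assumptions; the paper's setting involves second-order programs and is not reproduced here.

```python
def apply_chain_rule(head_pred, body_pred1, body_pred2, facts):
    """Derive head(X, Z) :- body1(X, Y), body2(Y, Z) over a set of dyadic ground facts."""
    derived = set()
    for (p1, x, y1) in facts:
        for (p2, y2, z) in facts:
            if p1 == body_pred1 and p2 == body_pred2 and y1 == y2:
                derived.add((head_pred, x, z))
    return derived

def fixpoint(rules, facts):
    """Iterate the immediate consequence operator until no new facts are produced."""
    facts = set(facts)
    while True:
        new = set()
        for head, b1, b2 in rules:
            new |= apply_chain_rule(head, b1, b2, facts)
        if new <= facts:
            return facts
        facts |= new

facts = {("parent", "ann", "bob"), ("parent", "bob", "carl")}
rules = [("grandparent", "parent", "parent")]   # grandparent(X,Z) :- parent(X,Y), parent(Y,Z)
print(fixpoint(rules, facts))
```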


2020 ◽  
Vol 2020 ◽  
pp. 1-14
Author(s):  
Zhan Wang ◽  
Pengyuan Li ◽  
Xiangrong Li ◽  
Hongtruong Pham

Conjugate gradient methods are well-known methods that are widely applied in many practical fields. The CD conjugate gradient method is one of the classical variants. In this paper, a modified three-term CD conjugate gradient algorithm is proposed. Its main features are as follows: (i) a modified three-term CD conjugate gradient formula is presented; (ii) the algorithm possesses the sufficient descent property and the trust region property; (iii) the algorithm achieves global convergence for general functions under the modified weak Wolfe–Powell (MWWP) line search technique together with a projection technique. The new algorithm performs well in numerical experiments, which show that the modified three-term CD conjugate gradient method is more competitive than the classical CD conjugate gradient method.
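
For reference, the numpy sketch below implements the classical (two-term) CD conjugate gradient iteration with a backtracking Armijo line search and a steepest-descent safeguard. The paper's modified three-term formula and MWWP line search are not reproduced; the test function and constants are assumptions.

```python
import numpy as np

def cd_conjugate_gradient(f, grad, x0, iters=200, tol=1e-6):
    """Classical CD method: d_{k+1} = -g_{k+1} + beta_k * d_k,
    with beta_k = ||g_{k+1}||^2 / (-d_k^T g_k)."""
    x = np.asarray(x0, float)
    g = grad(x)
    d = -g
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        # Backtracking Armijo line search along the descent direction d.
        alpha, c = 1.0, 1e-4
        while f(x + alpha * d) > f(x) + c * alpha * g.dot(d) and alpha > 1e-12:
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = g_new.dot(g_new) / (-d.dot(g))   # CD formula
        d = -g_new + beta * d
        if d.dot(g_new) >= 0:                   # safeguard: restart with steepest descent
            d = -g_new
        x, g = x_new, g_new
    return x

quad = lambda x: 0.5 * x.dot(np.diag([1.0, 10.0])).dot(x)
quad_grad = lambda x: np.diag([1.0, 10.0]).dot(x)
print(cd_conjugate_gradient(quad, quad_grad, np.array([3.0, -2.0])))
```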


Author(s):  
Ran Ji ◽  
Miguel A. Lejeune

We investigate a class of fractional distributionally robust optimization problems with uncertain probabilities. They consist of maximizing ambiguous fractional functions representing reward-risk ratios, and they admit a semi-infinite programming epigraphic formulation. We derive a new, fully parameterized closed-form expression to compute a new bound on the size of the Wasserstein ambiguity ball. We design a data-driven reformulation and solution framework. The reformulation phase involves the derivation of the support function of the ambiguity set and the concave conjugate of the ratio function. We design modular bisection algorithms that enjoy the finite convergence property. This class of problems has wide applicability in finance, and we specify new ambiguous portfolio optimization models for the Sharpe and Omega ratios. The computational study shows the applicability and scalability of the framework, which quickly solves large, industry-relevant instances that cannot be solved within one day by state-of-the-art mixed-integer nonlinear programming (MINLP) solvers.
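
The bisection idea can be illustrated on a much simpler, non-robust version of the problem: maximizing a reward-risk ratio over a finite candidate set by bisecting on the ratio level λ, using the equivalence "ratio ≥ λ is attainable iff max reward − λ·risk ≥ 0" (valid when risk is positive). The synthetic data and candidate set below are assumptions; the Wasserstein-ambiguous formulation is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(4), size=200)              # candidate portfolio weights (assumed)
scenario_returns = rng.normal(0.05, 0.15, size=(500, 4))      # synthetic return scenarios (assumed)

def reward(w):   # expected portfolio return
    return float(scenario_returns.dot(w).mean())

def risk(w):     # portfolio return standard deviation (strictly positive here)
    return float(scenario_returns.dot(w).std() + 1e-12)

def max_ratio_by_bisection(lo=0.0, hi=10.0, tol=1e-6):
    """Bisect on lambda: some w attains reward/risk >= lambda iff max_w reward(w) - lambda*risk(w) >= 0."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        best = max(reward(w) - lam * risk(w) for w in candidates)
        if best >= 0:
            lo = lam    # level lambda is attainable, search higher
        else:
            hi = lam    # not attainable, search lower
    return lo

print("approximate maximal Sharpe-like ratio:", max_ratio_by_bisection())
```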


Author(s):  
Gonglin Yuan ◽  
Tingting Li ◽  
Wujie Hu

To solve large-scale unconstrained optimization problems, a modified PRP conjugate gradient algorithm is proposed. It is of interest because it combines the steepest descent algorithm with the conjugate gradient method and fully exploits their excellent properties. For smooth functions, the algorithm uses information from the gradient and the previous direction to determine the next search direction. For nonsmooth functions, a Moreau–Yosida regularization is introduced into the proposed algorithm, which simplifies the treatment of complex problems. The proposed algorithm has the following characteristics: (i) a sufficient descent property as well as a trust region property; (ii) global convergence; (iii) numerical results on large-scale smooth and nonsmooth functions show that the proposed algorithm outperforms similar optimization methods; (iv) experiments on image restoration problems demonstrate that the given algorithm is successful.
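
A minimal sketch of the kind of hybrid described above is the standard PRP direction with a steepest-descent fallback (the common PRP+ safeguard): when the conjugacy term is unhelpful, the update degenerates to steepest descent. The paper's specific modified formula and its Moreau–Yosida treatment of nonsmooth problems are not reproduced; the example data are assumptions.

```python
import numpy as np

def prp_plus_direction(g_new, g_old, d_old):
    """Next search direction with the PRP+ rule:
    beta = max(0, g_new^T (g_new - g_old) / ||g_old||^2); beta = 0 recovers steepest descent."""
    beta = max(0.0, g_new.dot(g_new - g_old) / g_old.dot(g_old))
    return -g_new + beta * d_old

# Example: one direction update on f(x) = 0.5 * x^T diag(1, 10) x.
A = np.diag([1.0, 10.0])
x_old, x_new = np.array([3.0, -2.0]), np.array([2.0, -0.5])
g_old, g_new = A.dot(x_old), A.dot(x_new)
print(prp_plus_direction(g_new, g_old, -g_old))
```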


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Haishan Feng ◽  
Tingting Li

Combining the three-term conjugate gradient method of Yuan and Zhang, the acceleration step length of Andrei, and the hyperplane projection method of Solodov and Svaiter, we propose an accelerated conjugate gradient algorithm for solving nonlinear monotone equations in this paper. The presented algorithm has the following properties: (i) all search directions generated by the algorithm satisfy the sufficient descent and trust region properties, independently of the line search technique; (ii) a derivative-free search technique along the direction is proposed to obtain the step length αk; (iii) if φk = −αk(hk − h(wk))^T dk > 0, then an acceleration scheme is used to modify the step length in a multiplicative manner and generate a trial point; (iv) if this point satisfies the given condition, it is taken as the next iterate; otherwise, the hyperplane projection technique is used to obtain the next iterate; (v) the global convergence of the proposed algorithm is established under suitable conditions. Numerical comparisons with other conjugate gradient algorithms show that the accelerated computing scheme is more competitive. In addition, the presented algorithm can also be applied to image restoration.
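
A compact numpy sketch of the Solodov–Svaiter hyperplane projection framework referenced above is given below, using a plain residual direction d = −h(x) rather than the paper's three-term conjugate gradient direction; the test mapping and line search constants are assumptions.

```python
import numpy as np

def solve_monotone(h, x0, sigma=1e-4, beta=0.5, iters=500, tol=1e-8):
    """Hyperplane projection method (Solodov–Svaiter style) for a monotone equation h(x) = 0."""
    x = np.asarray(x0, float)
    for _ in range(iters):
        hx = h(x)
        if np.linalg.norm(hx) < tol:
            break
        d = -hx
        # Derivative-free line search: shrink alpha until -h(w)^T d >= sigma * alpha * ||d||^2.
        alpha = 1.0
        while True:
            w = x + alpha * d
            if -h(w).dot(d) >= sigma * alpha * d.dot(d) or alpha < 1e-12:
                break
            alpha *= beta
        hw = h(w)
        if np.linalg.norm(hw) < tol:     # w already solves the equation
            return w
        # Project x onto the separating hyperplane {z : h(w)^T (z - w) = 0}.
        x = x - (hw.dot(x - w) / hw.dot(hw)) * hw
    return x

# Example: a monotone mapping h(x) = A x + x^3 (componentwise) with A positive definite.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
h = lambda x: A.dot(x) + x**3
print(solve_monotone(h, np.array([1.5, -2.0])))
```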

