On Principled Entropy Exploration in Policy Optimization

Author(s):  
Jincheng Mei ◽  
Chenjun Xiao ◽  
Ruitong Huang ◽  
Dale Schuurmans ◽  
Martin Müller

In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while ensuring monotonic progress in a principled objective. ECPO conducts maximum-entropy exploration within a mirror descent framework, but updates policies using a reversed KL projection. This formulation bypasses undesirable mode-seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods on a set of benchmark tasks.
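To make the reversed-KL projection step concrete, here is a minimal sketch (not the authors' code; all names such as `target`, `tau`, and the learning rate are illustrative assumptions) of projecting a tabular softmax policy onto a maximum-entropy (Boltzmann) target by minimizing the reverse KL divergence:

```python
# Illustrative sketch: one reverse-KL projection step for a softmax policy.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reverse_kl(logits, q):
    """KL(pi_theta || q) for a single state."""
    p = softmax(logits)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def projection_step(logits, q, lr=0.1):
    """One gradient step decreasing KL(pi_theta || q) w.r.t. the logits."""
    p = softmax(logits)
    g = np.log(p + 1e-12) - np.log(q + 1e-12)
    grad = p * (g - np.sum(p * g))      # exact gradient for softmax parameterization
    return logits - lr * grad

# toy usage: project toward an entropy-regularized target built from Q-values
q_values, tau = np.array([1.0, 0.5, -0.2]), 0.5
target = softmax(q_values / tau)        # maximum-entropy (Boltzmann) target
logits = np.zeros(3)
for _ in range(200):
    logits = projection_step(logits, target)
print(reverse_kl(logits, target))       # should approach 0
```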

Author(s):  
Hanbo Zhang ◽  
Site Bai ◽  
Xuguang Lan ◽  
David Hsu ◽  
Nanning Zheng

Reinforcement learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
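A quadratic Monte-Carlo approximation of the KL divergence, in the spirit of the QKL estimator described above, can be sketched as follows (assumed names, not the authors' implementation). For nearby policies, KL(p || q) ≈ 0.5 E_{a~p}[(log p(a) − log q(a))²], and the squared form is non-negative, which tends to reduce estimator variance:

```python
# Hedged sketch: quadratic vs. naive Monte-Carlo KL estimates.
import numpy as np

def kl_quadratic(logp_old, logp_new):
    """Quadratic KL estimate from per-sample log-probs of actions drawn from pi_old."""
    diff = logp_old - logp_new
    return 0.5 * np.mean(diff ** 2)

def kl_naive(logp_old, logp_new):
    """Standard Monte-Carlo KL estimate; can be negative on finite samples."""
    return np.mean(logp_old - logp_new)

# toy check on two nearby Gaussian policies
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
logp_old = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
logp_new = -0.5 * (x - 0.05)**2 - 0.5 * np.log(2 * np.pi)
print(kl_naive(logp_old, logp_new), kl_quadratic(logp_old, logp_new))
# both estimates are close to the true KL of 0.5 * 0.05**2 = 0.00125
```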


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
J. P. Vasco ◽  
V. Savona

We optimize a silica-encapsulated silicon L3 photonic crystal cavity for an ultra-high quality factor by means of a global optimization strategy, in which the holes closest to the cavity are varied to minimize out-of-plane losses. We find an optimal value of $$Q_c=4.33\times 10^7$$, which is predicted to remain in the 2 million regime in the presence of structural imperfections compatible with state-of-the-art silicon fabrication tolerances.
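The kind of global search described above can be sketched schematically as follows (assumed names; the real objective requires a full electromagnetic solver, which is replaced here by a placeholder surrogate of the hole displacements):

```python
# Schematic sketch of a global search over hole displacements (toy surrogate only).
import numpy as np
from scipy.optimize import differential_evolution

def out_of_plane_loss(displacements):
    """Placeholder for a solver returning radiative loss vs. hole shifts (nm)."""
    d = np.asarray(displacements)
    return 1e-7 * (1.0 + np.sum((d - 30.0) ** 2) / 1e4)   # toy surrogate

bounds = [(-60.0, 60.0)] * 6   # nm shifts of the six holes nearest the cavity
result = differential_evolution(out_of_plane_loss, bounds, seed=0, tol=1e-12)
print("optimal hole shifts (nm):", result.x)
print("surrogate out-of-plane loss:", result.fun)
```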


2021 ◽  
Vol 12 (4) ◽  
pp. 98-116
Author(s):  
Noureddine Boukhari ◽  
Fatima Debbat ◽  
Nicolas Monmarché ◽  
Mohamed Slimane

Evolution strategies (ES) are a family of robust stochastic methods for global optimization and have proved more capable of avoiding local optima than other optimization methods. Many researchers have investigated different versions of the original evolution strategy with good results on a variety of optimization problems. However, the convergence rate of the algorithm toward the global optimum remains only asymptotic. To accelerate convergence, a hybrid approach is proposed that combines the nonlinear simplex method (Nelder-Mead) with an adaptive scheme to control when the local search is applied, and the authors demonstrate that this combination yields significantly better convergence. The proposed method has been tested on 15 complex benchmark functions, applied to the bi-objective portfolio optimization problem, and compared with other state-of-the-art techniques. Experimental results show that this hybridization improves performance in terms of solution quality and convergence speed.
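A minimal sketch of such a hybrid, under assumed structure (population size, stall-based trigger, and step-size schedule are illustrative choices, not the authors' settings): a simple (mu, lambda) evolution strategy whose mean is refined by Nelder-Mead whenever the search stalls.

```python
# Hedged sketch: evolution strategy with adaptive Nelder-Mead local refinement.
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def hybrid_es(f, dim=10, pop=20, sigma=0.5, iters=200, stall_limit=10, seed=0):
    rng = np.random.default_rng(seed)
    mean, best, stall = rng.uniform(-5, 5, dim), np.inf, 0
    for _ in range(iters):
        offspring = mean + sigma * rng.standard_normal((pop, dim))
        fitness = np.apply_along_axis(f, 1, offspring)
        elite = offspring[np.argsort(fitness)[: pop // 4]]
        mean = elite.mean(axis=0)                 # (mu, lambda) recombination
        if fitness.min() < best - 1e-12:
            best, stall = fitness.min(), 0
        else:
            stall += 1
        if stall >= stall_limit:                  # adaptive local-search trigger
            res = minimize(f, mean, method="Nelder-Mead")
            mean, best, stall = res.x, min(best, res.fun), 0
            sigma *= 0.7                          # shrink step size after refinement
    return mean, best

x_best, f_best = hybrid_es(rastrigin)
print(f_best)
```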


2022 ◽  
pp. 1-12
Author(s):  
Shuailong Li ◽  
Wei Zhang ◽  
Huiwen Zhang ◽  
Xin Zhang ◽  
Yuquan Leng

Model-free reinforcement learning methods have been applied successfully to practical problems such as decision-making in Atari games. However, these methods have inherent shortcomings, such as high variance and low sample efficiency. To improve the policy performance and sample efficiency of model-free reinforcement learning, we propose proximal policy optimization with model-based methods (PPOMM), a fusion of model-based and model-free reinforcement learning. PPOMM considers not only information from past experience but also predictive information about the future state. PPOMM adds information about the next state to the objective function of the proximal policy optimization (PPO) algorithm through a model-based method. The method uses two components to optimize the policy: the PPO error and the error of model-based reinforcement learning. We use the latter to optimize a latent transition model and predict information about the next state. Evaluated across 49 Atari games in the Arcade Learning Environment (ALE), PPOMM outperforms the state-of-the-art PPO algorithm on most games; experimental results show that PPOMM performs as well as or better than the original algorithm on 33 games.
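The two-component objective can be illustrated with a hedged sketch (all names, the mean-squared latent loss, and the weighting coefficient `beta` are assumptions, not the paper's exact formulation): a PPO clipped surrogate plus a next-state prediction error from a learned latent transition model.

```python
# Hedged sketch of a combined PPO + model-based prediction objective.
import numpy as np

def ppo_clip_loss(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate (returned as a loss to minimize)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

def model_prediction_loss(pred_next_latent, true_next_latent):
    """Error of a learned latent transition model on observed next states."""
    return np.mean((pred_next_latent - true_next_latent) ** 2)

def combined_loss(ratio, advantage, pred_next, true_next, beta=0.5):
    # beta weights the model-based term against the PPO surrogate (assumed knob)
    return ppo_clip_loss(ratio, advantage) + beta * model_prediction_loss(pred_next, true_next)

# toy batch
rng = np.random.default_rng(0)
print(combined_loss(rng.uniform(0.8, 1.2, 64), rng.standard_normal(64),
                    rng.standard_normal((64, 8)), rng.standard_normal((64, 8))))
```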


2020 ◽  
Vol 34 (04) ◽  
pp. 3962-3969
Author(s):  
Evrard Garcelon ◽  
Mohammad Ghavamzadeh ◽  
Alessandro Lazaric ◽  
Matteo Pirotta

In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
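A simplified sketch of the conservative mechanism described above (assumed names and a deliberately reduced version of the constraint check, not the paper's exact algorithm): a LinUCB-style learner that falls back to the baseline action whenever the optimistic choice could violate the cumulative performance constraint.

```python
# Hedged sketch: conservative contextual linear bandit with a baseline fallback.
import numpy as np

class ConservativeLinUCB:
    def __init__(self, dim, alpha=1.0, margin=0.1):
        self.A = np.eye(dim)            # regularized design matrix
        self.b = np.zeros(dim)
        self.alpha, self.margin = alpha, margin
        self.baseline_sum = 0.0         # cumulative baseline performance
        self.lower_bound_sum = 0.0      # pessimistic estimate of our own performance

    def choose(self, contexts, baseline_idx, baseline_value):
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        width = self.alpha * np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
        ucb, lcb = contexts @ theta + width, contexts @ theta - width
        k = int(np.argmax(ucb))
        # play the optimistic arm only if the conservative condition still holds
        if self.lower_bound_sum + lcb[k] >= (1 - self.margin) * (self.baseline_sum + baseline_value):
            chosen, lb = k, lcb[k]
        else:
            chosen, lb = baseline_idx, baseline_value
        self.lower_bound_sum += lb
        self.baseline_sum += baseline_value
        return chosen

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# toy usage with 5 random context vectors per round
rng = np.random.default_rng(0)
agent = ConservativeLinUCB(dim=4)
for t in range(100):
    ctx = rng.standard_normal((5, 4))
    arm = agent.choose(ctx, baseline_idx=0, baseline_value=0.1)
    reward = ctx[arm] @ np.array([0.2, -0.1, 0.3, 0.0]) + 0.05 * rng.standard_normal()
    agent.update(ctx[arm], reward)
```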


2021 ◽  
Vol 17 (3) ◽  
pp. e1008256
Author(s):  
Shuonan Chen ◽  
Jackson Loper ◽  
Xiaoyin Chen ◽  
Alex Vaughan ◽  
Anthony M. Zador ◽  
...  

Modern spatial transcriptomics methods can target thousands of different types of RNA transcripts in a single slice of tissue. Many biological applications demand a high spatial density of transcripts relative to the imaging resolution, leading to partial mixing of transcript rolonies in many voxels; unfortunately, current analysis methods do not perform robustly in this highly mixed setting. Here we develop a new analysis approach, BARcode DEmixing through Non-negative Spatial Regression (BarDensr): we start with a generative model of the physical process that leads to the observed image data and then apply sparse convex optimization methods to estimate the underlying (demixed) rolony densities. We apply BarDensr to simulated and real data and find that it achieves state-of-the-art signal recovery, particularly in densely labeled regions or data with low spatial resolution. Finally, BarDensr is fast and parallelizable. We provide open-source code as well as an implementation for the ‘NeuroCAAS’ cloud platform.
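The core estimation idea can be sketched as a non-negative sparse regression solved by proximal gradient descent (assumed names and a toy linear forward operator; the real model includes a full physical generative model of the imaging process):

```python
# Hedged sketch: recover non-negative, sparse densities x from y ≈ B x.
import numpy as np

def nonneg_sparse_regression(B, y, lam=0.1, iters=1000):
    """Minimize 0.5*||B x - y||^2 + lam*||x||_1 subject to x >= 0 (ISTA-style)."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2        # 1 / Lipschitz constant of the gradient
    x = np.zeros(B.shape[1])
    for _ in range(iters):
        grad = B.T @ (B @ x - y)
        x = np.maximum(x - step * (grad + lam), 0.0)   # non-negative soft-threshold
    return x

# toy demixing problem: a few active barcodes among many candidates
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 200))
x_true = np.zeros(200); x_true[[3, 17, 90]] = [2.0, 1.0, 3.0]
y = B @ x_true + 0.01 * rng.standard_normal(50)
x_hat = nonneg_sparse_regression(B, y)
print(np.argsort(x_hat)[-3:])   # indices of the three largest recovered densities
```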


Author(s):  
Shuang Wang ◽  
John C. Brigham

This work presents a strategy to identify the optimal localized activation and actuation for a morphing, thermally activated shape memory polymer (SMP) structure or structural component so that it achieves a targeted shape change or set of shape features, subject to design objectives such as minimal total required energy and time. The strategy combines numerical representations of the SMP structure's thermo-mechanical behavior under activation and actuation with gradient-based nonlinear optimization methods to solve the morphing inverse problem, which includes minimizing cost functions that address thermal and mechanical energy, morphing time, and damage. In particular, the optimization strategy uses the adjoint method to efficiently compute the gradient of the objective functional(s) with respect to the design parameters for this coupled thermo-mechanical problem.
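As a reminder of the adjoint construction referenced here (standard notation, not taken from the paper): for a design objective $$J(u,p)$$ depending on the state $$u$$ and design parameters $$p$$ through a residual constraint $$R(u,p)=0$$, the total gradient is

$$\frac{\mathrm{d}J}{\mathrm{d}p} = \frac{\partial J}{\partial p} + \lambda^{\top}\frac{\partial R}{\partial p}, \qquad \left(\frac{\partial R}{\partial u}\right)^{\!\top}\lambda = -\left(\frac{\partial J}{\partial u}\right)^{\!\top},$$

so each gradient evaluation costs one forward solve plus one adjoint solve, independent of the number of design parameters.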


2013 ◽  
Vol 17 (2) ◽  
pp. 509-524 ◽  
Author(s):  
Axel Groniewsky

The basic concept in applying numerical optimization methods to power plant optimization problems is to combine a state-of-the-art search algorithm with a powerful power plant simulation program to optimize the energy conversion system from both economic and thermodynamic viewpoints. Improving the energy conversion system by optimizing design and operation, and studying interactions among plant components, requires the investigation of a large number of possible design and operational alternatives. State-of-the-art search algorithms can assist in the development of cost-effective power plant concepts. The aim of this paper is to present how nature-inspired swarm intelligence (especially PSO) can be applied in the field of power plant optimization, how to address the problems that arise, and how to apply exergoeconomic optimization techniques to thermal power plants.
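A generic particle swarm optimization loop of the kind referred to here can be sketched as follows (assumed parameter values; in practice the objective would be a power plant simulator returning an exergoeconomic cost rather than the toy function used below):

```python
# Hedged sketch: basic particle swarm optimization (PSO).
import numpy as np

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))      # positions (design variables)
    v = np.zeros_like(x)                            # velocities
    pbest, pbest_f = x.copy(), np.apply_along_axis(f, 1, x)
    gbest = pbest[np.argmin(pbest_f)]
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        fx = np.apply_along_axis(f, 1, x)
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        gbest = pbest[np.argmin(pbest_f)]
    return gbest, pbest_f.min()

# toy objective standing in for a plant-simulation cost
best_x, best_cost = pso(lambda z: np.sum((z - 1.0) ** 2), dim=4)
print(best_x, best_cost)
```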

