On Principled Entropy Exploration in Policy Optimization

Author(s):  
Jincheng Mei ◽  
Chenjun Xiao ◽  
Ruitong Huang ◽  
Dale Schuurmans ◽  
Martin Müller

In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while ensuring monotonic progress in a principled objective. ECPO conducts maximum-entropy exploration within a mirror descent framework, but updates policies using a reversed KL projection. This formulation bypasses undesirable mode-seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods on a set of benchmark tasks.
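To make the reversed-KL projection step concrete, here is a minimal sketch (not the authors' code; all names such as `target`, `tau`, and the learning rate are illustrative assumptions) of projecting a tabular softmax policy onto a maximum-entropy (Boltzmann) target by minimizing the reverse KL divergence:

```python
# Illustrative sketch: one reverse-KL projection step for a softmax policy.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reverse_kl(logits, q):
    """KL(pi_theta || q) for a single state."""
    p = softmax(logits)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def projection_step(logits, q, lr=0.1):
    """One gradient step decreasing KL(pi_theta || q) w.r.t. the logits."""
    p = softmax(logits)
    g = np.log(p + 1e-12) - np.log(q + 1e-12)
    grad = p * (g - np.sum(p * g))      # exact gradient for softmax parameterization
    return logits - lr * grad

# toy usage: project toward an entropy-regularized target built from Q-values
q_values, tau = np.array([1.0, 0.5, -0.2]), 0.5
target = softmax(q_values / tau)        # maximum-entropy (Boltzmann) target
logits = np.zeros(3)
for _ in range(200):
    logits = projection_step(logits, target)
print(reverse_kl(logits, target))       # should approach 0
```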

Author(s):  
Hanbo Zhang ◽  
Site Bai ◽  
Xuguang Lan ◽  
David Hsu ◽  
Nanning Zheng

Reinforcement learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
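A quadratic Monte-Carlo approximation of the KL divergence, in the spirit of the QKL estimator described above, can be sketched as follows (assumed names, not the authors' implementation). For nearby policies, KL(p || q) ≈ 0.5 E_{a~p}[(log p(a) − log q(a))²], and the squared form is non-negative, which tends to reduce estimator variance:

```python
# Hedged sketch: quadratic vs. naive Monte-Carlo KL estimates.
import numpy as np

def kl_quadratic(logp_old, logp_new):
    """Quadratic KL estimate from per-sample log-probs of actions drawn from pi_old."""
    diff = logp_old - logp_new
    return 0.5 * np.mean(diff ** 2)

def kl_naive(logp_old, logp_new):
    """Standard Monte-Carlo KL estimate; can be negative on finite samples."""
    return np.mean(logp_old - logp_new)

# toy check on two nearby Gaussian policies
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
logp_old = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
logp_new = -0.5 * (x - 0.05)**2 - 0.5 * np.log(2 * np.pi)
print(kl_naive(logp_old, logp_new), kl_quadratic(logp_old, logp_new))
# both estimates are close to the true KL of 0.5 * 0.05**2 = 0.00125
```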


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
J. P. Vasco ◽  
V. Savona

We optimize a silica-encapsulated silicon L3 photonic crystal cavity for an ultra-high quality factor by means of a global optimization strategy, in which the holes closest to the cavity are varied to minimize out-of-plane losses. We find an optimal value of $$Q_c=4.33\times 10^7$$, which is predicted to remain in the 2 million regime in the presence of structural imperfections compatible with state-of-the-art silicon fabrication tolerances.
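The kind of global search described above can be sketched schematically as follows (assumed names; the real objective requires a full electromagnetic solver, which is replaced here by a placeholder surrogate of the hole displacements):

```python
# Schematic sketch of a global search over hole displacements (toy surrogate only).
import numpy as np
from scipy.optimize import differential_evolution

def out_of_plane_loss(displacements):
    """Placeholder for a solver returning radiative loss vs. hole shifts (nm)."""
    d = np.asarray(displacements)
    return 1e-7 * (1.0 + np.sum((d - 30.0) ** 2) / 1e4)   # toy surrogate

bounds = [(-60.0, 60.0)] * 6   # nm shifts of the six holes nearest the cavity
result = differential_evolution(out_of_plane_loss, bounds, seed=0, tol=1e-12)
print("optimal hole shifts (nm):", result.x)
print("surrogate out-of-plane loss:", result.fun)
```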


2021 ◽  
Vol 12 (4) ◽  
pp. 98-116
Author(s):  
Noureddine Boukhari ◽  
Fatima Debbat ◽  
Nicolas Monmarché ◽  
Mohamed Slimane

Evolution strategies (ES) are a family of robust stochastic methods for global optimization and have proved more capable of avoiding local optima than other optimization methods. Many researchers have investigated different versions of the original evolution strategy with good results on a variety of optimization problems. However, the convergence rate of the algorithm toward the global optimum remains only asymptotic. To accelerate convergence, a hybrid approach is proposed that combines the nonlinear simplex method (Nelder-Mead) with an adaptive scheme to control when the local search is applied, and the authors demonstrate that this combination yields significantly better convergence. The proposed method has been tested on 15 complex benchmark functions, applied to the bi-objective portfolio optimization problem, and compared with other state-of-the-art techniques. Experimental results show that this hybridization improves performance in terms of solution quality and convergence speed.
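A minimal sketch of such a hybrid, under assumed structure (population size, stall-based trigger, and step-size schedule are illustrative choices, not the authors' settings): a simple (mu, lambda) evolution strategy whose mean is refined by Nelder-Mead whenever the search stalls.

```python
# Hedged sketch: evolution strategy with adaptive Nelder-Mead local refinement.
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def hybrid_es(f, dim=10, pop=20, sigma=0.5, iters=200, stall_limit=10, seed=0):
    rng = np.random.default_rng(seed)
    mean, best, stall = rng.uniform(-5, 5, dim), np.inf, 0
    for _ in range(iters):
        offspring = mean + sigma * rng.standard_normal((pop, dim))
        fitness = np.apply_along_axis(f, 1, offspring)
        elite = offspring[np.argsort(fitness)[: pop // 4]]
        mean = elite.mean(axis=0)                 # (mu, lambda) recombination
        if fitness.min() < best - 1e-12:
            best, stall = fitness.min(), 0
        else:
            stall += 1
        if stall >= stall_limit:                  # adaptive local-search trigger
            res = minimize(f, mean, method="Nelder-Mead")
            mean, best, stall = res.x, min(best, res.fun), 0
            sigma *= 0.7                          # shrink step size after refinement
    return mean, best

x_best, f_best = hybrid_es(rastrigin)
print(f_best)
```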


2022 ◽  
pp. 1-12
Author(s):  
Shuailong Li ◽  
Wei Zhang ◽  
Huiwen Zhang ◽  
Xin Zhang ◽  
Yuquan Leng

Model-free reinforcement learning methods have been applied successfully to practical problems such as decision-making in Atari games. However, these methods have inherent shortcomings, such as high variance and low sample efficiency. To improve the policy performance and sample efficiency of model-free reinforcement learning, we propose proximal policy optimization with model-based methods (PPOMM), a fusion of model-based and model-free reinforcement learning. PPOMM considers not only information from past experience but also predictive information about the future state. PPOMM adds information about the next state to the objective function of the proximal policy optimization (PPO) algorithm through a model-based method. The method uses two components to optimize the policy: the PPO error and the error of model-based reinforcement learning. We use the latter to optimize a latent transition model and predict information about the next state. Evaluated across 49 Atari games in the Arcade Learning Environment (ALE), PPOMM outperforms the state-of-the-art PPO algorithm on most games; experimental results show that PPOMM performs as well as or better than the original algorithm on 33 games.
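The two-component objective can be illustrated with a hedged sketch (all names, the mean-squared latent loss, and the weighting coefficient `beta` are assumptions, not the paper's exact formulation): a PPO clipped surrogate plus a next-state prediction error from a learned latent transition model.

```python
# Hedged sketch of a combined PPO + model-based prediction objective.
import numpy as np

def ppo_clip_loss(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate (returned as a loss to minimize)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

def model_prediction_loss(pred_next_latent, true_next_latent):
    """Error of a learned latent transition model on observed next states."""
    return np.mean((pred_next_latent - true_next_latent) ** 2)

def combined_loss(ratio, advantage, pred_next, true_next, beta=0.5):
    # beta weights the model-based term against the PPO surrogate (assumed knob)
    return ppo_clip_loss(ratio, advantage) + beta * model_prediction_loss(pred_next, true_next)

# toy batch
rng = np.random.default_rng(0)
print(combined_loss(rng.uniform(0.8, 1.2, 64), rng.standard_normal(64),
                    rng.standard_normal((64, 8)), rng.standard_normal((64, 8))))
```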


2020 ◽  
Vol 34 (04) ◽  
pp. 3962-3969
Author(s):  
Evrard Garcelon ◽  
Mohammad Ghavamzadeh ◽  
Alessandro Lazaric ◽  
Matteo Pirotta

In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
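A simplified sketch of the conservative mechanism described above (assumed names and a deliberately reduced version of the constraint check, not the paper's exact algorithm): a LinUCB-style learner that falls back to the baseline action whenever the optimistic choice could violate the cumulative performance constraint.

```python
# Hedged sketch: conservative contextual linear bandit with a baseline fallback.
import numpy as np

class ConservativeLinUCB:
    def __init__(self, dim, alpha=1.0, margin=0.1):
        self.A = np.eye(dim)            # regularized design matrix
        self.b = np.zeros(dim)
        self.alpha, self.margin = alpha, margin
        self.baseline_sum = 0.0         # cumulative baseline performance
        self.lower_bound_sum = 0.0      # pessimistic estimate of our own performance

    def choose(self, contexts, baseline_idx, baseline_value):
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        width = self.alpha * np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
        ucb, lcb = contexts @ theta + width, contexts @ theta - width
        k = int(np.argmax(ucb))
        # play the optimistic arm only if the conservative condition still holds
        if self.lower_bound_sum + lcb[k] >= (1 - self.margin) * (self.baseline_sum + baseline_value):
            chosen, lb = k, lcb[k]
        else:
            chosen, lb = baseline_idx, baseline_value
        self.lower_bound_sum += lb
        self.baseline_sum += baseline_value
        return chosen

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# toy usage with 5 random context vectors per round
rng = np.random.default_rng(0)
agent = ConservativeLinUCB(dim=4)
for t in range(100):
    ctx = rng.standard_normal((5, 4))
    arm = agent.choose(ctx, baseline_idx=0, baseline_value=0.1)
    reward = ctx[arm] @ np.array([0.2, -0.1, 0.3, 0.0]) + 0.05 * rng.standard_normal()
    agent.update(ctx[arm], reward)
```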


2021 ◽  
Vol 17 (3) ◽  
pp. e1008256
Author(s):  
Shuonan Chen ◽  
Jackson Loper ◽  
Xiaoyin Chen ◽  
Alex Vaughan ◽  
Anthony M. Zador ◽  
...  

Modern spatial transcriptomics methods can target thousands of different types of RNA transcripts in a single slice of tissue. Many biological applications demand a high spatial density of transcripts relative to the imaging resolution, leading to partial mixing of transcript rolonies in many voxels; unfortunately, current analysis methods do not perform robustly in this highly mixed setting. Here we develop a new analysis approach, BARcode DEmixing through Non-negative Spatial Regression (BarDensr): we start with a generative model of the physical process that leads to the observed image data and then apply sparse convex optimization methods to estimate the underlying (demixed) rolony densities. We apply BarDensr to simulated and real data and find that it achieves state-of-the-art signal recovery, particularly in densely labeled regions or data with low spatial resolution. Finally, BarDensr is fast and parallelizable. We provide open-source code as well as an implementation for the ‘NeuroCAAS’ cloud platform.
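The core estimation idea can be sketched as a non-negative sparse regression solved by proximal gradient descent (assumed names and a toy linear forward operator; the real model includes a full physical generative model of the imaging process):

```python
# Hedged sketch: recover non-negative, sparse densities x from y ≈ B x.
import numpy as np

def nonneg_sparse_regression(B, y, lam=0.1, iters=1000):
    """Minimize 0.5*||B x - y||^2 + lam*||x||_1 subject to x >= 0 (ISTA-style)."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2        # 1 / Lipschitz constant of the gradient
    x = np.zeros(B.shape[1])
    for _ in range(iters):
        grad = B.T @ (B @ x - y)
        x = np.maximum(x - step * (grad + lam), 0.0)   # non-negative soft-threshold
    return x

# toy demixing problem: a few active barcodes among many candidates
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 200))
x_true = np.zeros(200); x_true[[3, 17, 90]] = [2.0, 1.0, 3.0]
y = B @ x_true + 0.01 * rng.standard_normal(50)
x_hat = nonneg_sparse_regression(B, y)
print(np.argsort(x_hat)[-3:])   # indices of the three largest recovered densities
```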


Author(s):  
Shuang Wang ◽  
John C. Brigham

This work presents a strategy to identify the optimal localized activation and actuation for a morphing, thermally activated shape memory polymer (SMP) structure or structural component so that it achieves a targeted shape change or set of shape features, subject to design objectives such as minimal total required energy and time. The strategy combines numerical representations of the SMP structure's thermo-mechanical behavior under activation and actuation with gradient-based nonlinear optimization methods to solve the morphing inverse problem, which includes minimizing cost functions that address thermal and mechanical energy, morphing time, and damage. In particular, the optimization strategy uses the adjoint method to efficiently compute the gradient of the objective functional(s) with respect to the design parameters for this coupled thermo-mechanical problem.
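As a reminder of the adjoint construction referenced here (standard notation, not taken from the paper): for a design objective $$J(u,p)$$ depending on the state $$u$$ and design parameters $$p$$ through a residual constraint $$R(u,p)=0$$, the total gradient is

$$\frac{\mathrm{d}J}{\mathrm{d}p} = \frac{\partial J}{\partial p} + \lambda^{\top}\frac{\partial R}{\partial p}, \qquad \left(\frac{\partial R}{\partial u}\right)^{\!\top}\lambda = -\left(\frac{\partial J}{\partial u}\right)^{\!\top},$$

so each gradient evaluation costs one forward solve plus one adjoint solve, independent of the number of design parameters.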


2013 ◽  
Vol 17 (2) ◽  
pp. 509-524 ◽  
Author(s):  
Axel Groniewsky

The basic concept in applying numerical optimization methods to power plant optimization problems is to combine a state-of-the-art search algorithm with a powerful power plant simulation program to optimize the energy conversion system from both economic and thermodynamic viewpoints. Improving the energy conversion system by optimizing design and operation, and studying interactions among plant components, requires the investigation of a large number of possible design and operational alternatives. State-of-the-art search algorithms can assist in the development of cost-effective power plant concepts. The aim of this paper is to present how nature-inspired swarm intelligence (especially PSO) can be applied in the field of power plant optimization, how to address the problems that arise, and how to apply exergoeconomic optimization techniques to thermal power plants.
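A generic particle swarm optimization loop of the kind referred to here can be sketched as follows (assumed parameter values; in practice the objective would be a power plant simulator returning an exergoeconomic cost rather than the toy function used below):

```python
# Hedged sketch: basic particle swarm optimization (PSO).
import numpy as np

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))      # positions (design variables)
    v = np.zeros_like(x)                            # velocities
    pbest, pbest_f = x.copy(), np.apply_along_axis(f, 1, x)
    gbest = pbest[np.argmin(pbest_f)]
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        fx = np.apply_along_axis(f, 1, x)
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        gbest = pbest[np.argmin(pbest_f)]
    return gbest, pbest_f.min()

# toy objective standing in for a plant-simulation cost
best_x, best_cost = pso(lambda z: np.sum((z - 1.0) ** 2), dim=4)
print(best_x, best_cost)
```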

