Policy Optimization with Second-Order Advantage Information

Author(s):  
Jiajin Li ◽  
Baoxiang Wang ◽  
Shengyu Zhang

Policy optimization on high-dimensional continuous control tasks is difficult because of the large variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and control variates (CV) into a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space based on second-order advantage information. POSA captures this quadratic information explicitly and efficiently by utilizing a wide & deep architecture. Empirical studies show that our approach yields performance improvements on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.
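Since the abstract includes no implementation, here is a minimal, self-contained sketch of the control-variate idea the estimator builds on: subtracting a baseline from the return leaves the score-function gradient unbiased while shrinking its variance. The toy objective and baseline value are our own illustrative assumptions, not the ASDG estimator itself.

```python
# Minimal sketch of variance reduction with a control variate (baseline),
# one ingredient of estimators like ASDG. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def score_grad(theta, n=10_000, baseline=0.0):
    """REINFORCE-style gradient of E[f(a)], a ~ N(theta, 1).
    Subtracting a baseline b keeps the estimator unbiased because
    E[(a - theta) * b] = 0, but it can shrink the variance."""
    a = rng.normal(theta, 1.0, size=n)
    f = (a - 2.0) ** 2                      # toy "return"
    grads = (f - baseline) * (a - theta)    # score: d/dtheta log N(a | theta, 1)
    return grads.mean(), grads.var()

g_raw, v_raw = score_grad(0.0, baseline=0.0)
g_cv,  v_cv  = score_grad(0.0, baseline=5.0)  # b ~ E[f] near theta = 0
print(f"no CV:   grad={g_raw:+.3f}, var={v_raw:.1f}")
print(f"with CV: grad={g_cv:+.3f}, var={v_cv:.1f}")
```

Rao-Blackwellization, roughly, goes one step further by conditioning the estimator on part of the action, which is where the learned action-subspace factorization enters.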

Author(s):  
Ziwei Luo ◽  
Jing Hu ◽  
Xin Wang ◽  
Siwei Lyu ◽  
Bin Kong ◽  
...  

Training a model-free deep reinforcement learning model to solve image-to-image translation is difficult because it involves high-dimensional continuous state and action spaces. In this paper, we draw inspiration from the recent success of the maximum entropy reinforcement learning framework, designed for challenging continuous control problems, to develop stochastic policies over high-dimensional continuous spaces, handling image representation, generation, and control simultaneously. Central to this method is the Stochastic Actor-Executor-Critic (SAEC), an off-policy actor-critic model with an additional executor that generates realistic images. Specifically, the actor handles the high-level representation and control policy through a stochastic latent action, and explicitly directs the executor to generate low-level actions that manipulate the state. Experiments on several image-to-image translation tasks demonstrate the effectiveness and robustness of the proposed SAEC on high-dimensional continuous space problems.
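As a reading aid, the following is a hedged sketch of the actor-executor-critic split as the abstract describes it: the actor emits a stochastic latent action, the executor decodes it into a low-level image action, and the critic scores state-latent pairs. All module shapes and layer choices are illustrative assumptions, not the SAEC reference implementation.

```python
# Toy rendering of an actor-executor-critic split (shapes are assumptions).
import torch
import torch.nn as nn

LATENT = 64

class Actor(nn.Module):                     # high-level stochastic policy
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Linear(state_dim, 2 * LATENT)
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return mu + log_std.exp() * torch.randn_like(mu)  # reparameterized latent action

class Executor(nn.Module):                  # decodes the latent to a 32x32 image action
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT, 32 * 32), nn.Tanh())
    def forward(self, z):
        return self.net(z).view(-1, 1, 32, 32)

class Critic(nn.Module):                    # Q(s, z) over the latent action space
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Linear(state_dim + LATENT, 1)
    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

s = torch.randn(8, 128)
actor, executor, critic = Actor(128), Executor(), Critic(128)
z = actor(s)
print(executor(z).shape, critic(s, z).shape)  # image action and its value
```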


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2120
Author(s):  
Ying Ji ◽  
Jianhui Wang ◽  
Jiacan Xu ◽  
Donglin Li

The proliferation of distributed renewable energy resources (RESs) poses major challenges to the operation of microgrids due to uncertainty. Traditional online scheduling approaches relying on accurate forecasts become difficult to implement as the number of uncertain RESs grows. Although several data-driven methods have been proposed recently to overcome this challenge, they generally suffer from a scalability issue due to their limited ability to optimize high-dimensional continuous control variables. To address these issues, we propose a data-driven online scheduling method for microgrid energy optimization based on continuous-control deep reinforcement learning (DRL). We formulate the online scheduling problem as a Markov decision process (MDP). The objective is to minimize the operating cost of the microgrid considering the uncertainty of RES generation, load demand, and electricity prices. To learn the optimal scheduling strategy, a Gated Recurrent Unit (GRU)-based network is designed to extract temporal features of the uncertainty and generate optimal scheduling decisions in an end-to-end manner. To optimize the policy with high-dimensional and continuous actions, proximal policy optimization (PPO) is employed to train the neural network-based policy in a data-driven fashion. The proposed method requires neither forecasting information on the uncertainty nor prior knowledge of the physical model of the microgrid. Simulation results using realistic power system data from the California Independent System Operator (CAISO) demonstrate the effectiveness of the proposed method.
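A minimal sketch of the kind of GRU-based policy the abstract describes is given below: a recurrent encoder extracts temporal features from a window of uncertain inputs (RES output, load, price) and feeds a Gaussian action head plus a value head, both trainable with any standard PPO implementation. All dimensions and layer sizes are illustrative assumptions.

```python
# Hedged sketch of a GRU policy/value network for PPO-style training.
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, act_dim)           # mean dispatch decision
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(hidden, 1)              # critic sharing the encoder
    def forward(self, obs_seq, h0=None):
        out, hT = self.gru(obs_seq, h0)                # temporal features of RES/load/price
        feat = out[:, -1]                              # last step summarizes the window
        return self.mu(feat), self.log_std.exp(), self.value(feat), hT

# 24-step lookback window of (RES output, load, price) -> continuous setpoints
policy = GRUPolicy(obs_dim=3, act_dim=4)
mu, std, v, h = policy(torch.randn(16, 24, 3))
print(mu.shape, std.shape, v.shape)  # (16, 4), (4,), (16, 1)
```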


Author(s):  
Emmanuel Ifeanyi Iroegbu ◽  
Devaraj Madhavi

Deep reinforcement learning has been successful in solving common autonomous driving tasks such as lane-keeping using only pixel data from the front-view camera as input. However, raw pixel data is a very high-dimensional observation that degrades the agent's learning quality due to the complexity imposed by a realistic urban environment. We therefore investigate how compressing the raw pixel data from a high-dimensional state to a low-dimensional latent space offline, using a variational autoencoder, can significantly improve the training of a deep reinforcement learning agent. We evaluated our method on a simulated autonomous vehicle in CARLA (Car Learning to Act) and compared our results with several baselines, including deep deterministic policy gradient, proximal policy optimization, and soft actor-critic. The results show that the method greatly accelerates training and markedly improves the quality of the deep reinforcement learning agent.
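For concreteness, here is a minimal sketch of the offline compression step described: a variational autoencoder is trained on raw frames, and the frozen encoder's latent mean then serves as the agent's low-dimensional state. The architecture and image size are assumptions, not the paper's exact setup.

```python
# Illustrative VAE for offline pixel compression (architecture is assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.log_var = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 64 * 64 * 3), nn.Sigmoid())
    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, log_var

def vae_loss(x, recon, mu, log_var):
    rec = F.mse_loss(recon, x.flatten(1))                       # reconstruction term
    kld = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())  # KL to N(0, I)
    return rec + kld

vae = VAE()
frames = torch.rand(8, 3, 64, 64)          # raw camera frames
recon, mu, log_var = vae(frames)
print(vae_loss(frames, recon, mu, log_var).item())
# At RL time, the frozen encoder's mu is the agent's low-dimensional state.
```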


Author(s):  
Zhuobin Zheng ◽  
Chun Yuan ◽  
Zhihui Lin ◽  
Yangyang Cheng ◽  
Hanghao Wu

The Deep Deterministic Policy Gradient (DDPG) algorithm has achieved state-of-the-art performance in high-dimensional continuous control tasks. However, due to the complexity and randomness of the environment, DDPG tends to suffer from inefficient exploration and unstable training. In this work, we propose Self-Adaptive Double Bootstrapped DDPG (SOUP), an algorithm that extends DDPG to a bootstrapped actor-critic architecture. SOUP improves exploration efficiency through multiple actor heads capturing more potential actions and multiple critic heads collaboratively evaluating more reasonable Q-values. The crux of the double bootstrapped architecture is tackling the fluctuations in performance caused by multiple heads of uneven capacity varying throughout training. To alleviate this instability, a self-adaptive confidence mechanism is introduced to dynamically adjust the weights of the bootstrapped heads and enhance the ensemble performance effectively and efficiently. We demonstrate that SOUP learns at least 45% faster while substantially improving cumulative reward and stability in comparison to vanilla DDPG on OpenAI Gym's MuJoCo environments.
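The confidence-weighted ensemble idea can be rendered in a few lines. The sketch below is a toy interpretation, not the SOUP implementation: K bootstrapped critic heads are combined with learnable confidence weights passed through a softmax.

```python
# Toy confidence-weighted bootstrapped critic (not the SOUP implementation).
import torch
import torch.nn as nn

K = 5  # number of bootstrapped heads (assumed)

class BootstrappedCritic(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(in_dim, 1) for _ in range(K))
        self.confidence = nn.Parameter(torch.zeros(K))   # self-adaptive logits
    def forward(self, sa):
        q = torch.cat([h(sa) for h in self.heads], dim=-1)  # (batch, K) Q-estimates
        w = torch.softmax(self.confidence, dim=0)           # head weights sum to 1
        return (q * w).sum(dim=-1, keepdim=True)            # weighted ensemble Q

critic = BootstrappedCritic(in_dim=20)
print(critic(torch.randn(4, 20)).shape)  # (4, 1)
```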


2020 ◽  
Vol 34 (04) ◽  
pp. 5981-5988
Author(s):  
Yunhao Tang ◽  
Shipra Agrawal

In this work, we show that discretizing the action space for continuous control is a simple yet powerful technique for on-policy optimization. The explosion in the number of discrete actions can be efficiently addressed by a policy with a factorized distribution across action dimensions. We show that the discrete policy achieves significant performance gains with state-of-the-art on-policy optimization algorithms (PPO, TRPO, ACKTR), especially on high-dimensional tasks with complex dynamics. Additionally, we show that an ordinal parameterization of the discrete distribution can introduce an inductive bias that encodes the natural ordering between discrete actions. This ordinal architecture further improves the performance of PPO/TRPO significantly.
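The factorization the abstract relies on is easy to make concrete: with B bins per dimension and D dimensions, the policy outputs B * D logits instead of enumerating B ** D joint actions. The sketch below is an illustrative rendering; the bin count and action ranges are assumptions.

```python
# Factorized categorical policy over discretized continuous actions.
import torch
import torch.nn as nn

B, D = 11, 6   # bins per dimension, action dimensions (assumed)

class FactorizedDiscretePolicy(nn.Module):
    def __init__(self, obs_dim, low=-1.0, high=1.0):
        super().__init__()
        self.logits = nn.Linear(obs_dim, B * D)
        self.register_buffer("bins", torch.linspace(low, high, B))
    def forward(self, obs):
        logits = self.logits(obs).view(-1, D, B)       # independent per dimension
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()                            # (batch, D) bin indices
        log_prob = dist.log_prob(idx).sum(-1)          # factorized joint log-prob
        return self.bins[idx], log_prob                # continuous actions

policy = FactorizedDiscretePolicy(obs_dim=17)
act, logp = policy(torch.randn(32, 17))
print(act.shape, logp.shape)  # (32, 6), (32,)
```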


2021 ◽  
Vol 1 (2) ◽  
pp. 33-39
Author(s):  
Mónika Farsang ◽  
Luca Szegletes

Learning optimal behavior is the ultimate goal in reinforcement learning. This can be achieved by many different approaches, the most successful of which are policy gradient methods. However, they can suffer from undesirably large policy updates, leading to poor performance. In recent years there has been a clear trend toward designing more reliable algorithms. This paper examines different restriction strategies applied to the widely used Proximal Policy Optimization (PPO-Clip) technique. We also ask whether the analyzed methods can adapt not only to low-dimensional tasks but also to complex, high-dimensional problems in control and robotics domains. Analysis of the learned behavior shows that these methods can outperform the original PPO-Clip algorithm, and they are also able to achieve complex behavior and policies in high-dimensional environments.
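For reference, the clipped surrogate that the analyzed restriction strategies modify can be written in a few lines; the sketch below uses the common default epsilon = 0.2, which is our assumption rather than a value taken from this paper.

```python
# Standard PPO-Clip surrogate loss (epsilon = 0.2 is an assumed default).
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate: bounds the policy ratio to [1-eps, 1+eps] so a
    single update cannot move the policy too far from the behavior policy."""
    ratio = (log_prob_new - log_prob_old).exp()
    unclipped = ratio * advantage
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

adv = torch.randn(64)
loss = ppo_clip_loss(torch.randn(64), torch.randn(64), adv)
print(loss.item())
```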


2021 ◽  
pp. 1-8
Author(s):  
Andrea Rodríguez-Prat ◽  
Donna M. Wilson ◽  
Remei Agulles

Abstract
Background/Objective: Personal autonomy and control are major concepts for people with life-limiting conditions. Patients who express a wish to die (WTD) are often thought to want it because of a loss of autonomy or control. The research conducted so far has not focused on personal beliefs and perspectives, and little is known about patients' understanding of autonomy and control in this context. The aim of this review was to analyze what role autonomy and control may play in relation to the WTD expressed by people with life-limiting conditions.
Methods: A systematic integrative review was conducted. The search strategy used MeSH terms in combination with free-text searching of the EBSCO Discovery Service (which provides access to multiple academic library literature databases, including PubMed and CINAHL), as well as the large PsycINFO, Scopus, and Web of Science databases, from their inception until February 2019. The search was updated to January 2021.
Results: After the screening process, 85 full texts were included in the final analysis. Twenty-seven studies, recording the experiences of 1,824 participants, were identified. The studies were conducted in Australia (n = 5), Canada (n = 5), the USA (n = 5), the Netherlands (n = 3), Spain (n = 2), Sweden (n = 2), Switzerland (n = 2), Finland (n = 1), Germany (n = 1), and the UK (n = 1). Three themes were identified: (1) the presence of autonomy in the WTD, (2) the different ways in which autonomy is conceptualized, and (3) the socio-cultural context of the research participants.
Significance of results: Despite the importance given to the concept of autonomy in the WTD discourse, only a few empirical studies have focused on personal interests. Comprehending the context is crucial, because personal understandings of autonomy are shaped by socio-cultural and ethical backgrounds, and these impact personal WTD attitudes.


1983 ◽  
Vol 8 (2) ◽  
pp. 155-176

The purpose of these abstracts is to provide reference facilities in the management field. These abstracts have been sponsored by the Indian Council of Social Science Research. They cover books and articles on empirical studies, the experiences of people involved in the management process, and concepts and theories based on Indian data and the Indian environment, written by Indian or foreign authors and published in India or abroad. The following areas of management are covered:
Financial Management, Management Accounting, and Control (FM)
Marketing (M)
Organization and Administration (OA)
Personnel Management and Industrial Relations (PMIR)
Production Management, Computers, and Operations Research (PMCOR)
General Management: Environment, Policy, and Planning (GM)
Policy, Planning, and Development (PPD)
Books and articles published after January 1974 are covered in Vikalpa. Abstracts of publications between 1970 and 1973 have been published in two volumes by the Indian Institute of Management, Ahmedabad. For reprints of articles abstracted in Vikalpa, please contact the original journals. For further details, please write to Professor Shekhar Chaudhuri.


1994 ◽  
Vol 04 (04) ◽  
pp. 979-998 ◽  
Author(s):  
CHAI WAH WU ◽  
LEON O. CHUA

In this paper, we give a framework for the synchronization of dynamical systems which unifies many results in the synchronization and control of dynamical systems, in particular chaotic systems. We define concepts such as asymptotical synchronization, partial synchronization, and synchronization error bounds. We show how asymptotical synchronization is related to asymptotical stability. The main tool we use to prove asymptotical stability and synchronization is Lyapunov stability theory. We illustrate how many previous results on the synchronization and control of chaotic systems can be derived from this framework. We also give a characterization of the robustness of synchronization and show that master-slave asymptotical synchronization in Chua's oscillator is robust.
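The master-slave setup and the Lyapunov argument the abstract invokes can be summarized as follows; the notation is a standard illustrative rendering, not the paper's exact formulation.

```latex
% Master-slave synchronization and the Lyapunov argument, in standard notation.
\begin{align*}
  \dot{x} &= f(x) && \text{(master)} \\
  \dot{\hat{x}} &= f(\hat{x}) + u(x, \hat{x}) && \text{(slave with coupling } u\text{)} \\
  e &= x - \hat{x}, \qquad \dot{e} = f(x) - f(\hat{x}) - u(x, \hat{x}).
\end{align*}
Asymptotical synchronization means $\lVert e(t)\rVert \to 0$ as $t \to \infty$;
it follows if there exists a Lyapunov function $V(e)$ with $V(e) > 0$ for
$e \neq 0$ and $\dot{V}(e) < 0$ along trajectories of the error system,
which is exactly asymptotical stability of the equilibrium $e = 0$.
```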

