Policy Optimization with Second-Order Advantage Information

Author(s):  
Jiajin Li ◽  
Baoxiang Wang ◽  
Shengyu Zhang

Policy optimization on high-dimensional continuous control tasks is difficult because of the large variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and control variates (CV) into a unified framework to reduce this variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space based on second-order advantage information. POSA captures this quadratic information explicitly and efficiently by utilizing a wide & deep architecture. Empirical studies show that our approach yields performance improvements on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.
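Since the abstract includes no implementation, here is a minimal, self-contained sketch of the control-variate idea the estimator builds on: subtracting a baseline from the return leaves the score-function gradient unbiased while shrinking its variance. The toy objective and baseline value are our own illustrative assumptions, not the ASDG estimator itself.

```python
# Minimal sketch of variance reduction with a control variate (baseline),
# one ingredient of estimators like ASDG. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def score_grad(theta, n=10_000, baseline=0.0):
    """REINFORCE-style gradient of E[f(a)], a ~ N(theta, 1).
    Subtracting a baseline b keeps the estimator unbiased because
    E[(a - theta) * b] = 0, but it can shrink the variance."""
    a = rng.normal(theta, 1.0, size=n)
    f = (a - 2.0) ** 2                      # toy "return"
    grads = (f - baseline) * (a - theta)    # score: d/dtheta log N(a | theta, 1)
    return grads.mean(), grads.var()

g_raw, v_raw = score_grad(0.0, baseline=0.0)
g_cv,  v_cv  = score_grad(0.0, baseline=5.0)  # b ~ E[f] near theta = 0
print(f"no CV:   grad={g_raw:+.3f}, var={v_raw:.1f}")
print(f"with CV: grad={g_cv:+.3f}, var={v_cv:.1f}")
```

Rao-Blackwellization, roughly, goes one step further by conditioning the estimator on part of the action, which is where the learned action-subspace factorization enters.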

Author(s):  
Ziwei Luo ◽  
Jing Hu ◽  
Xin Wang ◽  
Siwei Lyu ◽  
Bin Kong ◽  
...  

Training a model-free deep reinforcement learning model to solve image-to-image translation is difficult because it involves high-dimensional continuous state and action spaces. In this paper, we draw inspiration from the recent success of the maximum entropy reinforcement learning framework, designed for challenging continuous control problems, to develop stochastic policies over high-dimensional continuous spaces, handling image representation, generation, and control simultaneously. Central to this method is the Stochastic Actor-Executor-Critic (SAEC), an off-policy actor-critic model with an additional executor that generates realistic images. Specifically, the actor handles the high-level representation and control policy through a stochastic latent action, and explicitly directs the executor to generate low-level actions that manipulate the state. Experiments on several image-to-image translation tasks demonstrate the effectiveness and robustness of the proposed SAEC on high-dimensional continuous space problems.
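As a reading aid, the following is a hedged sketch of the actor-executor-critic split as the abstract describes it: the actor emits a stochastic latent action, the executor decodes it into a low-level image action, and the critic scores state-latent pairs. All module shapes and layer choices are illustrative assumptions, not the SAEC reference implementation.

```python
# Toy rendering of an actor-executor-critic split (shapes are assumptions).
import torch
import torch.nn as nn

LATENT = 64

class Actor(nn.Module):                     # high-level stochastic policy
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Linear(state_dim, 2 * LATENT)
    def forward(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return mu + log_std.exp() * torch.randn_like(mu)  # reparameterized latent action

class Executor(nn.Module):                  # decodes the latent to a 32x32 image action
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT, 32 * 32), nn.Tanh())
    def forward(self, z):
        return self.net(z).view(-1, 1, 32, 32)

class Critic(nn.Module):                    # Q(s, z) over the latent action space
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Linear(state_dim + LATENT, 1)
    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

s = torch.randn(8, 128)
actor, executor, critic = Actor(128), Executor(), Critic(128)
z = actor(s)
print(executor(z).shape, critic(s, z).shape)  # image action and its value
```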


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2120
Author(s):  
Ying Ji ◽  
Jianhui Wang ◽  
Jiacan Xu ◽  
Donglin Li

The proliferation of distributed renewable energy resources (RESs) poses major challenges to the operation of microgrids due to uncertainty. Traditional online scheduling approaches relying on accurate forecasts become difficult to implement as the number of uncertain RESs grows. Although several data-driven methods have been proposed recently to overcome this challenge, they generally suffer from a scalability issue due to their limited ability to optimize high-dimensional continuous control variables. To address these issues, we propose a data-driven online scheduling method for microgrid energy optimization based on continuous-control deep reinforcement learning (DRL). We formulate the online scheduling problem as a Markov decision process (MDP). The objective is to minimize the operating cost of the microgrid considering the uncertainty of RES generation, load demand, and electricity prices. To learn the optimal scheduling strategy, a Gated Recurrent Unit (GRU)-based network is designed to extract temporal features of the uncertainty and generate optimal scheduling decisions in an end-to-end manner. To optimize the policy with high-dimensional and continuous actions, proximal policy optimization (PPO) is employed to train the neural network-based policy in a data-driven fashion. The proposed method requires neither forecasting information on the uncertainty nor prior knowledge of the physical model of the microgrid. Simulation results using realistic power system data from the California Independent System Operator (CAISO) demonstrate the effectiveness of the proposed method.
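A minimal sketch of the kind of GRU-based policy the abstract describes is given below: a recurrent encoder extracts temporal features from a window of uncertain inputs (RES output, load, price) and feeds a Gaussian action head plus a value head, both trainable with any standard PPO implementation. All dimensions and layer sizes are illustrative assumptions.

```python
# Hedged sketch of a GRU policy/value network for PPO-style training.
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, act_dim)           # mean dispatch decision
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(hidden, 1)              # critic sharing the encoder
    def forward(self, obs_seq, h0=None):
        out, hT = self.gru(obs_seq, h0)                # temporal features of RES/load/price
        feat = out[:, -1]                              # last step summarizes the window
        return self.mu(feat), self.log_std.exp(), self.value(feat), hT

# 24-step lookback window of (RES output, load, price) -> continuous setpoints
policy = GRUPolicy(obs_dim=3, act_dim=4)
mu, std, v, h = policy(torch.randn(16, 24, 3))
print(mu.shape, std.shape, v.shape)  # (16, 4), (4,), (16, 1)
```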


Author(s):  
Emmanuel Ifeanyi Iroegbu ◽  
Devaraj Madhavi

Deep reinforcement learning has been successful in solving common autonomous driving tasks such as lane-keeping using only pixel data from the front-view camera as input. However, raw pixel data is a very high-dimensional observation that degrades the agent's learning quality due to the complexity imposed by a realistic urban environment. We therefore investigate how compressing the raw pixel data from a high-dimensional state to a low-dimensional latent space offline, using a variational autoencoder, can significantly improve the training of a deep reinforcement learning agent. We evaluated our method on a simulated autonomous vehicle in CARLA (Car Learning to Act) and compared our results with several baselines, including deep deterministic policy gradient, proximal policy optimization, and soft actor-critic. The results show that the method greatly accelerates training and markedly improves the quality of the deep reinforcement learning agent.
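For concreteness, here is a minimal sketch of the offline compression step described: a variational autoencoder is trained on raw frames, and the frozen encoder's latent mean then serves as the agent's low-dimensional state. The architecture and image size are assumptions, not the paper's exact setup.

```python
# Illustrative VAE for offline pixel compression (architecture is assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.log_var = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 64 * 64 * 3), nn.Sigmoid())
    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, log_var

def vae_loss(x, recon, mu, log_var):
    rec = F.mse_loss(recon, x.flatten(1))                       # reconstruction term
    kld = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())  # KL to N(0, I)
    return rec + kld

vae = VAE()
frames = torch.rand(8, 3, 64, 64)          # raw camera frames
recon, mu, log_var = vae(frames)
print(vae_loss(frames, recon, mu, log_var).item())
# At RL time, the frozen encoder's mu is the agent's low-dimensional state.
```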


Author(s):  
Zhuobin Zheng ◽  
Chun Yuan ◽  
Zhihui Lin ◽  
Yangyang Cheng ◽  
Hanghao Wu

The Deep Deterministic Policy Gradient (DDPG) algorithm has achieved state-of-the-art performance in high-dimensional continuous control tasks. However, due to the complexity and randomness of the environment, DDPG tends to suffer from inefficient exploration and unstable training. In this work, we propose Self-Adaptive Double Bootstrapped DDPG (SOUP), an algorithm that extends DDPG to a bootstrapped actor-critic architecture. SOUP improves exploration efficiency through multiple actor heads capturing more potential actions and multiple critic heads collaboratively evaluating more reasonable Q-values. The crux of the double bootstrapped architecture is tackling the fluctuations in performance caused by multiple heads of uneven capacity varying throughout training. To alleviate this instability, a self-adaptive confidence mechanism is introduced to dynamically adjust the weights of the bootstrapped heads and enhance the ensemble performance effectively and efficiently. We demonstrate that SOUP learns at least 45% faster while substantially improving cumulative reward and stability in comparison to vanilla DDPG on OpenAI Gym's MuJoCo environments.
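The confidence-weighted ensemble idea can be rendered in a few lines. The sketch below is a toy interpretation, not the SOUP implementation: K bootstrapped critic heads are combined with learnable confidence weights passed through a softmax.

```python
# Toy confidence-weighted bootstrapped critic (not the SOUP implementation).
import torch
import torch.nn as nn

K = 5  # number of bootstrapped heads (assumed)

class BootstrappedCritic(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(in_dim, 1) for _ in range(K))
        self.confidence = nn.Parameter(torch.zeros(K))   # self-adaptive logits
    def forward(self, sa):
        q = torch.cat([h(sa) for h in self.heads], dim=-1)  # (batch, K) Q-estimates
        w = torch.softmax(self.confidence, dim=0)           # head weights sum to 1
        return (q * w).sum(dim=-1, keepdim=True)            # weighted ensemble Q

critic = BootstrappedCritic(in_dim=20)
print(critic(torch.randn(4, 20)).shape)  # (4, 1)
```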


2020 ◽  
Vol 34 (04) ◽  
pp. 5981-5988
Author(s):  
Yunhao Tang ◽  
Shipra Agrawal

In this work, we show that discretizing the action space for continuous control is a simple yet powerful technique for on-policy optimization. The explosion in the number of discrete actions can be efficiently addressed by a policy with a factorized distribution across action dimensions. We show that the discrete policy achieves significant performance gains with state-of-the-art on-policy optimization algorithms (PPO, TRPO, ACKTR), especially on high-dimensional tasks with complex dynamics. Additionally, we show that an ordinal parameterization of the discrete distribution can introduce an inductive bias that encodes the natural ordering between discrete actions. This ordinal architecture further improves the performance of PPO/TRPO significantly.
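The factorization the abstract relies on is easy to make concrete: with B bins per dimension and D dimensions, the policy outputs B * D logits instead of enumerating B ** D joint actions. The sketch below is an illustrative rendering; the bin count and action ranges are assumptions.

```python
# Factorized categorical policy over discretized continuous actions.
import torch
import torch.nn as nn

B, D = 11, 6   # bins per dimension, action dimensions (assumed)

class FactorizedDiscretePolicy(nn.Module):
    def __init__(self, obs_dim, low=-1.0, high=1.0):
        super().__init__()
        self.logits = nn.Linear(obs_dim, B * D)
        self.register_buffer("bins", torch.linspace(low, high, B))
    def forward(self, obs):
        logits = self.logits(obs).view(-1, D, B)       # independent per dimension
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()                            # (batch, D) bin indices
        log_prob = dist.log_prob(idx).sum(-1)          # factorized joint log-prob
        return self.bins[idx], log_prob                # continuous actions

policy = FactorizedDiscretePolicy(obs_dim=17)
act, logp = policy(torch.randn(32, 17))
print(act.shape, logp.shape)  # (32, 6), (32,)
```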


2021 ◽  
Vol 1 (2) ◽  
pp. 33-39
Author(s):  
Mónika Farsang ◽  
Luca Szegletes

Learning optimal behavior is the ultimate goal in reinforcement learning. This can be achieved by many different approaches, the most successful of which are policy gradient methods. However, they can suffer from undesirably large policy updates, leading to poor performance. In recent years there has been a clear trend toward designing more reliable algorithms. This paper examines different restriction strategies applied to the widely used Proximal Policy Optimization (PPO-Clip) technique. We also ask whether the analyzed methods can adapt not only to low-dimensional tasks but also to complex, high-dimensional problems in control and robotics domains. Analysis of the learned behavior shows that these methods can outperform the original PPO-Clip algorithm, and they are also able to achieve complex behavior and policies in high-dimensional environments.
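For reference, the clipped surrogate that the analyzed restriction strategies modify can be written in a few lines; the sketch below uses the common default epsilon = 0.2, which is our assumption rather than a value taken from this paper.

```python
# Standard PPO-Clip surrogate loss (epsilon = 0.2 is an assumed default).
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate: bounds the policy ratio to [1-eps, 1+eps] so a
    single update cannot move the policy too far from the behavior policy."""
    ratio = (log_prob_new - log_prob_old).exp()
    unclipped = ratio * advantage
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

adv = torch.randn(64)
loss = ppo_clip_loss(torch.randn(64), torch.randn(64), adv)
print(loss.item())
```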


2021 ◽  
pp. 1-8
Author(s):  
Andrea Rodríguez-Prat ◽  
Donna M. Wilson ◽  
Remei Agulles

Abstract
Background/Objective: Personal autonomy and control are major concepts for people with life-limiting conditions. Patients who express a wish to die (WTD) are often thought to want it because of a loss of autonomy or control. The research conducted so far has not focused on personal beliefs and perspectives, and little is known about patients' understanding of autonomy and control in this context. The aim of this review was to analyze what role autonomy and control may play in relation to the WTD expressed by people with life-limiting conditions.
Methods: A systematic integrative review was conducted. The search strategy used MeSH terms in combination with free-text searching of the EBSCO Discovery Service (which provides access to multiple academic library literature databases, including PubMed and CINAHL), as well as the large PsycINFO, Scopus, and Web of Science databases, from their inception until February 2019. The search was updated to January 2021.
Results: After the screening process, 85 full texts were included in the final analysis. Twenty-seven studies, recording the experiences of 1,824 participants, were identified. The studies were conducted in Australia (n = 5), Canada (n = 5), the USA (n = 5), the Netherlands (n = 3), Spain (n = 2), Sweden (n = 2), Switzerland (n = 2), Finland (n = 1), Germany (n = 1), and the UK (n = 1). Three themes were identified: (1) the presence of autonomy in the WTD, (2) the different ways in which autonomy is conceptualized, and (3) the socio-cultural context of the research participants.
Significance of results: Despite the importance given to the concept of autonomy in the WTD discourse, only a few empirical studies have focused on personal interests. Comprehending the context is crucial, because personal understandings of autonomy are shaped by socio-cultural and ethical backgrounds, and these impact personal WTD attitudes.


1983 ◽  
Vol 8 (2) ◽  
pp. 155-176

The purpose of these abstracts is to provide reference facilities in the management field. These abstracts have been sponsored by the Indian Council of Social Science Research. They cover books and articles on empirical studies, the experiences of people involved in the management process, and concepts and theories based on Indian data and the Indian environment, written by Indian or foreign authors and published in India or abroad. The following areas of management are covered:
Financial Management, Management Accounting, and Control (FM)
Marketing (M)
Organization and Administration (OA)
Personnel Management and Industrial Relations (PMIR)
Production Management, Computers, and Operations Research (PMCOR)
General Management: Environment, Policy, and Planning (GM)
Policy, Planning, and Development (PPD)
Books and articles published after January 1974 are covered in Vikalpa. Abstracts of publications between 1970 and 1973 have been published in two volumes by the Indian Institute of Management, Ahmedabad. For reprints of articles abstracted in Vikalpa, please contact the original journals. For further details, please write to Professor Shekhar Chaudhuri.


1994 ◽  
Vol 04 (04) ◽  
pp. 979-998 ◽  
Author(s):  
CHAI WAH WU ◽  
LEON O. CHUA

In this paper, we give a framework for the synchronization of dynamical systems which unifies many results in the synchronization and control of dynamical systems, in particular chaotic systems. We define concepts such as asymptotical synchronization, partial synchronization, and synchronization error bounds. We show how asymptotical synchronization is related to asymptotical stability. The main tool we use to prove asymptotical stability and synchronization is Lyapunov stability theory. We illustrate how many previous results on the synchronization and control of chaotic systems can be derived from this framework. We also give a characterization of the robustness of synchronization and show that master-slave asymptotical synchronization in Chua's oscillator is robust.
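The master-slave setup and the Lyapunov argument the abstract invokes can be summarized as follows; the notation is a standard illustrative rendering, not the paper's exact formulation.

```latex
% Master-slave synchronization and the Lyapunov argument, in standard notation.
\begin{align*}
  \dot{x} &= f(x) && \text{(master)} \\
  \dot{\hat{x}} &= f(\hat{x}) + u(x, \hat{x}) && \text{(slave with coupling } u\text{)} \\
  e &= x - \hat{x}, \qquad \dot{e} = f(x) - f(\hat{x}) - u(x, \hat{x}).
\end{align*}
Asymptotical synchronization means $\lVert e(t)\rVert \to 0$ as $t \to \infty$;
it follows if there exists a Lyapunov function $V(e)$ with $V(e) > 0$ for
$e \neq 0$ and $\dot{V}(e) < 0$ along trajectories of the error system,
which is exactly asymptotical stability of the equilibrium $e = 0$.
```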

