Discretizing Continuous Action Space for On-Policy Optimization

2020 ◽  
Vol 34 (04) ◽  
pp. 5981-5988
Author(s):  
Yunhao Tang ◽  
Shipra Agrawal

In this work, we show that discretizing action space for continuous control is a simple yet powerful technique for on-policy optimization. The explosion in the number of discrete actions can be efficiently addressed by a policy with factorized distribution across action dimensions. We show that the discrete policy achieves significant performance gains with state-of-the-art on-policy optimization algorithms (PPO, TRPO, ACKTR) especially on high-dimensional tasks with complex dynamics. Additionally, we show that an ordinal parameterization of the discrete distribution can introduce the inductive bias that encodes the natural ordering between discrete actions. This ordinal architecture further significantly improves the performance of PPO/TRPO.
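The factorization and the ordinal parameterization described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the action range of [-1, 1], the evenly spaced bins, and the specific sigmoid-based ordinal form are all assumptions chosen for illustration. The point it shows is that a policy with K bins in each of d dimensions needs only K × d outputs rather than K^d, because each dimension is sampled independently and the joint log-probability is the sum of the per-dimension log-probabilities.

```python
import math
import random


def bin_centers(k, low=-1.0, high=1.0):
    """Return k evenly spaced action values in [low, high] (assumes k >= 2)."""
    step = (high - low) / (k - 1)
    return [low + i * step for i in range(k)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_sigmoid(x):
    """Stable log(sigmoid(x)); note log(1 - sigmoid(x)) == log_sigmoid(-x)."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ordinal_logits(raw):
    """Assumed ordinal form: bin i favours 'on' units j <= i and 'off' units j > i."""
    k = len(raw)
    return [
        sum(log_sigmoid(raw[j]) for j in range(i + 1))
        + sum(log_sigmoid(-raw[j]) for j in range(i + 1, k))
        for i in range(k)
    ]


def sample_factorized(logits_per_dim, rng):
    """Sample each action dimension independently; joint log-prob is the sum."""
    action, log_prob = [], 0.0
    for logits in logits_per_dim:
        probs = softmax(logits)
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        action.append(bin_centers(len(probs))[idx])
        log_prob += math.log(probs[idx])
    return action, log_prob


if __name__ == "__main__":
    k, d = 11, 6  # hypothetical sizes, not taken from the paper
    print("joint actions:", k ** d, "factorized outputs:", k * d)
    rng = random.Random(0)
    raw = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    print(sample_factorized([ordinal_logits(r) for r in raw], rng))
```

In this sketch the ordinal logits make neighbouring bins share most of their terms, so raising the probability of one bin also raises that of nearby bins, which is one way to encode an ordering between discrete actions.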

2021 ◽  
Author(s):  
Subramanian Sankaranarayanan ◽  
Sukriti Manna ◽  
Troy Loeffler ◽  
Rohit Batra ◽  
Suvo Banik ◽  
...  

Reinforcement learning (RL) approaches that combine a tree search with deep learning have found remarkable success in searching exorbitantly large, albeit discrete, action spaces, as demonstrated recently in board games like chess, Shogi, and Go. Many real-world materials discovery and design applications, however, involve multi-dimensional search problems and learning domains that have continuous action spaces. Exploring high-dimensional potential energy surfaces (PES) of materials to represent inter- and intra-molecular interactions, for example, involves a continuous action search to find optimal potential parameters or coefficients. Traditionally, these searches are time-consuming (often several years for a single system) and have been driven by human intuition and/or expertise and more recently by global/local optimization searches that have issues with convergence and/or do not scale well with the search dimensionality. Here, in a departure from discrete action and other gradient-based approaches, we introduce an RL strategy based on decision trees that incorporates modified rewards for improved exploration, efficient sampling during playouts, and a “window scaling scheme” for enhanced exploitation, to enable efficient and scalable search for continuous action space problems. Using high-dimensional artificial landscapes and control RL problems, we successfully benchmark our approach against popular global optimization schemes and state-of-the-art policy gradient methods, respectively. We further demonstrate its efficacy to perform high-throughput PES search for 54 different elemental systems across the Periodic table, including alkali, alkaline-earth, transition metals, metalloids, as well as non-metals.
Using a well-sampled (∼165,000 configurations) first-principles derived training and test dataset, we demonstrate that the new class of RL trained bond-order potentials capture the size-dependent energetic landscape from few atom clusters to bulk (energy errors << 200 meV/atom over a 3-6 eV sampled range) as well as their dynamics (force errors << 0.5 eV/A over a 50-100 eV/A range). We analyze the error trends across different elements in the latent space and trace their origin to elemental structural diversity and the smoothness of the element energy surface. Finally, we run molecular dynamics using these RL trained potentials and perform a comprehensive test of dynamic stability of more than 40,000 clusters sampled for different elements across the Periodic table. Our newly developed high-quality potentials will enable accelerated nanoscale materials design and discovery. Broadly, our RL strategy will be applicable to many other physical science problems involving search over continuous action spaces.


2020 ◽  
Vol 34 (07) ◽  
pp. 10460-10469 ◽  
Author(s):  
Ankan Bansal ◽  
Sai Saketh Rambhatla ◽  
Abhinav Shrivastava ◽  
Rama Chellappa

We present an approach for detecting human-object interactions (HOIs) in images, based on the idea that humans interact with functionally similar objects in a similar manner. The proposed model is simple and efficiently uses the data, visual features of the human, relative spatial orientation of the human and the object, and the knowledge that functionally similar objects take part in similar interactions with humans. We provide extensive experimental validation for our approach and demonstrate state-of-the-art results for HOI detection. On the HICO-Det dataset our method achieves a gain of over 2.5% absolute points in mean average precision (mAP) over state-of-the-art. We also show that our approach leads to significant performance gains for zero-shot HOI detection in the seen object setting. We further demonstrate that using a generic object detector, our model can generalize to interactions involving previously unseen objects.


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2120
Author(s):  
Ying Ji ◽  
Jianhui Wang ◽  
Jiacan Xu ◽  
Donglin Li

The proliferation of distributed renewable energy resources (RESs) poses major challenges to the operation of microgrids due to uncertainty. Traditional online scheduling approaches that rely on accurate forecasts become difficult to implement as the share of uncertain RESs increases. Although several data-driven methods have been proposed recently to overcome the challenge, they generally suffer from a scalability issue due to their limited ability to optimize high-dimensional continuous control variables. To address these issues, we propose a data-driven online scheduling method for microgrid energy optimization based on continuous-control deep reinforcement learning (DRL). We formulate the online scheduling problem as a Markov decision process (MDP). The objective is to minimize the operating cost of the microgrid considering the uncertainty of RES generation, load demand, and electricity prices. To learn the optimal scheduling strategy, a Gated Recurrent Unit (GRU)-based network is designed to extract temporal features of uncertainty and generate the optimal scheduling decisions in an end-to-end manner. To optimize the policy with high-dimensional and continuous actions, proximal policy optimization (PPO) is employed to train the neural network-based policy in a data-driven fashion. The proposed method does not require any forecasting information on the uncertainty or prior knowledge of the physical model of the microgrid. Simulation results using realistic power system data of California Independent System Operator (CAISO) demonstrate the effectiveness of the proposed method.


2021 ◽  
Vol 8 ◽  
Author(s):  
Zishang Kong ◽  
Min He ◽  
Qianjiang Luo ◽  
Xiansong Huang ◽  
Pengxu Wei ◽  
...  

Capsule endoscopy is a leading diagnostic tool for small bowel lesions, but it faces challenges such as time-consuming interpretation and the harsh optical environment inside the small intestine. Specialists unavoidably spend considerable time searching for images with a high clearness degree for accurate diagnosis. However, current clearness degree classification methods are based on either traditional attributes or an unexplainable deep neural network. In this paper, we propose a multi-task framework, called the multi-task classification and segmentation network (MTCSN), to achieve joint learning of clearness degree (CD) and tissue semantic segmentation (TSS) for the first time. In the MTCSN, the CD helps to generate better refined TSS, while TSS provides an explicable semantic map to better classify the CD. In addition, we present a new benchmark, named the Capsule-Endoscopy Crohn’s Disease dataset, which introduces the challenges faced in the real world including motion blur, excreta occlusion, reflection, and various complex alimentary scenes that are widely acknowledged in endoscopy examination. Extensive experiments and ablation studies report the significant performance gains of the MTCSN over state-of-the-art methods.


Author(s):  
Paul D. Wilcox ◽  
Anthony J. Croxford ◽  
Nicolas Budyn ◽  
Rhodri L. T. Bevan ◽  
Jie Zhang ◽  
...  

State-of-the-art ultrasonic non-destructive evaluation (NDE) uses an array to rapidly generate multiple, information-rich views at each test position on a safety-critical component. However, the information for detecting potential defects is dispersed across views, and a typical inspection may involve thousands of test positions. Interpretation requires painstaking analysis by a skilled operator. In this paper, various methods for fusing multi-view data are developed. Compared with any one single view, all methods are shown to yield significant performance gains, which may be related to the general and edge cases for NDE. In the general case, a defect is clearly detectable in at least one individual view, but the view(s) depends on the defect location and orientation. Here, the performance gain from data fusion is mainly the result of the selective use of information from the most appropriate view(s) and fusion provides a means to substantially reduce operator burden. The edge cases are defects that cannot be reliably detected in any one individual view without false alarms. Here, certain fusion methods are shown to enable detection with reduced false alarms. In this context, fusion allows NDE capability to be extended with potential implications for the design and operation of engineering assets.


Author(s):  
Zhuobin Zheng ◽  
Chun Yuan ◽  
Zhihui Lin ◽  
Yangyang Cheng ◽  
Hanghao Wu

The Deep Deterministic Policy Gradient (DDPG) algorithm has achieved state-of-the-art performance in high-dimensional continuous control tasks. However, due to the complexity and randomness of the environment, DDPG tends to suffer from inefficient exploration and unstable training. In this work, we propose Self-Adaptive Double Bootstrapped DDPG (SOUP), an algorithm that extends DDPG to a bootstrapped actor-critic architecture. SOUP improves the efficiency of exploration through multiple actor heads that capture more potential actions and multiple critic heads that collaboratively evaluate more reasonable Q-values. The crux of the double bootstrapped architecture is to tackle the fluctuations in performance caused by multiple heads of spotty capacity varying throughout training. To alleviate the instability, a self-adaptive confidence mechanism is introduced to dynamically adjust the weights of bootstrapped heads and enhance the ensemble performance effectively and efficiently. We demonstrate that SOUP achieves faster learning by at least 45% while improving cumulative reward and stability substantially in comparison to vanilla DDPG on OpenAI Gym's MuJoCo environments.


Author(s):  
Jiajin Li ◽  
Baoxiang Wang ◽  
Shengyu Zhang

Policy optimization on high-dimensional continuous control tasks is difficult because of the large variance of the policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce the variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space based on the second-order advantage information. POSA captures the quadratic information explicitly and efficiently by utilizing the wide & deep architecture. Empirical studies show that our proposed approach improves performance on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.


Author(s):  
Nian-Ze Lee ◽  
Yen-Shi Wang ◽  
Jie-Hong R. Jiang

Stochastic Boolean satisfiability (SSAT) is an expressive language to formulate decision problems with randomness. Solving SSAT formulas has the same PSPACE-complete computational complexity as solving quantified Boolean formulas (QBFs). Despite its broad applications and profound theoretical values, SSAT has received relatively little attention compared to QBF. In this paper, we focus on exist-random quantified SSAT formulas, also known as E-MAJSAT, which is a special fragment of SSAT commonly applied in probabilistic conformant planning, posteriori hypothesis, and maximum expected utility. Based on clause selection, a recently proposed QBF technique, we propose an algorithm to solve E-MAJSAT. Moreover, our method can provide an approximate solution to E-MAJSAT with a lower bound when an exact answer is too expensive to compute. Experiments show that the proposed algorithm achieves significant performance gains and memory savings over the state-of-the-art SSAT solvers on a number of benchmark formulas, and provides useful lower bounds for cases where prior methods fail to compute exact answers.


2021 ◽  
Vol 15 (8) ◽  
pp. 898-911
Author(s):  
Yongqing Zhang ◽  
Jianrong Yan ◽  
Siyu Chen ◽  
Meiqin Gong ◽  
Dongrui Gao ◽  
...  

Rapid advances in biological research over recent years have significantly enriched biological and medical data resources. Deep learning-based techniques have been successfully utilized to process data in this field, and they have exhibited state-of-the-art performances even on high-dimensional, nonstructural, and black-box biological data. The aim of the current study is to provide an overview of the deep learning-based techniques used in biology and medicine and their state-of-the-art applications. In particular, we introduce the fundamentals of deep learning and then review the success of applying such methods to bioinformatics, biomedical imaging, biomedicine, and drug discovery. We also discuss the challenges and limitations of this field, and outline possible directions for further research.

