Self-Adaptive Double Bootstrapped DDPG

The Deep Deterministic Policy Gradient (DDPG) algorithm has achieved state-of-the-art performance in high-dimensional continuous control tasks. However, due to the complexity and randomness of the environment, DDPG tends to suffer from inefficient exploration and unstable training. In this work, we propose Self-Adaptive Double Bootstrapped DDPG (SOUP), an algorithm that extends DDPG to a bootstrapped actor-critic architecture. SOUP improves exploration efficiency by using multiple actor heads to capture more potential actions and multiple critic heads to collaboratively evaluate more reasonable Q-values. The crux of the double bootstrapped architecture is tackling the fluctuations in performance caused by heads whose capacity is uneven and varies throughout training. To alleviate this instability, a self-adaptive confidence mechanism is introduced to dynamically adjust the weights of the bootstrapped heads and enhance ensemble performance. We demonstrate that SOUP learns at least 45% faster while substantially improving cumulative reward and stability compared with vanilla DDPG on OpenAI Gym's MuJoCo environments.
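The idea of weighting bootstrapped heads by confidence can be illustrated with a minimal sketch. The abstract does not specify how SOUP computes or applies its confidences, so everything here is an assumption made for illustration: the function names `softmax` and `ensemble_action`, the use of a softmax to turn confidences into weights, and the weighted average of head actions are all hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def ensemble_action(head_actions, head_confidences):
    """Confidence-weighted average of the actions proposed by K actor heads.

    Hypothetical combination rule; not taken from the SOUP paper.
    """
    weights = softmax(np.asarray(head_confidences, dtype=float))
    actions = np.asarray(head_actions, dtype=float)  # shape (K, action_dim)
    return weights @ actions, weights

# Three hypothetical actor heads proposing 2-D actions.
actions = [[0.2, -0.1], [0.4, 0.0], [-0.3, 0.5]]
confidences = [1.0, 1.0, 1.0]  # equal confidence reduces to a plain average
action, weights = ensemble_action(actions, confidences)
```

With equal confidences the weights are uniform and the result is the plain mean of the head actions; raising one head's confidence shifts the ensemble action toward that head.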

Deterministic Value-Policy Gradients

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5732 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3316-3323

Author(s):

Qingpeng Cai ◽

Ling Pan ◽

Pingzhong Tang

Keyword(s):

Reinforcement Learning ◽

State Of The Art ◽

Learning Algorithms ◽

Infinite Horizon ◽

Gradient Algorithm ◽

Continuous Control ◽

Model Bias ◽

Model Free ◽

Policy Gradient ◽

Analytical Gradients

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper, we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite horizon. We first give a theoretical guarantee of the existence of the value gradients in the infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which the number of rollout steps of the analytical gradients through the learned model trades off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms the baselines.

Policy Search by Target Distribution Learning for Continuous Control

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6156 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6770-6777

Author(s):

Chuheng Zhang ◽

Yuanqi Li ◽

Jian Li

Keyword(s):

State Of The Art ◽

Gradient Methods ◽

Continuous Control ◽

Policy Network ◽

Current Policy ◽

Training Process ◽

Target Distribution ◽

Policy Gradient ◽

And Training ◽

Better Than

It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic, leading to an unstable training process. We show that such instability can happen even in a very simple environment. To address this issue, we propose a new method, called target distribution learning (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
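One way to see why near-deterministic policies can produce very large gradients is to look at a one-dimensional Gaussian policy, which is an assumption made here for illustration rather than the setting analysed in the paper. For a policy N(mu, sigma^2), the gradient of the log-density with respect to mu is (a - mu) / sigma^2; for a sampled action a = mu + sigma * eps this equals eps / sigma, which grows without bound as sigma shrinks. The function name `gaussian_score_mu` is hypothetical.

```python
def gaussian_score_mu(a, mu, sigma):
    """Gradient of log N(a; mu, sigma^2) with respect to the mean mu."""
    return (a - mu) / sigma**2

mu = 0.0
eps = 1.0  # a fixed noise value, chosen for illustration instead of a random draw
sigmas = [1.0, 0.1, 0.01]

# The sampled action is a = mu + sigma * eps, so each score equals eps / sigma.
grads = [gaussian_score_mu(mu + s * eps, mu, s) for s in sigmas]
```

Shrinking sigma by a factor of 10 multiplies the score by 10 for the same noise value, which is the kind of blow-up the abstract refers to.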

The Successful Ingredients of Policy Gradient Algorithms

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/338 ◽

2021 ◽

Author(s):

Sven Gronauer ◽

Martin Gottwald ◽

Klaus Diepold

Keyword(s):

Reinforcement Learning ◽

State Of The Art ◽

Gradient Methods ◽

Gradient Algorithms ◽

The Sublime ◽

Policy Gradient ◽

Art Performance ◽

Underlying Mechanisms

Despite the sublime success in recent years, the underlying mechanisms powering the advances of reinforcement learning are still poorly understood. In this paper, we identify these mechanisms, which we call ingredients, in on-policy policy gradient methods and empirically determine their impact on learning. To allow an equitable assessment, we conduct our experiments based on a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not need sophisticated algorithms but can also be accomplished by the combination of a few simple ingredients.

Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/604 ◽

2018 ◽

Cited By ~ 16

Author(s):

Tao Shen ◽

Tianyi Zhou ◽

Guodong Long ◽

Jing Jiang ◽

Sen Wang ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Mutual Benefit ◽

Fusion Model ◽

Attention Network ◽

Sequence Modeling ◽

Policy Gradient ◽

Art Performance

Many natural language processing tasks rely solely on sparse dependencies between a few tokens in a sentence. Soft attention mechanisms show promising performance in modeling local/global dependencies by soft probabilities between every two tokens, but they are neither effective nor efficient when applied to long sentences. By contrast, hard attention mechanisms directly select a subset of tokens but are difficult and inefficient to train due to their combinatorial nature. In this paper, we integrate both soft and hard attention into one context fusion model, "reinforced self-attention (ReSA)", for their mutual benefit. In ReSA, a hard attention trims a sequence for a soft self-attention to process, while the soft attention feeds reward signals back to facilitate the training of the hard one. For this purpose, we develop a novel hard attention called "reinforced sequence sampling (RSS)", which selects tokens in parallel and is trained via policy gradient. Using two RSS modules, ReSA efficiently extracts the sparse dependencies between each pair of selected tokens. We finally propose an RNN/CNN-free sentence-encoding model, "reinforced self-attention network (ReSAN)", solely based on ReSA. It achieves state-of-the-art performance on both the Stanford Natural Language Inference (SNLI) and the Sentences Involving Compositional Knowledge (SICK) datasets.

Policy Optimization with Second-Order Advantage Information

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/699 ◽

2018 ◽

Cited By ~ 1

Author(s):

Jiajin Li ◽

Baoxiang Wang ◽

Shengyu Zhang

Keyword(s):

Empirical Studies ◽

Second Order ◽

High Dimensional ◽

Continuous Control ◽

Unified Framework ◽

Performance Improvements ◽

Factorization Structure ◽

Policy Gradient ◽

Policy Optimization ◽

And Control

Policy optimization on high-dimensional continuous control tasks is difficult because of the large variance of the policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce the variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure of the action space based on the second-order advantage information. POSA captures the quadratic information explicitly and efficiently by utilizing the wide & deep architecture. Empirical studies show that our proposed approach yields performance improvements on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.
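The variance-reduction role of a control variate can be shown on a toy Monte Carlo problem. This is a generic textbook illustration, not the ASDG estimator: the target E[exp(U)] with U uniform on [0, 1], the choice of U itself as the control variate (known mean 0.5), and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=10_000)

f = np.exp(u)  # quantity whose mean we want; the true value is e - 1
g = u          # control variate with known mean 0.5

# Variance-minimising coefficient c = Cov(f, g) / Var(g), estimated from the
# same samples (this introduces a small bias that vanishes as samples grow).
c = np.cov(f, g)[0, 1] / np.var(g, ddof=1)

# Subtracting c * (g - E[g]) leaves the mean unchanged but lowers the variance.
f_cv = f - c * (g - 0.5)

plain_estimate, cv_estimate = f.mean(), f_cv.mean()
plain_var, cv_var = f.var(ddof=1), f_cv.var(ddof=1)
```

Both estimates target the same mean, but the per-sample variance `cv_var` is much smaller than `plain_var` because exp(U) and U are strongly correlated. Policy gradient methods use the same principle with a baseline in place of `g`.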

Discretizing Continuous Action Space for On-Policy Optimization

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6059 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5981-5988

Author(s):

Yunhao Tang ◽

Shipra Agrawal

Keyword(s):

Complex Dynamics ◽

State Of The Art ◽

Discrete Distribution ◽

Action Space ◽

High Dimensional ◽

Continuous Control ◽

Continuous Action ◽

Significant Performance ◽

Policy Optimization ◽

Performance Gains

In this work, we show that discretizing action space for continuous control is a simple yet powerful technique for on-policy optimization. The explosion in the number of discrete actions can be efficiently addressed by a policy with factorized distribution across action dimensions. We show that the discrete policy achieves significant performance gains with state-of-the-art on-policy optimization algorithms (PPO, TRPO, ACKTR) especially on high-dimensional tasks with complex dynamics. Additionally, we show that an ordinal parameterization of the discrete distribution can introduce the inductive bias that encodes the natural ordering between discrete actions. This ordinal architecture further significantly improves the performance of PPO/TRPO.
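The factorization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_factorized_action`, the evenly spaced bins, the action range [-1, 1], and the choice of 6 dimensions with 11 bins are all hypothetical. The point is that sampling one bin per dimension independently needs only `num_dims * num_bins` logits, whereas a joint distribution over all combinations would need `num_bins ** num_dims` outputs.

```python
import numpy as np

def sample_factorized_action(logits, low, high, rng):
    """Sample one bin per action dimension independently and map it to a continuous value."""
    num_dims, num_bins = logits.shape
    bin_values = np.linspace(low, high, num_bins)  # evenly spaced bin values
    # Row-wise softmax: one categorical distribution per action dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    bins = np.array([rng.choice(num_bins, p=probs[d]) for d in range(num_dims)])
    return bin_values[bins], bins

rng = np.random.default_rng(0)
num_dims, num_bins = 6, 11
logits = np.zeros((num_dims, num_bins))  # uniform policy, for illustration only
action, bins = sample_factorized_action(logits, -1.0, 1.0, rng)

joint_actions = num_bins ** num_dims      # size of the unfactorized action set
factorized_outputs = num_bins * num_dims  # logits the factorized policy emits
```

In a real agent the logits would come from a policy network conditioned on the state; the ordinal parameterization mentioned in the abstract is not shown here.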

Control of Shared Energy Storage Assets Within Building Clusters Using Reinforcement Learning

Volume 2A: 44th Design Automation Conference ◽

10.1115/detc2018-86094 ◽

2018 ◽

Cited By ~ 2

Author(s):

Philip Odonkor ◽

Kemper Lewis

Keyword(s):

Reinforcement Learning ◽

Energy Storage ◽

State Of The Art ◽

Continuous Control ◽

Battery System ◽

Current State ◽

Policy Gradient ◽

Energy Assets ◽

The Impact ◽

Continuous Domain

This work leverages the current state of the art in reinforcement learning for continuous control, the Deep Deterministic Policy Gradient (DDPG) algorithm, towards the optimal 24-hour dispatch of shared energy assets within building clusters. The modeled DDPG agent interacts with a battery environment designed to emulate a shared battery system. The aim here is not only to learn an efficient charge/discharge policy, but also to address the continuous-domain question of how much energy should be charged or discharged. Experimentally, we examine the impact of the learned dispatch strategy on minimizing demand peaks within the building cluster. Our results show that across the variety of building cluster combinations studied, the algorithm is able to learn and exploit energy arbitrage, tailoring it into battery dispatch strategies for peak demand shifting.

Review of the Applications of Deep Learning in Bioinformatics

Current Bioinformatics ◽

10.2174/1574893615999200711165743 ◽

2021 ◽

Vol 15 (8) ◽

pp. 898-911

Author(s):

Yongqing Zhang ◽

Jianrong Yan ◽

Siyu Chen ◽

Meiqin Gong ◽

Dongrui Gao ◽

...

Keyword(s):

Deep Learning ◽

Drug Discovery ◽

Biomedical Imaging ◽

State Of The Art ◽

Black Box ◽

Medical Data ◽

Biological Data ◽

High Dimensional ◽

Biological Research ◽

Process Data

Rapid advances in biological research over recent years have significantly enriched biological and medical data resources. Deep learning-based techniques have been successfully utilized to process data in this field, and they have exhibited state-of-the-art performances even on high-dimensional, nonstructural, and black-box biological data. The aim of the current study is to provide an overview of the deep learning-based techniques used in biology and medicine and their state-of-the-art applications. In particular, we introduce the fundamentals of deep learning and then review the success of applying such methods to bioinformatics, biomedical imaging, biomedicine, and drug discovery. We also discuss the challenges and limitations of this field, and outline possible directions for further research.

Efficient Rank-Based Diffusion Process with Assured Convergence

Journal of Imaging ◽

10.3390/jimaging7030049 ◽

2021 ◽

Vol 7 (3) ◽

pp. 49

Author(s):

Daniel Carlos Guimarães Pedronette ◽

Lucas Pascotti Valem ◽

Longin Jan Latecki

Keyword(s):

Diffusion Process ◽

Learning Strategies ◽

State Of The Art ◽

Representation Learning ◽

Theoretical Background ◽

High Dimensional ◽

Visual Features ◽

Learning Approaches ◽

Previous Decade ◽

Asymptotic Complexity

Visual features and representation learning strategies experienced huge advances in the previous decade, mainly supported by deep learning approaches. However, retrieval tasks are still performed mainly based on traditional pairwise dissimilarity measures, while the learned representations lie on high-dimensional manifolds. With the aim of going beyond pairwise analysis, post-processing methods have been proposed to replace pairwise measures with globally defined measures, capable of analyzing collections in terms of the underlying data manifold. The most representative approaches are diffusion and rank-based methods. While the diffusion approaches can be computationally expensive, the rank-based methods lack theoretical background. In this paper, we propose an efficient Rank-based Diffusion Process which combines both approaches and avoids the drawbacks of each one. The obtained method is capable of efficiently approximating a diffusion process by exploiting rank-based information, while assuring its convergence. The algorithm exhibits very low asymptotic complexity and can be computed regionally, making it suitable for out-of-dataset queries. An experimental evaluation conducted for image retrieval and person re-ID tasks on diverse datasets demonstrates the effectiveness of the proposed approach, with results comparable to the state of the art.

PyConvU-Net: a lightweight and multiscale network for biomedical image segmentation

BMC Bioinformatics ◽

10.1186/s12859-020-03943-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Changyong Li ◽

Yongxian Fan ◽

Xiaodong Cai

Keyword(s):

Image Segmentation ◽

Deep Learning ◽

State Of The Art ◽

Experimental Results ◽

Actual Situation ◽

Controlled Experiments ◽

Biomedical Image ◽

Segmentation Methods ◽

Art Performance

Background: With the development of deep learning (DL), more and more DL-based methods have been proposed that achieve state-of-the-art performance in biomedical image segmentation. However, these methods are usually complex and require the support of powerful computing resources. In practice, it is impractical to use huge computing resources in clinical situations. Thus, it is important to develop accurate DL-based biomedical image segmentation methods that can run on resource-constrained computing.

Results: A lightweight and multiscale network called PyConvU-Net is proposed to potentially work with low-resource computing. In strictly controlled experiments, PyConvU-Net performs well on three biomedical image segmentation tasks with the fewest parameters.

Conclusions: Our experimental results preliminarily demonstrate the potential of the proposed PyConvU-Net in biomedical image segmentation with resource-constrained computing.
