A Deep Reinforcement Learning Algorithm Based on Tetanic Stimulation and Amnesic Mechanisms for Continuous Control of Multi-DOF Manipulator

Deep Reinforcement Learning (DRL) has been an active research area because of its ability to solve large-scale control problems. To date, many algorithms have been developed, such as Deep Deterministic Policy Gradient (DDPG) and Twin-Delayed Deep Deterministic Policy Gradient (TD3). However, DRL often requires large collected data sets and many training episodes to converge, which is data-inefficient and computationally expensive. Motivated by this problem, we propose a Twin-Delayed Deep Deterministic Policy Gradient algorithm with a Rebirth Mechanism, Tetanic Stimulation and Amnesic Mechanisms (ATRTD3) for continuous control of a multi-DOF manipulator. During training, the weights of the neural network are learned using the tetanic stimulation and amnesia mechanisms. The main contribution of this paper is a biomimetic view of how to speed up convergence, drawing on the biochemical reactions generated by neurons in the biological brain during memory and forgetting. The effectiveness of the proposed algorithm is validated by a simulation example that includes comparisons with previously developed DRL algorithms. The results indicate that our approach improves convergence speed and precision.
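For readers unfamiliar with the TD3 baseline named in this abstract, the following is a minimal sketch of the standard TD3 critic target (clipped double-Q with target policy smoothing). It does not implement the rebirth, tetanic stimulation, or amnesic mechanisms, which the abstract does not specify. The function names, the scalar action, and the default hyperparameters are illustrative assumptions.

```python
import random


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0,
               rng=random):
    """Standard TD3 critic target for a scalar action (illustrative only).

    target_actor(state) -> action and target_q*(state, action) -> value are
    hypothetical callables standing in for the target networks.
    """
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -act_limit, act_limit)
    # Clipped double-Q: take the smaller of the two target critics.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    # No bootstrapping past a terminal transition.
    return reward + gamma * (1.0 - float(done)) * q_min
```

Taking the minimum of the two critics counteracts value overestimation, and the smoothing noise keeps the critic from exploiting sharp peaks in the value estimate.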


Deterministic Value-Policy Gradients

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5732 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3316-3323

Author(s):

Qingpeng Cai ◽

Ling Pan ◽

Pingzhong Tang

Keyword(s):

Reinforcement Learning ◽

State Of The Art ◽

Learning Algorithms ◽

Infinite Horizon ◽

Gradient Algorithm ◽

Continuous Control ◽

Model Bias ◽

Model Free ◽

Policy Gradient ◽

Analytical Gradients

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with the infinite horizon. We first give a theoretical guarantee of the existence of the value gradients in the infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which the number of rollout steps of the analytical gradients through the learned model trades off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.


Revising the Observation Satellite Scheduling Problem Based on Deep Reinforcement Learning

Remote Sensing ◽

10.3390/rs13122377 ◽

2021 ◽

Vol 13 (12) ◽

pp. 2377

Author(s):

Yixin Huang ◽

Zhongcheng Mu ◽

Shufan Wu ◽

Benjie Cui ◽

Yuxiao Duan

Keyword(s):

Reinforcement Learning ◽

Task Scheduling ◽

Large Scale ◽

Gradient Algorithm ◽

Experimental Simulation ◽

Scheduling Problem ◽

Policy Gradient ◽

Scheduling Method ◽

Satellite Scheduling ◽

Metaheuristic Optimization Algorithms

Earth observation satellite task scheduling research plays a key role in space-based remote sensing services. An effective task scheduling strategy can maximize the utilization of satellite resources and obtain larger objective observation profits. In this paper, inspired by the success of deep reinforcement learning in optimization domains, the deep deterministic policy gradient algorithm is adopted to solve a time-continuous satellite task scheduling problem. Moreover, an improved graph-based minimum clique partition algorithm is proposed for preprocessing in the task clustering phase by considering the maximum task priority and the minimum observation slewing angle under constraint conditions. Experimental simulation results demonstrate that the deep reinforcement learning-based task scheduling method is feasible and performs much better than traditional metaheuristic optimization algorithms, especially in large-scale problems.


UAV Autonomous Aerial Combat Maneuver Strategy Generation with Observation Error Based on State-Adversarial Deep Deterministic Policy Gradient and Inverse Reinforcement Learning

Electronics ◽

10.3390/electronics9071121 ◽

2020 ◽

Vol 9 (7) ◽

pp. 1121 ◽

Cited By ~ 2

Author(s):

Weiren Kong ◽

Deyun Zhou ◽

Zhen Yang ◽

Yiyang Zhao ◽

Kai Zhang

Keyword(s):

Reinforcement Learning ◽

High Performance ◽

Learning Algorithm ◽

Gradient Algorithm ◽

Observation Error ◽

Inverse Reinforcement Learning ◽

Generation Algorithm ◽

Air Combat ◽

Policy Gradient ◽

Aerial Combat

With the development of unmanned aerial vehicle (UAV) and artificial intelligence (AI) technology, intelligent UAVs will be widely used in future autonomous aerial combat. Previous research on autonomous aerial combat within visual range (WVR) is limited by simplifying assumptions, limited robustness, and neglect of sensor errors. In this paper, to account for the error of the aircraft sensors, we model WVR aerial combat as a state-adversarial Markov decision process (SA-MDP), which introduces small adversarial perturbations on state observations. These perturbations do not alter the environment directly, but can mislead the agent into making suboptimal decisions. We also propose a novel, high-performance and highly robust algorithm for generating autonomous aerial combat maneuver strategies, based on the state-adversarial deep deterministic policy gradient algorithm (SA-DDPG), which adds to the actor network a robustness regularizer related to an upper bound on performance loss. In addition, a reward shaping method based on the maximum entropy (MaxEnt) inverse reinforcement learning (IRL) algorithm is proposed to improve the efficiency of the aerial combat strategy generation algorithm. Finally, the efficiency of the strategy generation algorithm and the performance and robustness of the resulting aerial combat strategy are verified by simulation experiments. Our main contributions are three-fold. First, to introduce the observation errors of the UAV, we model air combat as an SA-MDP. Second, to make the strategy network for air combat maneuvers more robust in the presence of observation errors, we introduce regularizers into the policy gradient. Third, to address the sparsity of the air combat reward function, we use MaxEnt IRL to design a shaping reward that accelerates the convergence of SA-DDPG.


Mapless Collaborative Navigation for a Multi-Robot System Based on the Deep Reinforcement Learning

Applied Sciences ◽

10.3390/app9204198 ◽

2019 ◽

Vol 9 (20) ◽

pp. 4198

Author(s):

Wenzhou Chen ◽

Shizheng Zhou ◽

Zaisheng Pan ◽

Huixian Zheng ◽

Yong Liu

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Gradient Algorithm ◽

Lidar Data ◽

Robot System ◽

Navigation Task ◽

System A ◽

Group Navigation ◽

Policy Gradient ◽

Multi Robot

Compared with a single-robot system, a multi-robot system has higher efficiency and fault tolerance. Multi-robot systems have great potential in application scenarios such as robot search, rescue, and escort tasks. Deep reinforcement learning provides a potential framework for multi-robot formation and collaborative navigation. This paper mainly studies the collaborative formation and navigation of multiple robots using a deep reinforcement learning algorithm. The proposed method improves the classical Deep Deterministic Policy Gradient (DDPG) to address the single-robot mapless navigation task. We also extend the single-robot Deep Deterministic Policy Gradient algorithm to the multi-robot system, obtaining the Parallel Deep Deterministic Policy Gradient (PDDPG). Using a 2D lidar sensor, the group of robots can accomplish the formation construction task and the collaborative formation navigation task. Experimental results in a Gazebo simulation platform illustrate that our method is capable of guiding mobile robots to construct the formation and keep it during group navigation, directly from raw lidar data inputs.


Application of a Deep Deterministic Policy Gradient Algorithm for Energy-Aimed Timetable Rescheduling Problem

Energies ◽

10.3390/en12183461 ◽

2019 ◽

Vol 12 (18) ◽

pp. 3461 ◽

Cited By ~ 6

Author(s):

Guang Yang ◽

Feng Zhang ◽

Cheng Gong ◽

Shiwen Zhang

Keyword(s):

Reinforcement Learning ◽

Real Time ◽

Learning Algorithm ◽

Gradient Algorithm ◽

Q Learning ◽

Train Timetable ◽

Continuous State ◽

Policy Gradient ◽

Random Disturbances ◽

Metro Network

Reinforcement learning has potential in the area of intelligent transportation due to its generality and real-time capability. The Q-learning algorithm, an early reinforcement learning algorithm, has its own merits for solving the train timetable rescheduling (TTR) problem. However, it falls short in two respects: dimensional limits on the action space and a slow convergence rate. In this paper, a deep deterministic policy gradient (DDPG) algorithm is applied to solve the energy-aimed train timetable rescheduling (ETTR) problem. As a reinforcement learning algorithm, it fulfills the real-time requirements of the ETTR problem and adapts to random disturbances. Unlike Q-learning, DDPG works with continuous state and action spaces. After enough training, the DDPG-based learning agent takes proper action by continuously adjusting the cruising speed and the dwelling time of each train in a metro network when random disturbances happen. Although training requires thousands of episodes, the policy decision during each testing episode takes a very short time. Models of the metro network, based on a real case of the Shanghai Metro Line 1, are established as a training and testing environment. To validate the energy-saving effect and the real-time capability of the proposed algorithm, four experiments are designed and conducted. Compared with the no-action strategy, results show that the proposed algorithm achieves real-time performance and saves a significant percentage of energy under random disturbances.


PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

ACM Transactions on Intelligent Systems and Technology ◽

10.1145/3452008 ◽

2021 ◽

Vol 12 (3) ◽

pp. 1-21

Author(s):

Shilei Li ◽

Meng Li ◽

Jiongming Su ◽

Shaofei Chen ◽

Zhimin Yuan ◽

...

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Gradient Methods ◽

Action Space ◽

Fine Tuning ◽

Continuous Control ◽

Parametric Perturbation ◽

Gradient Information ◽

Policy Gradient ◽

Gradient Based

Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) operating in high-dimensional action and state spaces. Recently, a promising approach that combines exploration in the action space with exploration in the parameter space has been proposed to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines the evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with an actor-critic method, the deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parametric perturbation part) can evolve in a guided manner by utilizing the gradient information provided by the DDPG, and the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve the sample efficiency. In particular, we propose a criterion to determine the number of training steps required for the DDPG, ensuring that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a more balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the same previous RL-Actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
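The distinction this abstract draws between exploring in the action space and exploring in the parameter space can be illustrated with a small sketch. This is not the PP-PG algorithm. It assumes a hypothetical linear policy with a scalar action and Gaussian noise, and every name in it is illustrative.

```python
import random


def linear_policy(weights, state):
    """Deterministic scalar action: dot product of weights and state."""
    return sum(w * s for w, s in zip(weights, state))


def act_with_action_noise(weights, state, sigma, rng):
    """Action-space exploration: perturb the action the policy outputs."""
    return linear_policy(weights, state) + rng.gauss(0.0, sigma)


def perturb_parameters(weights, sigma, rng):
    """Parameter-space exploration: perturb the policy weights themselves.

    The perturbed policy is then run unchanged for a whole episode, so its
    behaviour stays consistent from state to state.
    """
    return [w + rng.gauss(0.0, sigma) for w in weights]


rng = random.Random(0)
weights, state = [1.0, -2.0], [3.0, 1.0]
noisy_action = act_with_action_noise(weights, state, 0.1, rng)
explorer = perturb_parameters(weights, 0.1, rng)
explorer_action = linear_policy(explorer, state)
```

An evolutionary population in the sense of the abstract would hold many such perturbed weight vectors and select among them by episode return. That selection step is omitted here.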


Diversity Evolutionary Policy Deep Reinforcement Learning

Computational Intelligence and Neuroscience ◽

10.1155/2021/5300189 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Jian Liu ◽

Liming Feng

Keyword(s):

Reinforcement Learning ◽

Entropy Method ◽

Gradient Algorithm ◽

Continuous Control ◽

Test Environment ◽

Maximum Mean Discrepancy ◽

Learning Agent ◽

Cross Entropy Method ◽

Policy Gradient ◽

Update Process

Reinforcement learning algorithms based on policy gradients may fall into local optima when gradients vanish during the update process, which in turn limits the exploration ability of the reinforcement learning agent. To solve this problem, this paper combines the cross-entropy method (CEM) from evolutionary policy search, maximum mean discrepancy (MMD), and the twin delayed deep deterministic policy gradient algorithm (TD3) to propose a diversity evolutionary policy deep reinforcement learning (DEPRL) algorithm. Using the maximum mean discrepancy as a measure of the distance between different policies, some of the policies in the population maximize their distance from the previous generation of policies while maximizing the cumulative return during the gradient update. Furthermore, combining the cumulative returns and the distance between policies as the fitness of the population encourages more diversity in the offspring policies, which in turn reduces the risk of falling into local optima when gradients vanish. Results in the MuJoCo test environment show that DEPRL achieves excellent performance on continuous control tasks. In the Ant-v2 environment in particular, the return of DEPRL ultimately improved by nearly 20% over TD3.
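As a concrete reference for the distance measure named in this abstract, here is a minimal sketch of the biased empirical estimate of squared maximum mean discrepancy between two sample sets. The Gaussian kernel, the bandwidth, and the use of action vectors as samples are assumptions. The abstract does not say which kernel or which samples DEPRL uses.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return (mean_kernel(xs, xs) + mean_kernel(ys, ys)
            - 2.0 * mean_kernel(xs, ys))


# Hypothetical actions produced by two policies on the same batch of states.
actions_a = [(0.0, 0.0), (1.0, 0.0)]
actions_b = [(5.0, 5.0), (6.0, 5.0)]
distance = mmd_squared(actions_a, actions_b)
```

The estimate is zero when the two sample sets are identical and grows as the sets move apart, which is the property a diversity term in a fitness function relies on.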


Toward Self-Driving Bicycles Using State-of-the-Art Deep Reinforcement Learning Algorithms

Symmetry ◽

10.3390/sym11020290 ◽

2019 ◽

Vol 11 (2) ◽

pp. 290 ◽

Cited By ~ 4

Author(s):

SeungYoon Choi ◽

Tuyen Le ◽

Quang Nguyen ◽

Md Layek ◽

SeungGwan Lee ◽

...

Keyword(s):

Reinforcement Learning ◽

Deep Neural Network ◽

Learning Algorithm ◽

State Of The Art ◽

The Other ◽

Gradient Algorithm ◽

Reward Function ◽

Policy Gradient ◽

Policy Optimization ◽

Start Location

In this paper, we propose a controller for a bicycle using the DDPG (Deep Deterministic Policy Gradient) algorithm, which is a state-of-the-art deep reinforcement learning algorithm. We use a reward function and a deep neural network to build the controller. By using the proposed controller, a bicycle can not only be stably balanced but also travel to any specified location. We confirm that the controller with DDPG shows better performance than the other baselines such as Normalized Advantage Function (NAF) and Proximal Policy Optimization (PPO). For the performance evaluation, we implemented the proposed algorithm in various settings such as fixed and random speed, start location, and destination location.


Autonomous Bus Fleet Control Using Multiagent Reinforcement Learning

Journal of Advanced Transportation ◽

10.1155/2021/6654254 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Sung-Jung Wang ◽

S. K. Jason Chang

Keyword(s):

Reinforcement Learning ◽

Intelligent Agents ◽

Large Scale ◽

Gradient Algorithm ◽

Transport Systems ◽

Efficient Operation ◽

Fleet Size ◽

Agent Based ◽

Policy Gradient ◽

Multi Agent

Autonomous buses are becoming increasingly popular and have been widely developed in many countries. However, autonomous buses must learn to navigate the city efficiently to be integrated into public transport systems. Efficient operation of these buses can be achieved by intelligent agents through reinforcement learning. In this study, we investigate the autonomous bus fleet control problem, which appears noisy to the agents owing to random arrivals and incomplete observation of the environment. We propose a multi-agent reinforcement learning method combined with an advanced policy gradient algorithm for this large-scale dynamic optimization problem. An agent-based simulation platform was developed to model the dynamic system of a fixed stop/station loop route, autonomous bus fleet, and passengers. This platform was also applied to assess the performance of the proposed algorithm. The experimental results indicate that the developed algorithm outperforms other reinforcement learning methods in the multi-agent domain. The simulation results also show that the proposed algorithm outperforms the existing scheduled bus system in terms of bus fleet size and passenger wait times on bus routes with comparatively few passengers.


Advances in Water Treatment Application of Sepiolite Mineral Materials

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.710.217 ◽

2013 ◽

Vol 710 ◽

pp. 217-220 ◽

Cited By ~ 1

Author(s):

Fei Wang ◽

Lei Feng ◽

Meng Ran Tang ◽

Ji Yuan Li ◽

Qing Guo Tang

Keyword(s):

Water Treatment ◽

Size Effect ◽

Large Scale ◽

Surface Effect ◽

Development Trend ◽

High Energy ◽

Research Area ◽

Group Mineral ◽

Active Research ◽

Treatment Application

Synthetic nanomaterials have the disadvantages of large-scale investment, high energy consumption, a complex production process and a heavy environmental load. Mineral nanomaterials such as sepiolite group mineral nanomaterials are characterized by the small size effect, the quantum size effect and the surface effect. The water treatment application of sepiolite group mineral nanomaterials has become an active research area and shows good development and application prospects. For these reasons, this paper systematically summarizes the water treatment applications of sepiolite group mineral nanomaterials and outlines the related development trends.
