Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing

Planetary soft landing has been studied extensively due to its promising application prospects. In this paper, a soft landing control algorithm based on deep reinforcement learning (DRL) with good convergence properties is proposed. First, the soft landing problem of the powered descent phase is formulated and the theoretical basis of the reinforcement learning (RL) methods used in this paper is introduced. Second, to ease convergence, the reward function is designed to include process rewards such as a velocity tracking reward, which addresses the sparse reward problem. By also including a fuel consumption penalty and a constraint violation penalty, the lander learns to track the target velocity while saving fuel and keeping its attitude angle within safe ranges. Training simulations are then carried out under three classical RL frameworks, Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC), and all of them converge. Finally, the trained policy is deployed in velocity tracking and soft landing experiments, whose results demonstrate the validity of the proposed algorithm.
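
The reward design described in this abstract (a dense velocity tracking term plus fuel and constraint penalties) can be illustrated with a minimal sketch. The state fields, weights, and angle limit below are assumptions chosen for illustration; the abstract does not give the paper's actual reward terms or coefficients.

```python
from dataclasses import dataclass


@dataclass
class LanderState:
    # Hypothetical state layout, not taken from the paper.
    velocity: float         # current descent velocity (m/s)
    target_velocity: float  # reference velocity to track (m/s)
    attitude_angle: float   # attitude angle (rad)
    fuel_used: float        # fuel consumed during this step (kg)


def shaped_reward(state, w_track=1.0, w_fuel=0.1,
                  max_angle=0.5, violation_penalty=10.0):
    """Dense per-step reward: tracking term plus fuel and constraint penalties."""
    # Process reward: the closer to the target velocity, the higher the reward.
    tracking = -w_track * abs(state.velocity - state.target_velocity)
    # Fuel consumption penalty.
    fuel = -w_fuel * state.fuel_used
    # Constraint violation penalty when the attitude angle leaves the safe range.
    violation = -violation_penalty if abs(state.attitude_angle) > max_angle else 0.0
    return tracking + fuel + violation
```

Because the tracking term is evaluated at every step, the agent receives a learning signal long before touchdown, which is the point of replacing a sparse terminal reward with process rewards.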

Preceding vehicle following algorithm with human driving characteristics

Proceedings of the Institution of Mechanical Engineers Part D Journal of Automobile Engineering · DOI: 10.1177/0954407020981546 · 2021 · pp. 095440702098154

Author(s): Feng Pan, Hong Bao

Keyword(s): Reinforcement Learning, Weight Vector, Gradient Algorithm, Inner Product, Inverse Reinforcement Learning, Reward Function, Human Driver, Policy Gradient, Preceding Vehicle, Action Spaces

This paper proposes a new approach that uses reinforcement learning (RL) to train an agent to perform vehicle following with human driving characteristics. We draw on the idea of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into reward vectors, and the reward function was defined as the inner product of the reward vector and a weight vector. Driving data from human drivers were collected and analyzed to obtain the true reward function. The RL model was trained with the deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model continuously approached that of a human driver. After dozens of rounds of training, we selected the policy whose value vector was nearest to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance: the agent followed the preceding vehicle safely and smoothly.
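
The reward defined here as an inner product of a reward vector and a weight vector can be sketched as follows. The three feature names (gap error, relative speed, jerk) and the numeric weights are hypothetical; the abstract does not list the factors the authors actually vectorized.

```python
def linear_reward(features, weights):
    """Reward as the inner product of a reward (feature) vector and a weight vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights))


# Hypothetical features for one time step: [gap error, relative speed, jerk].
features = [1.0, 2.0, 3.0]
# Hypothetical weights; in the paper these are tuned to match a human driver.
weights = [0.5, 0.0, -1.0]
step_reward = linear_reward(features, weights)
```

Keeping the reward linear in the weights means that adjusting the weight vector changes the trade-off between factors without redefining the factors themselves.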

PQROM: To optimize software defined network QoS-aware routing with proximal policy optimization

Journal of Intelligent & Fuzzy Systems · DOI: 10.3233/jifs-211787 · 2021 · pp. 1-10

Author(s): Wei Zhou, Xing Jiang, Bingli Guo (Member, IEEE), Lingyu Meng

Keyword(s): Software Defined Network, Training Time, Good Convergence, Reward Function, Routing Optimization, Policy Gradient, Discrete Action, Policy Optimization, Optimization Mechanism, Network Pattern

Currently, Quality-of-Service (QoS)-aware routing is one of the crucial challenges in Software Defined Networks (SDN). QoS metrics such as latency, packet loss ratio and throughput must be optimized to improve network performance. Traditional static routing algorithms based on Open Shortest Path First (OSPF) cannot adapt to traffic fluctuation, which may cause severe network congestion and service degradation. The central intelligence of the SDN controller and recent breakthroughs in Deep Reinforcement Learning (DRL) offer a promising solution to this challenge. Thus, we propose an on-policy DRL mechanism, namely the PPO-based (Proximal Policy Optimization) QoS-aware Routing Optimization Mechanism (PQROM), to achieve general and re-customizable routing optimization. PQROM can dynamically update the routing calculation by adjusting the reward function according to different optimization objectives, and it is independent of any specific network pattern. Additionally, as a black-box one-step optimization, PQROM is suitable for both continuous and discrete action spaces with high-dimensional input and output. The OMNeT++ simulation results show that PQROM not only converges well, but also has better stability than OSPF, shorter training time and simpler hyper-parameter tuning than Deep Deterministic Policy Gradient (DDPG), and lower hardware consumption than Asynchronous Advantage Actor-Critic (A3C).

Dynamic Control Algorithm for Biped Walking Based on Policy Gradient Fuzzy Reinforcement Learning

IFAC Proceedings Volumes · DOI: 10.3182/20080706-5-kr-1001.00294 · 2008 · Vol 41 (2) · pp. 1717-1722 · Cited By ~ 1

Author(s): Duško M. Katić, Aleksandar D. Rodić

Keyword(s): Reinforcement Learning, Control Algorithm, Dynamic Control, Biped Walking, Policy Gradient

Toward Diverse Text Generation with Inverse Reinforcement Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence · DOI: 10.24963/ijcai.2018/606 · 2018 · Cited By ~ 5

Author(s): Zhan Shi, Xinchi Chen, Xipeng Qiu, Xuanjing Huang

Keyword(s): Reinforcement Learning, Generative Models, Training Data, Great Success, Text Generation, Inverse Reinforcement Learning, Reward Function, Total Reward, Policy Gradient, Adversarial Models

Text generation is a crucial task in NLP. Recently, several adversarial generative models have been proposed to mitigate the exposure bias problem in text generation. Though these models have achieved great success, they still suffer from reward sparsity and mode collapse. To address these two problems, in this paper we employ inverse reinforcement learning (IRL) for text generation. Specifically, the IRL framework learns a reward function on training data, and then an optimal policy that maximizes the expected total reward. Similar to the adversarial models, the reward and policy functions in IRL are optimized alternately. Our method has two advantages: (1) the reward function can produce denser reward signals; (2) the generation policy, trained by an "entropy regularized" policy gradient, encourages the generation of more diversified texts. Experimental results demonstrate that our proposed method can generate higher quality texts than previous methods.
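
To make the "entropy regularized" objective concrete, the sketch below adds an entropy bonus to the expected reward of a categorical policy over next tokens. It is a generic illustration of entropy regularization, not the paper's exact formulation; the function names and the coefficient `beta` are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(probs, rewards, beta=0.1):
    """Expected reward plus an entropy bonus weighted by beta (assumed coefficient)."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + beta * entropy(probs)
```

When two tokens earn the same reward, a policy that spreads probability across both scores higher than one that always picks a single token, which is how the entropy term pushes toward more diverse output.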

On the use of the policy gradient and Hessian in inverse reinforcement learning

Intelligenza Artificiale · DOI: 10.3233/ia-180011 · 2020 · Vol 14 (1) · pp. 117-150

Author(s): Alberto Maria Metelli, Matteo Pirotta, Marcello Restelli

Keyword(s): Reinforcement Learning, Sequential Decision, Inverse Reinforcement Learning, Reward Function, Model Free, Learning Speed, Policy Gradient, Continuous Domains, Learning Policies, Finite Domains

Reinforcement Learning (RL) is an effective approach to solving sequential decision-making problems when the environment is equipped with a reward function to evaluate the agent’s actions. However, there are several domains in which a reward function is not available and is difficult to estimate. When samples from expert agents are available, Inverse Reinforcement Learning (IRL) allows recovering a reward function that explains the demonstrated behavior. Most classic IRL methods require, in addition to the expert’s demonstrations, sampling the environment to evaluate each reward function, which in turn is built from a set of engineered features. This paper presents a novel model-free IRL approach that does not require specifying a function space in which to search for the expert’s reward function. Leveraging the fact that the policy gradient must be zero for an optimal policy, the algorithm generates an approximation space for the reward function, in which a reward is singled out by a second-order criterion. After introducing our approach for finite domains, we extend it to continuous ones. The empirical results, on both finite and continuous domains, show that the reward function recovered by our algorithm allows learning policies that outperform those obtained with the true reward function in terms of learning speed.

Deep Reinforcement Learning Automatic Landing Control of Fixed-Wing Aircraft Using Deep Deterministic Policy Gradient

2020 International Conference on Unmanned Aircraft Systems (ICUAS) · DOI: 10.1109/icuas48674.2020.9213987 · 2020

Author(s): Chi Tang, Ying-Chih Lai

Keyword(s): Reinforcement Learning, Automatic Landing, Policy Gradient, Landing Control

Quadrotor Motion Control Using Deep Reinforcement Learning

Journal of Unmanned Vehicle Systems · DOI: 10.1139/juvs-2021-0010 · 2021

Author(s): Zifei Jiang, Alan F. Lynch

Keyword(s): Reinforcement Learning, Neural Nets, Neural Net, Reward Function, Model Free, Policy Gradient, Aerial Vehicle, Stochastic Controller, Policy Optimization, Gradient Approach

We present a deep neural net-based controller trained by a model-free reinforcement learning (RL) algorithm to achieve hover stabilization for a quadrotor unmanned aerial vehicle (UAV). With RL, two neural nets are trained. One neural net is used as a stochastic controller which gives the distribution of control inputs. The other maps the UAV state to a scalar which estimates the reward of the controller. A proximal policy optimization (PPO) method, which is an actor-critic policy gradient approach, is used to train the neural nets. Simulation results show that the trained controller achieves a comparable level of performance to a manually-tuned PID controller, despite not depending on any model information. The paper considers different choices of reward function and their influence on controller performance.
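
The PPO method mentioned in this abstract is usually implemented with a clipped surrogate objective. The sketch below computes that objective for a single sample. The clipping range `epsilon=0.2` is a commonly used default and an assumption here, since the abstract does not state the authors' hyper-parameters.

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- advantage estimate supplied by the critic network
    epsilon   -- clipping range (assumed value, not taken from the paper)
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy ratio far from 1 in a single update, which is what keeps each policy step small.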

Reinforcement Learning with Dynamic Boltzmann Softmax Updates

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence · DOI: 10.24963/ijcai.2020/276 · 2020

Author(s): Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, Longbo Huang

Keyword(s): Reinforcement Learning, Value Function, Convergence Property, Important Task, Experimental Results, Value Iteration, Function Estimation, Good Convergence, Direct Use, The Value Function

Value function estimation, i.e., prediction, is an important task in reinforcement learning. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with the dynamic Boltzmann softmax (DBS) operator, which has good convergence properties in both the planning and learning settings. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function and rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying the DBS operator, which outperforms DQN substantially in 40 out of 49 Atari games.
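
The Boltzmann softmax operator discussed here is a weighted average of action values, with weights proportional to exp(beta * Q). The sketch below implements that operator and pairs it with a temperature that grows with the iteration count. The schedule `beta_t = t` is a hypothetical choice for illustration; the abstract does not specify the schedule used in the paper.

```python
import math


def boltzmann_softmax(values, beta):
    """Boltzmann softmax operator: sum_i exp(beta * q_i) * q_i / sum_j exp(beta * q_j)."""
    m = max(values)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def dynamic_beta(t):
    """Hypothetical increasing schedule for the temperature parameter at iteration t."""
    return float(t)


q_values = [1.0, 2.0, 3.0]
# Estimates at iterations 0, 1, 10 and 100 under the assumed schedule.
estimates = [boltzmann_softmax(q_values, dynamic_beta(t)) for t in (0, 1, 10, 100)]
```

At beta = 0 the operator returns the mean of the values, and as beta grows it approaches the max operator, so an increasing schedule moves the update toward the standard Bellman backup over time.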

Reward-Free Reinforcement Learning Algorithm Using Prediction Network

Fuzzy Systems and Data Mining VI - Frontiers in Artificial Intelligence and Applications · DOI: 10.3233/faia200744 · 2020

Author(s): Zhen Yu, Yimin Feng, Lijun Liu

Keyword(s): Reinforcement Learning, Learning Algorithm, Value Functions, Learning Method, Reward Function, Network Training, Learning Tasks, Reward Value, Policy Gradient, Reward Functions

In general reinforcement learning tasks, formulating the reward function is a very important step, yet in many systems the reward function is not easy to formulate. Network training is sensitive to the reward function, and different reward functions lead to different results. For a class of systems that meet specific conditions, the traditional reinforcement learning method is improved. A state quantity function is designed to replace the reward function, and it is more efficient than the traditional reward function. At the same time, a predictive network link is designed so that the network can learn the value of general states from special states. The overall network structure is built on the Deep Deterministic Policy Gradient (DDPG) algorithm. Finally, the algorithm was successfully applied in the FrozenLake environment and achieved good performance. The experiment demonstrates the effectiveness of the algorithm and realizes reward-free reinforcement learning in a class of systems.

Photovoltaic System MPPT Evaluation Using Classical, Meta-Heuristics, and Reinforcement Learning-Based Controllers: A Comparative Study

Journal of Southwest Jiaotong University · DOI: 10.35741/issn.0258-2724.56.3.1 · 2021 · Vol 56 (3) · pp. 1-17

Author(s): Ekene G. Okafor, Daniel Udekwe, Osichinaka C. Ubadike, Emmanuel Okafor, Paul O. Jemitola, ...

Keyword(s): Reinforcement Learning, Activation Function, Photovoltaic System, Reward Function, Point Tracking, Power Point Tracking, Pv Modules, Policy Gradient, Environmental Variations, Power Point

Maximum power point tracking (MPPT) entails constraining photovoltaic (PV) modules to operate under a specified power condition. It has previously been shown that some meta-heuristic techniques often suffer from steady-state oscillations around maximum points and have difficulty adapting to environmental variations, such as irradiation and/or temperature. To address these limitations, this work proposes an adaptable reinforcement learning (RL) technique based on a novel deep deterministic policy gradient (DDPG) agent and a reward function. The top layer of the actor network uses a sigmoid activation function, and the critic network contains bottleneck layers with non-uniform nodal distributions as well as exponential linear unit (ELU) activation functions in some of the layers. The DDPG-based RL method was compared with Particle Swarm Optimization (PSO) and Perturb-and-Observe (P&O) in determining the optimal duty-cycle command needed for MPPT control of the PV modules. All the investigated systems were implemented in MATLAB/Simulink. The results show that the proposed DDPG-based RL technique yielded higher tracking efficiency than all the other approaches. However, as the step change in irradiation at a constant temperature increases, the tracking efficiency of the DDPG-based RL technique decreases.
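
For context on the Perturb-and-Observe (P&O) baseline named in this abstract, the sketch below shows the classical hill-climbing loop on the duty cycle. The quadratic power curve, the initial duty cycle, and the step size are hypothetical stand-ins for a real PV model; the study itself was implemented in MATLAB/Simulink, not Python.

```python
def toy_power_curve(duty):
    """Hypothetical PV power curve with its maximum at duty = 0.6."""
    return 1.0 - (duty - 0.6) ** 2


def perturb_and_observe(power_fn, duty=0.2, step=0.05, iters=50):
    """Classical P&O: keep perturbing in the same direction while power rises,
    reverse the direction when power falls."""
    direction = 1.0
    prev_power = power_fn(duty)
    for _ in range(iters):
        # Perturb the duty cycle, keeping it inside [0, 1].
        duty = min(1.0, max(0.0, duty + direction * step))
        power = power_fn(duty)
        if power < prev_power:
            direction = -direction  # power dropped: reverse the perturbation
        prev_power = power
    return duty


final_duty = perturb_and_observe(toy_power_curve)
```

With a fixed step size the loop never settles exactly on the maximum and keeps stepping back and forth across it, which is the kind of steady-state oscillation that learned controllers aim to reduce.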
