scholarly journals Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing

Sensors ◽  
2021 ◽  
Vol 21 (23) ◽  
pp. 8161
Author(s):  
Xibao Xu ◽  
Yushen Chen ◽  
Chengchao Bai

Planetary soft landing has been studied extensively due to its promising application prospects. In this paper, a soft landing control algorithm based on deep reinforcement learning (DRL) with good convergence property is proposed. First, the soft landing problem of the powered descent phase is formulated and the theoretical basis of Reinforcement Learning (RL) used in this paper is introduced. Second, to make it easier to converge, a reward function is designed to include process rewards like velocity tracking reward, solving the problem of sparse reward. Then, by including the fuel consumption penalty and constraints violation penalty, the lander can learn to achieve velocity tracking goal while saving fuel and keeping attitude angle within safe ranges. Then, simulations of training are carried out under the frameworks of Deep deterministic policy gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor Critic (SAC), respectively, which are of the classical RL frameworks, and all converged. Finally, the trained policy is deployed into velocity tracking and soft landing experiments, results of which demonstrate the validity of the algorithm proposed.

Author(s):  
Feng Pan ◽  
Hong Bao

This paper proposes a new approach of using reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We refer to the ideal of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into reward vectors, and the reward function was defined as the inner product of the reward vector and weights. Driving data of human drivers was collected and analyzed to obtain the true reward function. The RL model was trained with the deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model could continuously approach that of a human driver. After dozens of rounds of training, we selected the policy with the nearest value vector to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance for the task of an agent following the preceding vehicle safely and smoothly.


2021 ◽  
pp. 1-10
Author(s):  
Wei Zhou ◽  
Xing Jiang ◽  
Bingli Guo (Member, IEEE) ◽  
Lingyu Meng

Currently, Quality-of-Service (QoS)-aware routing is one of the crucial challenges in Software Defined Network (SDN). The QoS performances, e.g. latency, packet loss ratio and throughput, must be optimized to improve the performance of network. Traditional static routing algorithms based on Open Shortest Path First (OSPF) could not adapt to traffic fluctuation, which may cause severe network congestion and service degradation. Central intelligence of SDN controller and recent breakthroughs of Deep Reinforcement Learning (DRL) pose a promising solution to tackle this challenge. Thus, we propose an on-policy DRL mechanism, namely the PPO-based (Proximal Policy Optimization) QoS-aware Routing Optimization Mechanism (PQROM), to achieve a general and re-customizable routing optimization. PQROM can dynamically update the routing calculation by adjusting the reward function according to different optimization objectives, and it is independent of any specific network pattern. Additionally, as a black-box one-step optimization, PQROM is qualified for both continuous and discrete action space with high-dimensional input and output. The OMNeT ++ simulation experiment results show that PQROM not only has good convergence, but also has better stability compared with OSPF, less training time and simpler hyper-parameters adjustment than Deep Deterministic Policy Gradient (DDPG) and less hardware consumption than Asynchronous Advantage Actor-Critic (A3C).


Author(s):  
Zhan Shi ◽  
Xinchi Chen ◽  
Xipeng Qiu ◽  
Xuanjing Huang

Text generation is a crucial task in NLP. Recently, several adversarial generative models have been proposed to improve the exposure bias problem in text generation. Though these models gain great success, they still suffer from the problems of reward sparsity and mode collapse. In order to address these two problems, in this paper, we employ inverse reinforcement learning (IRL) for text generation. Specifically, the IRL framework learns a reward function on training data, and then an optimal policy to maximum the expected total reward. Similar to the adversarial models, the reward and policy function in IRL are optimized alternately. Our method has two advantages: (1) the reward function can produce more dense reward signals. (2) the generation policy, trained by ``entropy regularized'' policy gradient, encourages to generate more diversified texts. Experiment results demonstrate that our proposed method can generate higher quality texts than the previous methods.


2020 ◽  
Vol 14 (1) ◽  
pp. 117-150
Author(s):  
Alberto Maria Metelli ◽  
Matteo Pirotta ◽  
Marcello Restelli

Reinforcement Learning (RL) is an effective approach to solve sequential decision making problems when the environment is equipped with a reward function to evaluate the agent’s actions. However, there are several domains in which a reward function is not available and difficult to estimate. When samples of expert agents are available, Inverse Reinforcement Learning (IRL) allows recovering a reward function that explains the demonstrated behavior. Most of the classic IRL methods, in addition to expert’s demonstrations, require sampling the environment to evaluate each reward function, that, in turn, is built starting from a set of engineered features. This paper is about a novel model-free IRL approach that does not require to specify a function space where to search for the expert’s reward function. Leveraging on the fact that the policy gradient needs to be zero for an optimal policy, the algorithm generates an approximation space for the reward function, in which a reward is singled out employing a second-order criterion. After introducing our approach for finite domains, we extend it to continuous ones. The empirical results, on both finite and continuous domains, show that the reward function recovered by our algorithm allows learning policies that outperform those obtained with the true reward function, in terms of learning speed.


Author(s):  
Zifei Jiang ◽  
Alan F. Lynch

We present a deep neural net-based controller trained by a model-free reinforcement learning (RL) algorithm to achieve hover stabilization for a quadrotor unmanned aerial vehicle (UAV). With RL, two neural nets are trained. One neural net is used as a stochastic controller which gives the distribution of control inputs. The other maps the UAV state to a scalar which estimates the reward of the controller. A proximal policy optimization (PPO) method, which is an actor-critic policy gradient approach, is used to train the neural nets. Simulation results show that the trained controller achieves a comparable level of performance to a manually-tuned PID controller, despite not depending on any model information. The paper considers different choices of reward function and their influence on controller performance.


Author(s):  
Ling Pan ◽  
Qingpeng Cai ◽  
Qi Meng ◽  
Wei Chen ◽  
Longbo Huang

Value function estimation is an important task in reinforcement learning, i.e., prediction. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with dynamic Boltzmann softmax (DBS) operator, which has good convergence property in the setting of planning and learning. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function, which rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying the DBS operator, which outperforms DQN substantially in 40 out of 49 Atari games.


Author(s):  
Zhen Yu ◽  
Yimin Feng ◽  
Lijun Liu

In general reinforcement learning tasks, the formulation of reward functions is a very important step in reinforcement learning. The reward function is not easy to formulate in a large number of systems. The network training effect is sensitive to the reward function, and different reward value functions will get different results. For a class of systems that meet specific conditions, the traditional reinforcement learning method is improved. A state quantity function is designed to replace the reward function, which is more efficient than the traditional reward function. At the same time, the predictive network link is designed so that the network can learn the value of the general state by using the special state. The overall structure of the network will be improved based on the Deep Deterministic Policy Gradient (DDPG) algorithm. Finally, the algorithm was successfully applied in the environment of FrozenLake, and achieved good performance. The experiment proves the effectiveness of the algorithm and realizes rewardless reinforcement learning in a class of systems.


2021 ◽  
Vol 56 (3) ◽  
pp. 1-17
Author(s):  
Ekene G. Okafor ◽  
Daniel Udekwe ◽  
Osichinaka C. Ubadike ◽  
Emmanuel Okafor ◽  
Paul O. Jemitola ◽  
...  

Maximum power point tracking (MPPT) entails constraining photovoltaic (PV) modules to operate under a specified power condition. It has previously been shown that some meta-heuristic techniques often suffer from steady-state oscillations around maximum points and experience difficulty in adapting to environmental variations, such as irradiation and/or temperature. To address the aforementioned limitation, this work proposed an adaptable reinforcement learning (RL) technique based on a novel deep deterministic policy gradient (DDPG) agent and a reward function. The actor–network top layer uses a sigmoid activation function and the critic–network contains bottleneck layers with non-uniform nodal distributions as well as exponential linear unit (ELU) activation functions in some of the layers. The RL based on DDPG method was compared with Particle Swarm Optimization (PSO) and Perturb-and-Observe (P&O) in order to determine the optimal duty-cycle command needed for controlling the PV modules MPPT. All the investigated systems were implemented in MATLAB/Simulink. The results show that the proposed RL technique based on DDPG agent yielded superior tracking efficiency than all the other approaches. However, as the step change in irradiation at a constant temperature increases, the RL technique based on DDPG agent shows a decrease in tracking efficiency.


Sign in / Sign up

Export Citation Format

Share Document