Improving the efficiency of reinforcement learning for a spacecraft powered descent with Q-learning

Author(s):  
Callum Wilson ◽  
Annalisa Riccardi

Abstract
Reinforcement learning entails many intuitive and useful approaches to solving various problems. Its main premise is to learn how to complete tasks by interacting with the environment and observing which actions perform best with respect to a reward signal. Methods from reinforcement learning have long been applied in aerospace and have more recently seen renewed interest in space applications. Problems in spacecraft control can benefit from intelligent techniques when faced with significant uncertainties, as is common in space environments. Solving these control problems using reinforcement learning remains a challenge, partly due to long training times and the sensitivity of performance to hyperparameters, which require careful tuning. In this work we address both issues for a sample spacecraft control problem. To reduce training times compared to other approaches, we simplify the problem by discretising the action space and use a data-efficient algorithm to train the agent. Furthermore, we employ an automated approach to hyperparameter selection which optimises for a specified performance metric. Our approach is tested on a 3-DOF powered descent problem with uncertainties in the initial conditions. We run experiments with two different problem formulations: a 'shaped' state representation that guides the agent, and a 'raw' state representation with unprocessed values of position, velocity, and mass. The results show that an agent can learn a near-optimal policy efficiently when the action space and state space are appropriately defined. Using the raw state representation led to 'reward hacking' and poor performance, which highlights the importance of the problem and state-space formulation in successfully training reinforcement learning agents. In addition, we show that the optimal hyperparameters can vary significantly with the choice of loss function. Using two sets of hyperparameters optimised for different loss functions, we demonstrate that in both cases the agent can find near-optimal policies with performance comparable to previously applied methods.
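As a rough illustration of the discretisation idea described above, the sketch below limits each axis of a 3-DOF thrust command to a few levels and applies a standard epsilon-greedy Q-learning update with linear function approximation. The thrust levels, state layout, and learning rates are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the paper's code) of discretising a continuous 3-DOF
# thrust command and training with Q-learning.
import itertools
import numpy as np

THRUST_LEVELS = [0.0, 0.5, 1.0]                       # assumed normalised levels
ACTIONS = np.array(list(itertools.product(THRUST_LEVELS, repeat=3)))  # 27 actions

rng = np.random.default_rng(0)
state_dim, n_actions = 7, len(ACTIONS)                # e.g. position, velocity, mass
W = np.zeros((n_actions, state_dim))                  # linear Q-function weights

def epsilon_greedy(q_values, epsilon):
    """Pick a discrete action index; explore with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def q_update(s, a, r, s_next, done, alpha=1e-3, gamma=0.99):
    """One Q-learning step with linear function approximation."""
    target = r + (0.0 if done else gamma * np.max(W @ s_next))
    td_error = target - W[a] @ s
    W[a] += alpha * td_error * s

# One illustrative interaction step with stand-in data.
s = rng.standard_normal(state_dim)
a = epsilon_greedy(W @ s, epsilon=0.1)
q_update(s, a, r=-1.0, s_next=rng.standard_normal(state_dim), done=False)
```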

2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Baolai Wang ◽  
Shengang Li ◽  
Xianzhong Gao ◽  
Tao Xie

With the development of unmanned aerial vehicle (UAV) technology, UAV swarm confrontation has attracted many researchers' attention. However, the situation faced by a UAV swarm has substantial uncertainty and dynamic variability. The state space and action space increase exponentially with the number of UAVs, so autonomous decision-making becomes a difficult problem in the confrontation environment. In this paper, a multiagent reinforcement learning method with macro actions and human expertise is proposed for autonomous decision-making of UAVs. In the proposed approach, the UAV swarm is modeled as a large multiagent system (MAS) with each individual UAV as an agent, and the sequential decision-making problem in swarm confrontation is modeled as a Markov decision process. Agents are trained on macro actions, which effectively mitigates the problems of sparse and delayed rewards and the large state and action spaces. The key to the success of this method is the generation of macro actions that allow the high-level policy to find a near-optimal solution; here we leverage human expertise to design a set of good macro actions. Extensive empirical experiments in our constructed swarm confrontation environment show that our method outperforms the other algorithms tested.
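The following sketch illustrates the macro-action idea in the abstract: a hand-designed macro expands into a fixed sequence of primitive actions, so the high-level policy decides less frequently and rewards accumulate over the whole macro. The macro names, primitive actions, and stub environment are all hypothetical, not the paper's implementation.

```python
import random

class SwarmEnvStub:
    """Toy stand-in for the confrontation environment (illustrative only)."""
    def step(self, agent_id, primitive_action):
        reward = 1.0 if primitive_action == "fire" else 0.0   # pretend hit reward
        done = random.random() < 0.05                         # random episode end
        return None, reward, done, {}

# Hand-designed macros: each expands into a fixed sequence of primitives.
MACRO_ACTIONS = {
    "pursue": ["turn_toward_enemy", "accelerate", "accelerate"],
    "evade":  ["turn_away", "accelerate", "accelerate"],
    "attack": ["turn_toward_enemy", "fire"],
}

def run_macro(env, agent_id, macro_name):
    """Execute one macro as its primitive sequence; return the summed reward."""
    total_reward, done = 0.0, False
    for primitive in MACRO_ACTIONS[macro_name]:
        _, reward, done, _ = env.step(agent_id, primitive)
        total_reward += reward
        if done:
            break
    return total_reward, done

env = SwarmEnvStub()
print(run_macro(env, agent_id=0, macro_name="attack"))
```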


2019 ◽  
Vol 2019 ◽  
pp. 1-8
Author(s):  
Xi-liang Chen ◽  
Lei Cao ◽  
Zhi-xiong Xu ◽  
Jun Lai ◽  
Chen-xi Li

The assumption of inverse reinforcement learning (IRL) is that demonstrations come from an agent acting optimally in an environment. In the past, most work on IRL required computing optimal policies for different reward functions, a requirement that is difficult to satisfy in tasks with large or continuous state spaces, let alone continuous action spaces. We propose a continuous maximum entropy deep inverse reinforcement learning algorithm for continuous state and action spaces, which builds a deep model of the environment by reconstructing the reward function from demonstrations, and adds a hot-start mechanism based on demonstrations to make training faster and more stable. We compare this new approach to well-known baselines including Maximum Entropy IRL, DDPG, and hot-start DDPG. Empirical results on the classical OpenAI Gym control environment MountainCarContinuous-v0 show that our approach learns policies faster and achieves better performance.
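A minimal sketch of the 'hot start' mechanism, assuming it amounts to behaviour-cloning the policy on demonstrations before RL begins (the paper's exact procedure may differ). The network sizes and MountainCarContinuous-like tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

# Small policy network: state (position, velocity) -> action in [-1, 1].
policy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)

def hot_start(demo_states, demo_actions, epochs=50):
    """Regress the policy onto (state, action) pairs from demonstrations."""
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(policy(demo_states), demo_actions)
        loss.backward()
        optimiser.step()

# Random stand-in demonstrations (a real run would load expert data).
states = torch.randn(128, 2)
actions = torch.rand(128, 1) * 2 - 1
hot_start(states, actions)   # RL (e.g. DDPG) would continue from this policy
```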


Author(s):  
Aditya M. Deshpande ◽  
Rumit Kumar ◽  
Ali A. Minai ◽  
Manish Kumar

Abstract
In this paper, we present a novel developmental reinforcement learning-based controller for a quadcopter with thrust vectoring capabilities. This multirotor UAV design has tilt-enabled rotors and uses the rotor force magnitude and direction to achieve the desired state during flight. The control policy of this robot is learned by transferring the policy from the learned controller of a quadcopter (a comparatively simple UAV design without thrust vectoring). This approach allows learning a control policy for systems with multiple inputs and multiple outputs. The performance of the learned policy is evaluated in physics-based simulations on hovering and way-point navigation tasks. The flight simulations use a flight controller based on reinforcement learning without any additional PID components. The results show faster learning with the presented approach compared to learning the control policy from scratch for this new UAV design, which is created by modifying a conventional quadcopter, i.e., adding more degrees of freedom (from 4 actuators in the conventional quadcopter to 8 actuators in the tilt-rotor quadcopter). We demonstrate the robustness of the learned policy by showing in simulation that the tilt-rotor platform recovers from various non-static initial conditions to reach a desired state. The developmental policy for the tilt-rotor UAV also showed superior fault tolerance compared with a policy learned from scratch. The results show the ability of the presented approach to bootstrap the learned behavior from a simpler system (lower-dimensional action space) to a more complex robot (comparatively higher-dimensional action space) and to reach better performance faster.
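The sketch below illustrates one plausible form of the policy transfer described here: a tilt-rotor policy head (8 actuators) is initialised from a trained quadcopter head (4 actuators), with the new outputs started near neutral. The architecture and layer sizes are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def make_policy(n_actions):
    # Assumed observation size of 18 (pose, velocities, goal); illustrative only.
    return nn.Sequential(nn.Linear(18, 128), nn.ReLU(), nn.Linear(128, n_actions))

quad_policy = make_policy(4)   # stands in for the trained quadcopter policy
tilt_policy = make_policy(8)   # tilt-rotor: 4 thrusts + 4 tilt angles

with torch.no_grad():
    # The shared trunk transfers directly.
    tilt_policy[0].weight.copy_(quad_policy[0].weight)
    tilt_policy[0].bias.copy_(quad_policy[0].bias)
    # The first four outputs (rotor thrusts) reuse the learned mapping...
    tilt_policy[2].weight[:4].copy_(quad_policy[2].weight)
    tilt_policy[2].bias[:4].copy_(quad_policy[2].bias)
    # ...while the four new tilt actuators start near neutral.
    tilt_policy[2].weight[4:].mul_(0.01)
    tilt_policy[2].bias[4:].zero_()
```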


Author(s):  
Yuntao Han ◽  
Qibin Zhou ◽  
Fuqing Duan

Abstract
The digital curling game is a two-player zero-sum extensive game in a continuous action space. Several challenging problems remain unsolved, such as the uncertainty of strategy, searching the large game tree, and the reliance on large amounts of supervised data. In this work, we combine NFSP and KR-UCT for digital curling games, where NFSP uses two adversarial learning networks and can automatically produce supervised data, and KR-UCT can search the large game tree in a continuous action space. We propose two reward mechanisms to make reinforcement learning converge quickly. Experimental results validate the proposed method and show that the strategy model can approach a Nash equilibrium.
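As a hedged illustration of the kernel-regression value estimate underlying KR-UCT, the sketch below smooths the values of previously tried shots with a Gaussian kernel and adds a UCB-style exploration bonus. The kernel bandwidth and exploration constant are assumed, not taken from the paper.

```python
import numpy as np

def kr_ucb_score(candidate, tried_actions, tried_values, tried_counts,
                 bandwidth=0.1, c=1.0):
    """Kernel-regressed value of a candidate shot plus a UCB exploration bonus."""
    d = np.linalg.norm(tried_actions - candidate, axis=1)
    w = np.exp(-(d / bandwidth) ** 2)                 # Gaussian kernel weights
    value = np.sum(w * tried_values) / (np.sum(w) + 1e-8)
    effective_n = np.sum(w * tried_counts)            # kernel-weighted visit count
    bonus = c * np.sqrt(np.log(np.sum(tried_counts) + 1) / (effective_n + 1e-8))
    return value + bonus

# Score a new shot (angle, speed) against three previously sampled shots.
A = np.array([[0.10, 2.0], [0.20, 2.2], [0.00, 1.8]])
v = np.array([0.5, 0.7, 0.2])
n = np.array([10, 4, 6])
print(kr_ucb_score(np.array([0.15, 2.1]), A, v, n))
```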


2021 ◽  
Vol 3 (6) ◽  
Author(s):  
Ogbonnaya Anicho ◽  
Philip B. Charlesworth ◽  
Gurvinder S. Baicher ◽  
Atulya K. Nagar

Abstract
This work analyses the performance of Reinforcement Learning (RL) versus Swarm Intelligence (SI) for coordinating multiple unmanned High Altitude Platform Stations (HAPS) for communications area coverage. It builds upon previous work which examined various elements of both algorithms. The main aim of this paper is to address the continuous state-space challenge by partitioning the state space to manage the high-dimensionality problem. This enables a comparison of the classical cases of both RL and SI, establishing a baseline for future comparisons of improved versions. In previous work, SI was observed to perform better across various key performance indicators. However, even after tuning parameters and empirically choosing a suitable partitioning ratio for the RL state space, the SI algorithm maintained superior coordination capability, achieving higher mean overall user coverage (about 20% better than the RL algorithm) in addition to faster convergence rates. Though the RL technique showed better average peak user coverage, its unpredictable coverage dips were a key weakness, making SI the more suitable algorithm within the context of this work.
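A minimal sketch of the partitioning idea, assuming each continuous state dimension is split into a fixed number of bins so that a tabular method applies. The dimensions, ranges, and bin count below are illustrative, not the values used in the study.

```python
import numpy as np

BINS_PER_DIM = 10                             # the 'partitioning ratio'
LOW  = np.array([0.0, 0.0])                   # assumed coverage-area bounds (km)
HIGH = np.array([50.0, 50.0])

def discretise(state):
    """Map a continuous state to a tuple of bin indices usable as a table key."""
    scaled = (state - LOW) / (HIGH - LOW)
    idx = np.clip((scaled * BINS_PER_DIM).astype(int), 0, BINS_PER_DIM - 1)
    return tuple(int(i) for i in idx)

q_table = {}                                  # (bin tuple, action) -> value
print(discretise(np.array([12.3, 40.7])))     # (2, 8)
```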


Aerospace ◽  
2021 ◽  
Vol 8 (4) ◽  
pp. 113
Author(s):  
Pedro Andrade ◽  
Catarina Silva ◽  
Bernardete Ribeiro ◽  
Bruno F. Santos

This paper presents a Reinforcement Learning (RL) approach to optimize the long-term scheduling of maintenance for an aircraft fleet. The problem considers fleet status, maintenance capacity, and other maintenance constraints to schedule hangar checks over a specified time horizon. The checks are scheduled within an interval, and the goal is to schedule them as close as possible to their due dates. In doing so, the number of checks is reduced and fleet availability increases. A Deep Q-learning algorithm is used to optimize the scheduling policy. The model is validated in a real scenario using maintenance data from 45 aircraft. The maintenance plan generated with our approach is compared with a previous study, which presented a Dynamic Programming (DP) based approach, and with airline estimations for the same period. The results show a reduction in the number of checks scheduled, which indicates the potential of RL for solving this problem. The adaptability of RL is also tested by introducing small disturbances in the initial conditions. After training the model on these simulated scenarios, the results show the robustness of the RL approach and its ability to generate efficient maintenance plans in only a few seconds.
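The sketch below shows one plausible reward shaping for this scheduling objective: checks scheduled after the due date incur a hard penalty, while early scheduling incurs a mild cost proportional to the wasted interval. This is an assumed formulation for illustration, not the paper's exact reward.

```python
def schedule_reward(scheduled_day, due_day, overdue_penalty=10.0, slack_cost=0.1):
    """Reward a check close to, but never after, its due date."""
    if scheduled_day > due_day:
        return -overdue_penalty                       # overdue: hard penalty
    return -slack_cost * (due_day - scheduled_day)    # mild cost for going early

print(schedule_reward(95, 100))    # -0.5: slightly early
print(schedule_reward(101, 100))   # -10.0: overdue
```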


Sensors ◽  
2020 ◽  
Vol 20 (10) ◽  
pp. 2789 ◽  
Author(s):  
Hang Qi ◽  
Hao Huang ◽  
Zhiqun Hu ◽  
Xiangming Wen ◽  
Zhaoming Lu

In order to meet the ever-increasing traffic demand of Wireless Local Area Networks (WLANs), channel bonding is introduced in the IEEE 802.11 standards. Although channel bonding effectively increases the transmission rate, the wider channel reduces the number of non-overlapping channels and is more susceptible to interference. Meanwhile, the traffic load differs from one access point (AP) to another and changes significantly depending on the time of day. Therefore, the primary channel and channel bonding bandwidth should be carefully selected to meet traffic demand and guarantee the performance gain. In this paper, we propose an On-Demand Channel Bonding (O-DCB) algorithm based on Deep Reinforcement Learning (DRL) for heterogeneous WLANs, where the APs have different channel bonding capabilities, to reduce transmission delay. In this problem the state space is continuous and the action space is discrete, but with single-agent DRL the size of the action space increases exponentially with the number of APs, which severely slows learning. To accelerate learning, Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is used to train O-DCB. Real traffic traces collected from a campus WLAN are used to train and test O-DCB. Simulation results reveal that the proposed algorithm converges well and achieves lower delay than other algorithms.
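The following sketch illustrates why the joint action space motivates a multi-agent decomposition: with an assumed per-AP action set of (primary channel, bonding bandwidth) pairs, a single-agent controller faces |A|^N joint actions, while MADDPG-style training keeps |A| actions per agent. The channel and bandwidth values are assumptions for a 5 GHz WLAN.

```python
import itertools

PRIMARY_CHANNELS = [36, 40, 44, 48]           # assumed 5 GHz primaries
BANDWIDTHS_MHZ = [20, 40, 80]                 # assumed bonding widths
PER_AP_ACTIONS = list(itertools.product(PRIMARY_CHANNELS, BANDWIDTHS_MHZ))

n_aps = 5
joint_actions = len(PER_AP_ACTIONS) ** n_aps  # single-agent DRL faces all of these
per_agent_actions = len(PER_AP_ACTIONS)       # MADDPG: each AP handles only its own
print(joint_actions, per_agent_actions)       # 248832 vs 12
```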


Author(s):  
Qingyuan Zheng ◽  
Duo Wang ◽  
Zhang Chen ◽  
Yiyong Sun ◽  
Bin Liang

Single-track two-wheeled robots have become an important research topic in recent years owing to their simple structure, energy savings, and ability to run on narrow roads. However, the ramp jump remains a challenging task. In this study, we propose a method to realize the ramp jump with a single-track two-wheeled robot. We present a control method that employs continuous-action reinforcement learning techniques for single-track two-wheeled robot control. We design a novel reward function for reinforcement learning, optimize the dimensions of the action space, and train the policy with the deep deterministic policy gradient algorithm. Finally, we validate the control method through simulation experiments and successfully realize the single-track two-wheeled robot ramp jump task. Simulation results confirm that the control method is effective and has several advantages over high-dimensional action-space control, reinforcement learning with a sparse reward function, and discrete-action reinforcement learning control.
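As a hedged illustration of the reward design question raised above, the sketch below combines dense posture and speed terms with a terminal success bonus, one plausible shape for a ramp-jump reward; all terms and weights are assumptions, not the authors' function.

```python
def jump_reward(roll, pitch, forward_speed, target_speed, landed_upright):
    """Dense posture/speed shaping plus a terminal success bonus."""
    posture = -abs(roll) - abs(pitch)             # stay balanced in the air
    speed = -abs(forward_speed - target_speed)    # track the take-off speed
    bonus = 100.0 if landed_upright else 0.0      # success at landing
    return posture + speed + bonus

print(jump_reward(roll=0.05, pitch=0.10, forward_speed=3.8,
                  target_speed=4.0, landed_upright=True))   # 99.65
```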

