Developmental Reinforcement Learning of Control Policy of a Quadcopter UAV With Thrust Vectoring Rotors

Author(s):  
Aditya M. Deshpande ◽  
Rumit Kumar ◽  
Ali A. Minai ◽  
Manish Kumar

Abstract In this paper, we present a novel developmental reinforcement-learning-based controller for a quadcopter with thrust vectoring capabilities. This multirotor UAV design has tilt-enabled rotors and uses both the magnitude and the direction of the rotor forces to achieve the desired state during flight. The control policy for this robot is learned via policy transfer from the learned controller of a conventional quadcopter (a comparatively simple UAV design without thrust vectoring). This approach allows learning a control policy for systems with multiple inputs and multiple outputs. The performance of the learned policy is evaluated in physics-based simulations for the tasks of hovering and way-point navigation. The flight simulations use a flight controller based purely on reinforcement learning, without any additional PID components. The results show faster learning with the presented approach than learning the control policy from scratch for this new UAV design, which is created by modifying a conventional quadcopter, i.e., by adding more degrees of freedom (from 4 actuators in the conventional quadcopter to 8 actuators in the tilt-rotor quadcopter). We demonstrate the robustness of the learned policy by showing in simulation that the tilt-rotor platform recovers from various non-static initial conditions to reach a desired state. The developmental policy for the tilt-rotor UAV also shows superior fault tolerance compared with a policy learned from scratch. The results demonstrate the ability of the presented approach to bootstrap the learned behavior from a simpler system (lower-dimensional action space) to a more complex robot (comparatively higher-dimensional action space) and to reach better performance faster.
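
As a rough illustration of the policy-transfer step described above (not the authors' code), the sketch below grows a trained quadcopter policy's 4-output action head into an 8-output head for the tilt-rotor, copying the learned weights and initializing the new actuator outputs near zero; the layer sizes and the PyTorch framing are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the developmental transfer step (not the authors' code): grow a
# trained quadcopter policy's 4-output action head into an 8-output head for
# the tilt-rotor, reusing the learned weights so training does not restart
# from scratch. Layer sizes and framework are assumptions.
def grow_action_head(old_head: nn.Linear, new_act_dim: int) -> nn.Linear:
    new_head = nn.Linear(old_head.in_features, new_act_dim)
    with torch.no_grad():
        nn.init.uniform_(new_head.weight, -1e-3, 1e-3)  # new actuator outputs start near zero
        nn.init.zeros_(new_head.bias)
        new_head.weight[:old_head.out_features] = old_head.weight  # copy learned rows
        new_head.bias[:old_head.out_features] = old_head.bias
    return new_head

# e.g. keep the quadcopter policy trunk and replace only its action head
quadcopter_head = nn.Linear(64, 4)            # placeholder for the trained 4-actuator head
tiltrotor_head = grow_action_head(quadcopter_head, 8)
```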

2019 ◽  
Vol 141 (12) ◽  
Author(s):  
Gary M. Stump ◽  
Simon W. Miller ◽  
Michael A. Yukish ◽  
Timothy W. Simpson ◽  
Conrad Tucker

Abstract A novel method has been developed to optimize both the form and behavior of complex systems. The method uses spatial grammars embodied in character-recurrent neural networks (char-RNNs) to define the system, including actuator numbers and degrees of freedom; reinforcement learning to optimize actuator behavior; and physics-based simulation to determine performance and provide (re)training data for the char-RNN. Compared to parametric design optimization with a fixed number of inputs, using grammars and char-RNNs allows for a more complex, combinatorially infinite design space. In the proposed method, the char-RNN is first trained to learn a spatial grammar that defines the assembly layout, component geometries, material properties, and arbitrary numbers and degrees of freedom of actuators. Next, generated designs are evaluated in a physics-based environment, with an inner optimization loop using reinforcement learning to determine the best control policy for the actuators. The resulting design is thus optimized for both form and behavior, generated by a char-RNN embodying a high-performing grammar. Two evaluative case studies are presented using the design of a modular sailing craft. The first case study optimizes the design without actuated surfaces, allowing the char-RNN to understand the semantics of high-performing designs. The second case study extends the first by incorporating controllable actuators, requiring an inner-loop behavioral optimization. The implications of the results are discussed along with ongoing and future work.
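
A schematic sketch of the nested form-and-behavior optimization loop is given below. All callables (char_rnn.sample, char_rnn.retrain, build_simulation, train_rl_controller, evaluate) are hypothetical placeholders standing in for the paper's components, not its actual API.

```python
# Schematic sketch of the nested loop (hypothetical placeholders, not the paper's API).
def outer_design_loop(char_rnn, build_simulation, train_rl_controller,
                      evaluate, n_rounds=10, n_designs=32):
    best = None
    for _ in range(n_rounds):
        designs = [char_rnn.sample() for _ in range(n_designs)]   # grammar-generated layouts
        scored = []
        for design in designs:
            env = build_simulation(design)          # physics-based evaluation environment
            policy = train_rl_controller(env)       # inner loop: optimize actuator behavior
            scored.append((evaluate(env, policy), design))
        scored.sort(key=lambda s: s[0], reverse=True)
        if best is None or scored[0][0] > best[0]:
            best = scored[0]
        char_rnn.retrain([d for _, d in scored[:n_designs // 4]])  # (re)train on high performers
    return best                                     # (score, design) of the best layout found
```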


Author(s):  
M. A. Bucci ◽  
O. Semeraro ◽  
A. Allauzen ◽  
G. Wisniewski ◽  
L. Cordier ◽  
...  

Deep reinforcement learning (DRL) is applied to control a nonlinear, chaotic system governed by the one-dimensional Kuramoto–Sivashinsky (KS) equation. DRL uses reinforcement learning principles for the determination of optimal control solutions and deep neural networks for approximating the value function and the control policy. Recent applications have shown that DRL may achieve superhuman performance in complex cognitive tasks. In this work, we show that, using restricted localized actuation, partial knowledge of the state based on limited sensor measurements, and model-free DRL controllers, it is possible to stabilize the dynamics of the KS system around its unstable fixed solutions, here considered as target states. The robustness of the controllers is tested by considering several trajectories in the phase space emanating from different initial conditions; we show that DRL is always capable of driving and stabilizing the dynamics around the target states. The possibility of controlling the KS system in the chaotic regime by using a DRL strategy relying solely on local measurements suggests extending the application of RL methods to the control of more complex systems, such as drag reduction in bluff-body wakes or the enhancement/diminution of turbulent mixing.
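
For concreteness, the minimal sketch below sets up a Kuramoto–Sivashinsky integrator with localized Gaussian actuation and a handful of fixed point sensors, i.e., the kind of partially observed environment a model-free DRL agent would interact with. The domain length, grid size, actuation width, and sensor count are assumptions, not the paper's settings.

```python
import numpy as np

# Minimal sketch (not the paper's code): Kuramoto–Sivashinsky dynamics with
# localized Gaussian actuation and a few fixed sensors. Domain length, grid
# size, time step, actuation width, and sensor count are assumptions.
L_dom, N, dt = 22.0, 64, 0.05
x = np.linspace(0.0, L_dom, N, endpoint=False)
k = 2.0*np.pi*np.fft.fftfreq(N, d=L_dom/N)
lin = k**2 - k**4                        # linear operator of u_t = -u*u_x - u_xx - u_xxxx

def step(u, actions, centers, width=1.0):
    """Advance one semi-implicit Euler step with localized Gaussian forcing."""
    forcing = np.zeros(N)
    for a, c in zip(actions, centers):
        forcing += a*np.exp(-0.5*((x - c)/width)**2)
    nonlin = -0.5j*k*np.fft.fft(u*u) + np.fft.fft(forcing)
    u_hat = (np.fft.fft(u) + dt*nonlin)/(1.0 - dt*lin)
    return np.real(np.fft.ifft(u_hat))

# partial observation: the DRL agent only sees a handful of point sensors
sensor_idx = np.arange(0, N, N // 8)     # 8 equispaced sensors (assumption)
u = 0.1*np.cos(2.0*np.pi*x/L_dom)
u = step(u, actions=[0.5, -0.5], centers=[L_dom/4, 3*L_dom/4])
obs = u[sensor_idx]
```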


Author(s):  
Callum Wilson ◽  
Annalisa Riccardi

Abstract Reinforcement learning entails many intuitive and useful approaches to solving various problems. Its main premise is to learn how to complete tasks by interacting with the environment and observing which actions are better with respect to a reward signal. Methods from reinforcement learning have long been applied in aerospace and have more recently seen renewed interest in space applications. Problems in spacecraft control can benefit from the use of intelligent techniques when faced with significant uncertainties—as is common for space environments. Solving these control problems using reinforcement learning remains a challenge, partly due to long training times and the sensitivity of performance to hyperparameters, which require careful tuning. In this work we seek to address both issues for a sample spacecraft control problem. To reduce training times compared to other approaches, we simplify the problem by discretising the action space and use a data-efficient algorithm to train the agent. Furthermore, we employ an automated approach to hyperparameter selection which optimises for a specified performance metric. Our approach is tested on a 3-DOF powered descent problem with uncertainties in the initial conditions. We run experiments with two different problem formulations—using a ‘shaped’ state representation to guide the agent and also a ‘raw’ state representation with unprocessed values of position, velocity and mass. The results show that an agent can learn a near-optimal policy efficiently by appropriately defining the action space and state space. Using the raw state representation led to ‘reward-hacking’ and poor performance, which highlights the importance of the problem and state-space formulation in successfully training reinforcement learning agents. In addition, we show that the optimal hyperparameters can vary significantly based on the choice of loss function. Using two sets of hyperparameters optimised for different loss functions, we demonstrate that in both cases the agent can find near-optimal policies with comparable performance to previously applied methods.
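
The sketch below illustrates the kind of 3-DOF point-mass powered-descent environment with a discretised thrust set and a 'raw' observation that the abstract describes; all constants (gravity, Isp, thrust levels, time step, initial state) are assumptions rather than the paper's values.

```python
import numpy as np

# Illustrative sketch only: a 3-DOF point-mass powered-descent step with a
# discretised thrust set and a 'raw' observation. All constants are assumptions.
g = np.array([0.0, 0.0, -3.71])          # gravity, m/s^2 (assumed Mars-like)
isp, g0, dt = 225.0, 9.81, 0.1           # assumed engine parameters and time step

h_levels = [-4000.0, 0.0, 4000.0]        # horizontal thrust options, N (assumed)
v_levels = [0.0, 4000.0, 8000.0]         # vertical thrust options, N (assumed)
actions = [np.array([tx, ty, tz])
           for tx in h_levels for ty in h_levels for tz in v_levels]

def step(state, action_idx):
    """Euler-integrate position r, velocity v, and mass m for one step."""
    r, v, m = state
    thrust = actions[action_idx]
    accel = g + thrust/m
    r, v = r + v*dt, v + accel*dt
    m = m - np.linalg.norm(thrust)/(isp*g0)*dt   # propellant consumption
    return (r, v, m)

def raw_obs(state):
    r, v, m = state
    return np.concatenate([r, v, [m]])           # unprocessed position, velocity, mass

state = (np.array([500.0, -300.0, 2000.0]), np.array([-20.0, 10.0, -75.0]), 1800.0)
state = step(state, action_idx=len(actions)//2)
```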


2013 ◽  
Vol 16 (02n03) ◽  
pp. 1350015 ◽  
Author(s):  
PORAMATE MANOONPONG ◽  
CHRISTOPH KOLODZIEJSKI ◽  
FLORENTIN WÖRGÖTTER ◽  
JUN MORIMOTO

Classical conditioning (conventionally modeled as correlation-based learning) and operant conditioning (conventionally modeled as reinforcement learning or reward-based learning) have been found in biological systems. Evidence shows that these two mechanisms strongly involve learning about associations. Based on these biological findings, we propose a new learning model to achieve successful control policies for artificial systems. This model combines correlation-based learning using input correlation learning (ICO learning) and reward-based learning using continuous actor–critic reinforcement learning (RL), thereby working as a dual-learner system. The model performance is evaluated by simulations of a cart-pole system as a dynamic motion control problem and a mobile robot system as a goal-directed behavior control problem. The results show that the model can strongly improve the pole-balancing control policy, i.e., it allows the controller to learn to stabilize the pole over a larger domain of initial conditions than is achieved with either single learning mechanism. The model can also find a successful control policy for goal-directed behavior, i.e., the robot learns to approach a given goal more effectively than with either of its individual components. Thus, the study pursued here sharpens our understanding of how two different learning mechanisms can be combined and complement each other for solving complex tasks.
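
A minimal sketch of the dual-learner idea is given below: an ICO-style input-correlation weight update blended with an actor-critic output into a single control signal. The gains, signal definitions, and blending weights are assumptions.

```python
import numpy as np

# Minimal sketch of the dual-learner idea: an ICO (input-correlation) weight
# update blended with an actor-critic output. Gains and signals are assumptions.
mu = 0.01                                # ICO learning rate (assumed)

def ico_update(w, x_pred, x0, x0_prev, dt=0.01):
    """ICO rule: predictive weights change in proportion to the correlation
    between the predictive inputs and the derivative of the reflex signal x0."""
    dx0 = (x0 - x0_prev)/dt
    return w + mu*x_pred*dx0

def combined_action(w, x_pred, x0, actor_out, k_ico=0.5, k_rl=0.5):
    """Blend the correlation-based controller with the actor-critic output."""
    ico_out = float(np.dot(w, x_pred)) + x0      # anticipatory term plus reflex
    return k_ico*ico_out + k_rl*actor_out

w = np.zeros(3)                                  # three predictive inputs (assumed)
w = ico_update(w, x_pred=np.array([0.2, 0.0, 0.1]), x0=0.05, x0_prev=0.0)
u = combined_action(w, np.array([0.2, 0.0, 0.1]), x0=0.05, actor_out=0.3)
```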


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Florian Fischer ◽  
Miroslav Bachinski ◽  
Markus Klar ◽  
Arthur Fleig ◽  
Jörg Müller

Abstract Among the infinite number of possible movements that can be produced, humans are commonly assumed to choose those that optimize criteria such as minimizing movement time, subject to certain movement constraints like signal-dependent and constant motor noise. While so far these assumptions have only been evaluated for simplified point-mass or planar models, we address the question of whether they can predict reaching movements in a full skeletal model of the human upper extremity. We learn a control policy using a motor babbling approach as implemented in reinforcement learning, using aimed movements of the tip of the right index finger towards randomly placed 3D targets of varying size. We use a state-of-the-art biomechanical model, which includes seven actuated degrees of freedom. To deal with the curse of dimensionality, we use a simplified second-order muscle model acting at each degree of freedom instead of individual muscles. The results confirm that the assumptions of signal-dependent and constant motor noise, together with the objective of movement time minimization, are sufficient for a state-of-the-art skeletal model of the human upper extremity to reproduce complex phenomena of human movement, in particular Fitts’ Law and the 2/3 Power Law. This result supports the notion that control of the complex human biomechanical system can plausibly be determined by a set of simple assumptions and can easily be learned.
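
The sketch below illustrates one plausible form of the simplified second-order actuation per degree of freedom with signal-dependent and constant motor noise mentioned above; the time constants and noise scales are assumptions, not the values used in the paper.

```python
import numpy as np

# Sketch only: a second-order actuation model per degree of freedom with
# signal-dependent and constant motor noise. Time constants and noise scales
# are assumptions, not the paper's values.
dt, tau1, tau2 = 0.002, 0.03, 0.04
sigma_sdn, sigma_const = 0.1, 0.01

def actuation_step(act, dact, control, rng=np.random.default_rng()):
    """Low-pass filter the noisy control twice: tau1*tau2*a'' + (tau1+tau2)*a' + a = u."""
    noisy = control*(1.0 + sigma_sdn*rng.standard_normal(control.shape)) \
            + sigma_const*rng.standard_normal(control.shape)
    ddact = (noisy - (tau1 + tau2)*dact - act)/(tau1*tau2)
    dact = dact + ddact*dt
    act = act + dact*dt
    return act, dact

# seven actuated degrees of freedom, as in the biomechanical model
act, dact = np.zeros(7), np.zeros(7)
act, dact = actuation_step(act, dact, control=np.full(7, 0.3))
```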


Author(s):  
Yuntao Han ◽  
Qibin Zhou ◽  
Fuqing Duan

Abstract The digital curling game is a two-player zero-sum extensive game in a continuous action space. Several challenging problems remain unsolved, such as the uncertainty of strategy, searching the large game tree, and the reliance on large amounts of supervised data. In this work, we combine NFSP and KR-UCT for digital curling games, where NFSP uses two adversarial learning networks and can automatically produce supervised data, and KR-UCT can be used for searching the large game tree in a continuous action space. We propose two reward mechanisms to make reinforcement learning converge quickly. Experimental results validate the proposed method and show that the strategy model can reach a Nash equilibrium.
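
As a rough illustration of the KR-UCT ingredient, the sketch below scores candidate shots in a continuous action space by kernel-regression value estimates plus an exploration bonus; the kernel, bandwidth, and exploration constant are simplifying assumptions rather than the authors' implementation.

```python
import numpy as np

# Rough sketch of the kernel-regression UCB idea behind KR-UCT over a
# continuous 2-D shot space (simplified, not the authors' implementation).
def kr_ucb_select(tried, values, counts, candidates, h=0.3, c=1.0):
    """Score candidates by kernel-smoothed value plus an exploration bonus."""
    tried = np.asarray(tried, dtype=float)
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    total = counts.sum()
    scores = []
    for a in candidates:
        w = np.exp(-np.sum((tried - a)**2, axis=1)/(2.0*h**2))  # Gaussian kernel weights
        n_eff = w @ counts
        v_hat = (w @ (counts*values))/max(n_eff, 1e-8)
        scores.append(v_hat + c*np.sqrt(np.log(total + 1.0)/(n_eff + 1e-8)))
    return candidates[int(np.argmax(scores))]

# e.g. three shots already simulated, four candidate shots to score
best = kr_ucb_select(tried=[[0.1, 2.0], [0.3, 1.8], [-0.2, 2.2]],
                     values=[0.4, 0.9, 0.1], counts=[3, 5, 2],
                     candidates=[[0.2, 1.9], [0.0, 2.1], [0.4, 1.7], [-0.1, 2.0]])
```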


Aerospace ◽  
2021 ◽  
Vol 8 (4) ◽  
pp. 113
Author(s):  
Pedro Andrade ◽  
Catarina Silva ◽  
Bernardete Ribeiro ◽  
Bruno F. Santos

This paper presents a Reinforcement Learning (RL) approach to optimize the long-term scheduling of maintenance for an aircraft fleet. The problem considers fleet status, maintenance capacity, and other maintenance constraints to schedule hangar checks for a specified time horizon. The checks are scheduled within an interval, and the goal is to schedule them as close as possible to their due date. In doing so, the number of checks is reduced and the fleet availability increases. A Deep Q-learning algorithm is used to optimize the scheduling policy. The model is validated in a real scenario using maintenance data from 45 aircraft. The maintenance plan generated with our approach is compared with a previous study, which presented a Dynamic Programming (DP) based approach, and with airline estimations for the same period. The results show a reduction in the number of checks scheduled, which indicates the potential of RL in solving this problem. The adaptability of RL is also tested by introducing small disturbances in the initial conditions. After training the model with these simulated scenarios, the results show the robustness of the RL approach and its ability to generate efficient maintenance plans in only a few seconds.
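
The sketch below shows one plausible per-check reward shaping for such a scheduler, rewarding hangar checks placed close to (but not after) their due dates and penalizing capacity violations; it is an illustrative assumption, not the reward function used in the paper.

```python
# Illustrative sketch: one plausible per-check reward for the scheduler,
# not the reward function used in the paper.
def check_reward(scheduled_day, due_day, hangar_slots_free):
    """Reward scheduling a hangar check close to, but not after, its due date."""
    if hangar_slots_free <= 0:
        return -10.0                      # maintenance-capacity constraint violated
    slack = due_day - scheduled_day       # days the check is brought forward
    if slack < 0:
        return -10.0                      # scheduled past the due date
    return 1.0 - 0.1*slack                # smaller slack -> fewer wasted flight hours

# e.g. a check due on day 120, scheduled on day 114 with one slot free
r = check_reward(scheduled_day=114, due_day=120, hangar_slots_free=1)
```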


Sensors ◽  
2020 ◽  
Vol 20 (10) ◽  
pp. 2789 ◽  
Author(s):  
Hang Qi ◽  
Hao Huang ◽  
Zhiqun Hu ◽  
Xiangming Wen ◽  
Zhaoming Lu

In order to meet the ever-increasing traffic demand of Wireless Local Area Networks (WLANs), channel bonding is introduced in the IEEE 802.11 standards. Although channel bonding effectively increases the transmission rate, the wider channel reduces the number of non-overlapping channels and is more susceptible to interference. Meanwhile, the traffic load differs from one access point (AP) to another and changes significantly depending on the time of day. Therefore, the primary channel and the channel bonding bandwidth should be carefully selected to meet traffic demand and guarantee the performance gain. In this paper, we propose an On-Demand Channel Bonding (O-DCB) algorithm based on Deep Reinforcement Learning (DRL) for heterogeneous WLANs to reduce transmission delay, where the APs have different channel bonding capabilities. In this problem, the state space is continuous and the action space is discrete. However, the size of the action space increases exponentially with the number of APs when using single-agent DRL, which severely slows learning. To accelerate learning, Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is used to train O-DCB. Real traffic traces collected from a campus WLAN are used to train and test O-DCB. Simulation results reveal that the proposed algorithm converges well and achieves lower delay than other algorithms.
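
The sketch below captures the MADDPG structure used to train O-DCB: each AP has its own actor that maps a local observation to a channel-bonding choice, while a centralised critic scores the joint observation-action during training. The network sizes, AP count, and flattened input layout are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the MADDPG structure used to train O-DCB: per-AP actors, one
# centralised critic over the joint observation-action. Sizes are assumptions.
class Actor(nn.Module):
    def __init__(self, obs_dim, n_bonding_options):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_bonding_options))
    def forward(self, obs):
        # probabilities over discrete (primary channel, bonding width) choices
        return torch.softmax(self.net(obs), dim=-1)

class CentralCritic(nn.Module):
    def __init__(self, n_aps, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_aps*(obs_dim + act_dim), 128),
                                 nn.ReLU(), nn.Linear(128, 1))
    def forward(self, all_obs_flat, all_acts_flat):
        # observations and actions of every AP, flattened and concatenated
        return self.net(torch.cat([all_obs_flat, all_acts_flat], dim=-1))

actors = [Actor(obs_dim=8, n_bonding_options=6) for _ in range(4)]   # 4 APs (assumed)
critic = CentralCritic(n_aps=4, obs_dim=8, act_dim=6)
```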


1995 ◽  
Vol 117 (3) ◽  
pp. 582-588 ◽  
Author(s):  
L. N. Virgin ◽  
T. F. Walsh ◽  
J. D. Knight

This paper describes the results of a study into the dynamic behavior of a magnetic bearing system. The research focuses attention on the influence of nonlinearities on the forced response of a two-degree-of-freedom rotating mass suspended by magnetic bearings and subject to rotating unbalance and feedback control. Geometric coupling between the degrees of freedom leads to a pair of nonlinear ordinary differential equations, which are then solved using both numerical simulation and approximate analytical techniques. The system exhibits a variety of interesting and somewhat unexpected phenomena, including various amplitude-driven bifurcational events, sensitivity to initial conditions, and the complete loss of stability associated with escape from the potential well in which the system can be thought to be oscillating. An approximate criterion to avoid this last possibility is developed based on the concept of limiting the response of the system. The present paper may be considered an extension of an earlier study by the same authors, which described the practical context of the work, free vibration, control aspects, and the derivation of the mathematical model.
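
Since the paper's equations are not reproduced here, the sketch below integrates a generic pair of nonlinearly coupled oscillators with rotating-unbalance forcing, merely to indicate the kind of numerical simulation described; the coupling terms and all coefficients are placeholders, not the paper's model.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Generic sketch (the paper's equations are not reproduced here): two coupled
# nonlinear oscillators with rotating-unbalance forcing, integrated numerically.
# The coupling terms and all coefficients are placeholders.
zeta, alpha, eps, Omega = 0.05, 0.1, 0.2, 1.2

def rhs(t, y):
    x, xd, z, zd = y
    fx = eps*Omega**2*np.cos(Omega*t)                # rotating unbalance, x-component
    fz = eps*Omega**2*np.sin(Omega*t)                # rotating unbalance, z-component
    xdd = -2.0*zeta*xd - x - alpha*x*z**2 + fx       # illustrative geometric coupling
    zdd = -2.0*zeta*zd - z - alpha*z*x**2 + fz
    return [xd, xdd, zd, zdd]

sol = solve_ivp(rhs, (0.0, 200.0), [0.01, 0.0, 0.0, 0.0], max_step=0.05)
x_response, z_response = sol.y[0], sol.y[2]          # forced response histories
```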


2018 ◽  
Vol 15 (143) ◽  
pp. 20170937 ◽  
Author(s):  
Nick Cheney ◽  
Josh Bongard ◽  
Vytas SunSpiral ◽  
Hod Lipson

Evolution sculpts both the body plans and nervous systems of agents together over time. By contrast, in artificial intelligence and robotics, a robot's body plan is usually designed by hand, and control policies are then optimized for that fixed design. The task of simultaneously co-optimizing the morphology and controller of an embodied robot has remained a challenge. In psychology, the theory of embodied cognition posits that behaviour arises from a close coupling between body plan and sensorimotor control, which suggests why co-optimizing these two subsystems is so difficult: most evolutionary changes to morphology tend to adversely impact sensorimotor control, leading to an overall decrease in behavioural performance. Here, we further examine this hypothesis and demonstrate a technique for ‘morphological innovation protection’, which temporarily reduces selection pressure on recently morphologically changed individuals, thus allowing evolution some time to ‘readapt’ to the new morphology with subsequent control policy mutations. We show the potential for this method to avoid local optima and converge to similar highly fit morphologies across widely varying initial conditions, while sustaining fitness improvements further into optimization. While this technique is admittedly only the first of many steps that must be taken to achieve scalable optimization of embodied machines, we hope that theoretical insight into the cause of evolutionary stagnation in current methods will help to enable the automation of robot design and behavioural training—while simultaneously providing a test bed to investigate the theory of embodied cognition.
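
A highly simplified sketch of the 'morphological innovation protection' idea is given below: individuals whose morphology has just changed are shielded from fitness-based culling for a few generations, giving subsequent controller mutations time to readapt. The population handling, mutation operators, and protection length are assumptions, not the authors' implementation.

```python
import random

# Highly simplified sketch of morphological innovation protection: individuals
# whose morphology just changed are shielded from fitness-based culling for a
# few generations. Mutation operators and rates are assumptions.
PROTECTION_STEPS = 5          # generations of protection after a body-plan change (assumed)

def evolve_step(population, mutate_morphology, mutate_controller, evaluate):
    offspring = []
    for parent in population:
        child = dict(parent)
        if random.random() < 0.2:                        # assumed morphology-mutation rate
            child["genome"] = mutate_morphology(parent["genome"])
            child["protected_for"] = PROTECTION_STEPS    # reset the protection timer
        else:
            child["genome"] = mutate_controller(parent["genome"])
            child["protected_for"] = max(parent["protected_for"] - 1, 0)
        child["fitness"] = evaluate(child["genome"])
        offspring.append(child)
    pool = population + offspring
    # when trimming back to the original size, protected individuals survive first
    pool.sort(key=lambda ind: (ind["protected_for"] > 0, ind["fitness"]), reverse=True)
    return pool[:len(population)]
```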

