COMBINING CORRELATION-BASED AND REWARD-BASED LEARNING IN NEURAL CONTROL FOR POLICY IMPROVEMENT

2013 ◽  
Vol 16 (02n03) ◽  
pp. 1350015 ◽  
Author(s):  
PORAMATE MANOONPONG ◽  
CHRISTOPH KOLODZIEJSKI ◽  
FLORENTIN WÖRGÖTTER ◽  
JUN MORIMOTO

Classical conditioning (conventionally modeled as correlation-based learning) and operant conditioning (conventionally modeled as reinforcement learning or reward-based learning) have been found in biological systems. Evidence shows that these two mechanisms strongly involve learning about associations. Based on these biological findings, we propose a new learning model to achieve successful control policies for artificial systems. This model combines correlation-based learning using input correlation learning (ICO learning) and reward-based learning using continuous actor–critic reinforcement learning (RL), thereby working as a dual learner system. The model performance is evaluated by simulations of a cart-pole system as a dynamic motion control problem and a mobile robot system as a goal-directed behavior control problem. Results show that the model can strongly improve the pole balancing control policy, i.e., it allows the controller to learn to stabilize the pole in a larger domain of initial conditions than is obtained with a single learning mechanism. This model can also find a successful control policy for goal-directed behavior, i.e., the robot learns to approach a given goal more effectively than with either of the model's individual components. Thus, the study pursued here sharpens our understanding of how two different learning mechanisms can be combined and complement each other for solving complex tasks.

Author(s):  
M. A. Bucci ◽  
O. Semeraro ◽  
A. Allauzen ◽  
G. Wisniewski ◽  
L. Cordier ◽  
...  

Deep reinforcement learning (DRL) is applied to control a nonlinear, chaotic system governed by the one-dimensional Kuramoto–Sivashinsky (KS) equation. DRL uses reinforcement learning principles for the determination of optimal control solutions and deep neural networks for approximating the value function and the control policy. Recent applications have shown that DRL may achieve superhuman performance in complex cognitive tasks. In this work, we show that using restricted localized actuation, partial knowledge of the state based on limited sensor measurements and model-free DRL controllers, it is possible to stabilize the dynamics of the KS system around its unstable fixed solutions, here considered as target states. The robustness of the controllers is tested by considering several trajectories in the phase space emanating from different initial conditions; we show that DRL is always capable of driving and stabilizing the dynamics around target states. The possibility of controlling the KS system in the chaotic regime by using a DRL strategy solely relying on local measurements suggests the extension of the application of RL methods to the control of more complex systems such as drag reduction in bluff-body wakes or the enhancement/diminution of turbulent mixing.


Author(s):  
Aditya M. Deshpande ◽  
Rumit Kumar ◽  
Ali A. Minai ◽  
Manish Kumar

In this paper, we present a novel developmental reinforcement learning-based controller for a quadcopter with thrust vectoring capabilities. This multirotor UAV design has tilt-enabled rotors. It utilizes the rotor force magnitude and direction to achieve the desired state during flight. The control policy of this robot is learned using policy transfer from the learned controller of the quadcopter (a comparatively simple UAV design without thrust vectoring). This approach allows learning a control policy for systems with multiple inputs and multiple outputs. The performance of the learned policy is evaluated by physics-based simulations for the tasks of hovering and way-point navigation. The flight simulations utilize a flight controller based on reinforcement learning without any additional PID components. The results show faster learning with the presented approach than with learning the control policy from scratch for this new UAV design, which is created by modifying a conventional quadcopter, i.e., by adding more degrees of freedom (from 4 actuators in the conventional quadcopter to 8 actuators in the tilt-rotor quadcopter). We demonstrate the robustness of our learned policy by showing the recovery of the tilt-rotor platform in simulation from various non-static initial conditions in order to reach a desired state. The developmental policy for the tilt-rotor UAV also showed superior fault tolerance when compared with the policy learned from scratch. The results show the ability of the presented approach to bootstrap the learned behavior from a simpler system (lower-dimensional action space) to a more complex robot (comparatively higher-dimensional action space) and reach better performance faster.


Mathematics ◽  
2020 ◽  
Vol 8 (6) ◽  
pp. 1036
Author(s):  
Constantin Udrişte ◽  
Ionel Ţevy

In this paper, we present the mathematical point of view of our research group regarding multi-robot systems evolving in a multi-temporal way. We solve the minimum multi-time volume problem as an optimal control problem for a group of planar micro-robots moving in the same direction at different partial speeds. We are motivated to solve this problem because a similar minimum-time optimal control problem is now in vogue for micro-scale and nano-scale robotic systems. Applying the (weak and strong) multi-time maximum principle, we obtain necessary conditions for optimality, which are used to guess a candidate control policy. The complexity of finding this policy for arbitrary initial conditions is dominated by the computation of a planar convex hull. We develop this idea by applying the technique of the multi-time Hamilton-Jacobi-Bellman PDE. Our results can be extended to consider obstacle avoidance by explicit parameterization of all possible optimal control policies.


Author(s):  
Andrea Pesare ◽  
Michele Palladino ◽  
Maurizio Falcone

In this paper, we deal with a linear quadratic optimal control problem with unknown dynamics. As a modeling assumption, we suppose that the knowledge that an agent has of the current system is represented by a probability distribution $\pi$ on the space of matrices. Furthermore, we assume that this probability measure is suitably updated to take into account the increased experience that the agent obtains while exploring the environment, approximating the underlying dynamics with increasing accuracy. Under these assumptions, we show that the optimal control obtained by solving the “average” linear quadratic optimal control problem with respect to a certain $\pi$ converges to the optimal control of the linear quadratic optimal control problem governed by the actual, underlying dynamics. This approach is closely related to model-based reinforcement learning algorithms where prior and posterior probability distributions describing the knowledge of the uncertain system are recursively updated. In the last section, we show a numerical test that confirms the theoretical results.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
A. Gorin ◽  
V. Klucharev ◽  
A. Ossadtchi ◽  
I. Zubarev ◽  
V. Moiseeva ◽  
...  

People often change their beliefs by succumbing to the opinions of others. Such changes are often referred to as effects of social influence. While some previous studies have focused on the reinforcement learning mechanisms of social influence or on its internalization, others have reported evidence of changes in sensory processing evoked by social influence of peer groups. In this study, we used magnetoencephalographic (MEG) source imaging to further investigate the long-term effects of agreement and disagreement with the peer group. The study was composed of two sessions. During the first session, participants rated the trustworthiness of faces and subsequently learned the group rating of each face. In the first session, a neural marker of an immediate mismatch between individual and group opinions was found in the posterior cingulate cortex, an area involved in conflict-monitoring and reinforcement learning. To identify the neural correlates of the long-lasting effect of the group opinion, we analysed MEG activity while participants rated faces during the second session. We found MEG traces of past disagreement or agreement with the peers at the parietal cortices 230 ms after the face onset. The neural activity of the superior parietal lobule, intraparietal sulcus, and precuneus was significantly stronger when the participant’s rating had previously differed from the ratings of the peers. The early MEG correlates of disagreement with the majority were followed by activity in the orbitofrontal cortex 320 ms after the face onset. Altogether, the results reveal the temporal dynamics of the neural mechanism of long-term effects of disagreement with the peer group: early signatures of modified face processing were followed by later markers of long-term social influence on the valuation process at the ventromedial prefrontal cortex.


Aerospace ◽  
2021 ◽  
Vol 8 (4) ◽  
pp. 113
Author(s):  
Pedro Andrade ◽  
Catarina Silva ◽  
Bernardete Ribeiro ◽  
Bruno F. Santos

This paper presents a Reinforcement Learning (RL) approach to optimize the long-term scheduling of maintenance for an aircraft fleet. The problem considers fleet status, maintenance capacity, and other maintenance constraints to schedule hangar checks for a specified time horizon. The checks are scheduled within an interval, and the goal is to schedule them as close as possible to their due date. In doing so, the number of checks is reduced, and the fleet availability increases. A Deep Q-learning algorithm is used to optimize the scheduling policy. The model is validated in a real scenario using maintenance data from 45 aircraft. The maintenance plan that is generated with our approach is compared with a previous study, which presented a Dynamic Programming (DP) based approach, and with airline estimations for the same period. The results show a reduction in the number of checks scheduled, which indicates the potential of RL in solving this problem. The adaptability of RL is also tested by introducing small disturbances in the initial conditions. After training the model with these simulated scenarios, the results show the robustness of the RL approach and its ability to generate efficient maintenance plans in only a few seconds.


2013 ◽  
Vol 461 ◽  
pp. 565-569 ◽  
Author(s):  
Fang Wang ◽  
Kai Xu ◽  
Qiao Sheng Zhang ◽  
Yi Wen Wang ◽  
Xiao Xiang Zheng

Brain-machine interfaces (BMIs) decode cortical neural spikes of paralyzed patients to control external devices for the purpose of movement restoration. Neuroplasticity induced by performing a relatively complex multistep task helps improve the performance of a BMI system. Reinforcement learning (RL) allows the BMI system to interact with the environment to learn the task adaptively without a teacher signal, which is more appropriate for paralyzed patients. In this work, we propose to apply Q(λ)-learning to multistep goal-directed tasks using the user's neural activity. Neural data were recorded from M1 of a monkey manipulating a joystick in a center-out task. Compared with a supervised learning approach, significant BMI control was achieved with correct directional decoding in 84.2% and 81% of the trials from naïve states. The results demonstrate that the BMI system was able to complete a task by interacting with the environment, indicating that RL-based methods have the potential to develop more natural BMI systems.


2018 ◽  
Vol 15 (143) ◽  
pp. 20170937 ◽  
Author(s):  
Nick Cheney ◽  
Josh Bongard ◽  
Vytas SunSpiral ◽  
Hod Lipson

Evolution sculpts both the body plans and nervous systems of agents together over time. By contrast, in artificial intelligence and robotics, a robot's body plan is usually designed by hand, and control policies are then optimized for that fixed design. The task of simultaneously co-optimizing the morphology and controller of an embodied robot has remained a challenge. In psychology, the theory of embodied cognition posits that behaviour arises from a close coupling between body plan and sensorimotor control, which suggests why co-optimizing these two subsystems is so difficult: most evolutionary changes to morphology tend to adversely impact sensorimotor control, leading to an overall decrease in behavioural performance. Here, we further examine this hypothesis and demonstrate a technique for ‘morphological innovation protection’, which temporarily reduces selection pressure on recently morphologically changed individuals, thus enabling evolution some time to ‘readapt’ to the new morphology with subsequent control policy mutations. We show the potential for this method to avoid local optima and converge to similar highly fit morphologies across widely varying initial conditions, while sustaining fitness improvements further into optimization. While this technique is admittedly only the first of many steps that must be taken to achieve scalable optimization of embodied machines, we hope that theoretical insight into the cause of evolutionary stagnation in current methods will help to enable the automation of robot design and behavioural training—while simultaneously providing a test bed to investigate the theory of embodied cognition.


Author(s):  
Damien Ernst ◽  
Mevludin Glavic ◽  
Pierre Geurts ◽  
Louis Wehenkel

In this paper we explain how to design intelligent agents able to process the information acquired from interaction with a system to learn a good control policy, and we show how the methodology can be applied to control devices intended to damp electrical power oscillations. The control problem is formalized as a discrete-time optimal control problem, and the information acquired from interaction with the system is a set of samples, where each sample is composed of four elements: a state, the action taken while being in this state, the instantaneous reward observed, and the successor state of the system. To process this information we consider reinforcement learning algorithms that determine an approximation of the so-called Q-function by mimicking the behavior of the value iteration algorithm. Simulations are first carried out on a benchmark power system modeled with two state variables. Then we present a more complex case study on a four-machine power system where the reinforcement learning algorithm controls a Thyristor Controlled Series Capacitor (TCSC) used to damp power system oscillations.
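
The idea of approximating the Q-function from a batch of four-element samples by mimicking value iteration can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the authors' algorithm: it uses a plain lookup table instead of a function approximator, and the names `batch_q_iteration`, `greedy_action`, the discount factor `gamma`, and the iteration count are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def batch_q_iteration(samples, actions, gamma=0.9, n_iterations=50):
    """Approximate Q from (state, action, reward, next_state) samples.

    Each sweep recomputes Q(s, a) as the average of
    r + gamma * max_b Q_prev(s', b) over the samples that start in (s, a),
    which mirrors one value-iteration step restricted to observed data.
    Tabular storage is an illustrative assumption.
    """
    q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
    for _ in range(n_iterations):
        targets = defaultdict(list)
        for s, a, r, s_next in samples:
            best_next = max(q[(s_next, b)] for b in actions)
            targets[(s, a)].append(r + gamma * best_next)
        new_q = defaultdict(float)
        for key, values in targets.items():
            new_q[key] = sum(values) / len(values)
        q = new_q
    return q


def greedy_action(q, state, actions):
    """Return the action with the highest estimated Q-value in `state`."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical three-state chain: moving right from state 1 earns reward 1,
# and state 2 is absorbing with zero reward.
ACTIONS = ["left", "right"]
SAMPLES = [
    (0, "left", 0.0, 0),
    (0, "right", 0.0, 1),
    (1, "left", 0.0, 0),
    (1, "right", 1.0, 2),
    (2, "left", 0.0, 2),
    (2, "right", 0.0, 2),
]

if __name__ == "__main__":
    q_table = batch_q_iteration(SAMPLES, ACTIONS)
    for state in (0, 1):
        print(state, greedy_action(q_table, state, ACTIONS))
```

In a setting with continuous states, such as the power-system examples mentioned in the abstract, the table would presumably be replaced by a regression model fitted to the same targets at each sweep.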


2021 ◽  
pp. 2150011
Author(s):  
Wei Dong ◽  
Jianan Wang ◽  
Chunyan Wang ◽  
Zhenqiang Qi ◽  
Zhengtao Ding

In this paper, the optimal consensus control problem is investigated for heterogeneous linear multi-agent systems (MASs) under a spanning tree condition, based on game theory and reinforcement learning. First, the graphical minimax game algebraic Riccati equation (ARE) is derived by converting the consensus problem into a zero-sum game problem between each agent and its neighbors. The asymptotic stability and minimax validation of the closed-loop systems are proved theoretically. Then, a data-driven off-policy reinforcement learning algorithm is proposed to learn the optimal control policy online without information about the system dynamics. A certain rank condition is established to guarantee the convergence of the proposed algorithm to the unique solution of the ARE. Finally, the effectiveness of the proposed method is demonstrated through a numerical simulation.

