Average-Reward Reinforcement Learning with Trust Region Methods

Most of reinforcement learning algorithms optimize the discounted criterion which is beneficial to accelerate the convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as financial related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. Firstly, we develop a unified trust region theory with discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived with the Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves the value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first one to study the trust region approach with the average criterion and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.

Download Full-text

Proximal Policy Optimization Through a Deep Reinforcement Learning Framework for Multiple Autonomous Vehicles at a Non-Signalized Intersection

Applied Sciences ◽

10.3390/app10165722 ◽

2020 ◽

Vol 10 (16) ◽

pp. 5722 ◽

Cited By ~ 1

Author(s):

Duy Quang Tran ◽

Sang-Hoon Bae

Keyword(s):

Reinforcement Learning ◽

Autonomous Vehicles ◽

Autonomous Vehicle ◽

Signalized Intersection ◽

Continuous Control ◽

Simulation Performance ◽

Perceptron Algorithm ◽

Learning Framework ◽

Positive Effects ◽

Policy Optimization

Advanced deep reinforcement learning shows promise as an approach to addressing continuous control tasks, especially in mixed-autonomy traffic. In this study, we present a deep reinforcement-learning-based model that considers the effectiveness of leading autonomous vehicles in mixed-autonomy traffic at a non-signalized intersection. This model integrates the Flow framework, the simulation of urban mobility simulator, and a reinforcement learning library. We also propose a set of proximal policy optimization hyperparameters to obtain reliable simulation performance. First, the leading autonomous vehicles at the non-signalized intersection are considered with varying autonomous vehicle penetration rates that range from 10% to 100% in 10% increments. Second, the proximal policy optimization hyperparameters are input into the multiple perceptron algorithm for the leading autonomous vehicle experiment. Finally, the superiority of the proposed model is evaluated using all human-driven vehicle and leading human-driven vehicle experiments. We demonstrate that full-autonomy traffic can improve the average speed and delay time by 1.38 times and 2.55 times, respectively, compared with all human-driven vehicle experiments. Our proposed method generates more positive effects when the autonomous vehicle penetration rate increases. Additionally, the leading autonomous vehicle experiment can be used to dissipate the stop-and-go waves at a non-signalized intersection.

Download Full-text

A Deep Reinforcement Learning Based Car Following Model for Electric Vehicle

智能城市应用 ◽

10.33142/sca.v2i5.813 ◽

2019 ◽

Vol 2 (5) ◽

Author(s):

Yuankai Wu ◽

Huachun Tan ◽

Jiankun Peng ◽

Bin Ran

Keyword(s):

Reinforcement Learning ◽

Electric Vehicle ◽

Electricity Consumption ◽

Trust Region ◽

Research Area ◽

Data Driven ◽

Car Following ◽

Car Following Model ◽

On The Road ◽

Policy Optimization

Car following (CF) models are an appealing research area because they fundamentally describe longitudinal interactions of vehicles on the road, and contribute significantly to an understanding of traffic flow. There is an emerging trend to use data-driven method to build CF models. One challenge to the data-driven CF models is their capability to achieve optimal longitudinal driven behavior because a lot of bad driving behaviors will be learnt from human drivers by the supervised learning manner. In this study, by utilizing the deep reinforcement learning (DRL) techniques trust region policy optimization (TRPO), a DRL based CF model for electric vehicle (EV) is built. The proposed CF model can learn optimal driving behavior by itself in simulation. The experiments on following standard driving cycle show that the DRL model outperforms the traditional CF model in terms of electricity consumption.

Download Full-text

Data-Driven Online Energy Scheduling of a Microgrid Based on Deep Reinforcement Learning

Energies ◽

10.3390/en14082120 ◽

2021 ◽

Vol 14 (8) ◽

pp. 2120

Author(s):

Ying Ji ◽

Jianhui Wang ◽

Jiacan Xu ◽

Donglin Li

Keyword(s):

Reinforcement Learning ◽

Operating Cost ◽

Online Scheduling ◽

Optimal Scheduling ◽

Data Driven ◽

High Dimensional ◽

Continuous Control ◽

Renewable Energy Resources ◽

Continuous Actions ◽

Policy Optimization

The proliferation of distributed renewable energy resources (RESs) poses major challenges to the operation of microgrids due to uncertainty. Traditional online scheduling approaches relying on accurate forecasts become difficult to implement due to the increase of uncertain RESs. Although several data-driven methods have been proposed recently to overcome the challenge, they generally suffer from a scalability issue due to the limited ability to optimize high-dimensional continuous control variables. To address these issues, we propose a data-driven online scheduling method for microgrid energy optimization based on continuous-control deep reinforcement learning (DRL). We formulate the online scheduling problem as a Markov decision process (MDP). The objective is to minimize the operating cost of the microgrid considering the uncertainty of RESs generation, load demand, and electricity prices. To learn the optimal scheduling strategy, a Gated Recurrent Unit (GRU)-based network is designed to extract temporal features of uncertainty and generate the optimal scheduling decisions in an end-to-end manner. To optimize the policy with high-dimensional and continuous actions, proximal policy optimization (PPO) is employed to train the neural network-based policy in a data-driven fashion. The proposed method does not require any forecasting information on the uncertainty or a prior knowledge of the physical model of the microgrid. Simulation results using realistic power system data of California Independent System Operator (CAISO) demonstrate the effectiveness of the proposed method.

Download Full-text

Missile guidance with assisted deep reinforcement learning for head-on interception of maneuvering target

Complex & Intelligent Systems ◽

10.1007/s40747-021-00577-6 ◽

2021 ◽

Author(s):

Weifan Li ◽

Yuanheng Zhu ◽

Dongbin Zhao

Keyword(s):

Reinforcement Learning ◽

Learning Task ◽

Imitation Learning ◽

Continuous Control ◽

Missile Guidance ◽

Maneuvering Target ◽

Detection Delay ◽

The Neural Network ◽

The Mean ◽

Policy Optimization

AbstractIn missile guidance, pursuit performance is seriously degraded due to the uncertainty and randomness in target maneuverability, detection delay, and environmental noise. In many methods, accurately estimating the acceleration of the target or the time-to-go is needed to intercept the maneuvering target, which is hard in an environment with uncertainty. In this paper, we propose an assisted deep reinforcement learning (ARL) algorithm to optimize the neural network-based missile guidance controller for head-on interception. Based on the relative velocity, distance, and angle, ARL can control the missile to intercept the maneuvering target and achieve large terminal intercept angle. To reduce the influence of environmental uncertainty, ARL predicts the target’s acceleration as an auxiliary supervised task. The supervised learning task improves the ability of the agent to extract information from observations. To exploit the agent’s good trajectories, ARL presents the Gaussian self-imitation learning to make the mean of action distribution approach the agent’s good actions. Compared with vanilla self-imitation learning, Gaussian self-imitation learning improves the exploration in continuous control. Simulation results validate that ARL outperforms traditional methods and proximal policy optimization algorithm with higher hit rate and larger terminal intercept angle in the simulation environment with noise, delay, and maneuverable target.

Download Full-text

An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning

IEEE Transactions on Neural Networks and Learning Systems ◽

10.1109/tnnls.2020.3044196 ◽

2021 ◽

pp. 1-13

Author(s):

Wenjia Meng ◽

Qian Zheng ◽

Yue Shi ◽

Gang Pan

Keyword(s):

Reinforcement Learning ◽

Trust Region ◽

Optimization Method ◽

Policy Optimization

Download Full-text

Aero-engine acceleration control using deep reinforcement learning with phase-based reward function

Proceedings of the Institution of Mechanical Engineers Part G Journal of Aerospace Engineering ◽

10.1177/09544100211046225 ◽

2021 ◽

pp. 095441002110462

Author(s):

Qian-Kun Hu ◽

Yong-Ping Zhao

Keyword(s):

Reinforcement Learning ◽

Trust Region ◽

Engine Control ◽

Control Task ◽

Q Learning ◽

Reward Function ◽

Engine Control System ◽

Aero Engine ◽

Markov Decision ◽

Policy Optimization

In this paper, the conventional aero-engine acceleration control task is formulated into a Markov Decision Process (MDP) problem. Then, a novel phase-based reward function is proposed to enhance the performance of deep reinforcement learning (DRL) in solving feedback control tasks. With that reward function, an aero-engine controller based on Trust Region Policy Optimization (TRPO) is developed to improve the aero-engine acceleration performance. Four comparison simulations were conducted to verify the effectiveness of the proposed methods. The simulation results show that the phase-based reward function helps to eliminate the oscillation problem of the aero-engine control system, which is caused by the traditional goal-based reward function when DRL is applied to the aero-engine control. And the TRPO controller outperforms deep Q-learning (DQN) and the proportional-integral-derivative (PID) in the aero-engine acceleration control task. Compared to DQN and PID controller, the acceleration time of aero-engine is decreased by 0.6 and 2.58 s, respectively, and the aero-engine acceleration performance is improved by 16.8 and 46.4 % each.

Download Full-text

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6021 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5668-5675

Author(s):

Lior Shani ◽

Yonathan Efroni ◽

Shie Mannor

Keyword(s):

Reinforcement Learning ◽

Global Convergence ◽

Search Algorithm ◽

Trust Region ◽

Global Optimum ◽

Trust Region Methods ◽

Policy Search ◽

Conservative Policy ◽

Policy Optimization ◽

Fast Rates

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be ‘close’ to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish Õ(1/√N) convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of Õ(1/N), much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.

Download Full-text

Air Learning: a deep reinforcement learning gym for autonomous aerial robot visual navigation

Machine Learning ◽

10.1007/s10994-021-06006-6 ◽

2021 ◽

Author(s):

Srivatsan Krishnan ◽

Behzad Boroujerdian ◽

William Fu ◽

Aleksandra Faust ◽

Vijay Janapa Reddi

Keyword(s):

Reinforcement Learning ◽

Embedded System ◽

Broad Class ◽

Visual Navigation ◽

Raspberry Pi ◽

Latency Distribution ◽

Hardware In The Loop ◽

Resource Constrained ◽

Aerial Robot ◽

Policy Optimization

AbstractWe introduce Air Learning, an open-source simulator, and a gym environment for deep reinforcement learning research on resource-constrained aerial robots. Equipped with domain randomization, Air Learning exposes a UAV agent to a diverse set of challenging scenarios. We seed the toolset with point-to-point obstacle avoidance tasks in three different environments and Deep Q Networks (DQN) and Proximal Policy Optimization (PPO) trainers. Air Learning assesses the policies’ performance under various quality-of-flight (QoF) metrics, such as the energy consumed, endurance, and the average trajectory length, on resource-constrained embedded platforms like a Raspberry Pi. We find that the trajectories on an embedded Ras-Pi are vastly different from those predicted on a high-end desktop system, resulting in up to $$40\%$$ 40 % longer trajectories in one of the environments. To understand the source of such discrepancies, we use Air Learning to artificially degrade high-end desktop performance to mimic what happens on a low-end embedded system. We then propose a mitigation technique that uses the hardware-in-the-loop to determine the latency distribution of running the policy on the target platform (onboard compute on aerial robot). A randomly sampled latency from the latency distribution is then added as an artificial delay within the training loop. Training the policy with artificial delays allows us to minimize the hardware gap (discrepancy in the flight time metric reduced from 37.73% to 0.5%). Thus, Air Learning with hardware-in-the-loop characterizes those differences and exposes how the onboard compute’s choice affects the aerial robot’s performance. We also conduct reliability studies to assess the effect of sensor failures on the learned policies. All put together, Air Learning enables a broad class of deep RL research on UAVs. The source code is available at: https://github.com/harvard-edge/AirLearning.

Download Full-text

An Efficiency Enhancing Methodology for Multiple Autonomous Vehicles in an Urban Network Adopting Deep Reinforcement Learning

Applied Sciences ◽

10.3390/app11041514 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1514 ◽

Cited By ~ 2

Author(s):

Quang-Duy Tran ◽

Sang-Hoon Bae

Keyword(s):

Reinforcement Learning ◽

Traffic Congestion ◽

Autonomous Vehicles ◽

Penetration Rate ◽

Autonomous Vehicle ◽

Effective Means ◽

Urban Network ◽

Learning Agents ◽

Policy Optimization ◽

The Impact

To reduce the impact of congestion, it is necessary to improve our overall understanding of the influence of the autonomous vehicle. Recently, deep reinforcement learning has become an effective means of solving complex control tasks. Accordingly, we show an advanced deep reinforcement learning that investigates how the leading autonomous vehicles affect the urban network under a mixed-traffic environment. We also suggest a set of hyperparameters for achieving better performance. Firstly, we feed a set of hyperparameters into our deep reinforcement learning agents. Secondly, we investigate the leading autonomous vehicle experiment in the urban network with different autonomous vehicle penetration rates. Thirdly, the advantage of leading autonomous vehicles is evaluated using entire manual vehicle and leading manual vehicle experiments. Finally, the proximal policy optimization with a clipped objective is compared to the proximal policy optimization with an adaptive Kullback–Leibler penalty to verify the superiority of the proposed hyperparameter. We demonstrate that full automation traffic increased the average speed 1.27 times greater compared with the entire manual vehicle experiment. Our proposed method becomes significantly more effective at a higher autonomous vehicle penetration rate. Furthermore, the leading autonomous vehicles could help to mitigate traffic congestion.

Download Full-text

Fully distributed actor-critic architecture for multitask deep reinforcement learning

The Knowledge Engineering Review ◽

10.1017/s0269888921000023 ◽

2021 ◽

Vol 36 ◽

Author(s):

Sergio Valcarcel Macua ◽

Ian Davies ◽

Aleksi Tukiainen ◽

Enrique Munoz de Cote

Keyword(s):

Neural Network ◽

Reinforcement Learning ◽

Duality Theory ◽

Deep Neural Network ◽

Original Problem ◽

Almost Sure Convergence ◽

Continuous Control ◽

Access Data ◽

Central Station ◽

Common Policy

Abstract We propose a fully distributed actor-critic architecture, named diffusion-distributed-actor-critic Diff-DAC, with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than the overall number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual-ascent method. We prove almost sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep neural network approximations. For more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation properties than previous architectures.

Download Full-text