Accelerating the training of deep reinforcement learning in  autonomous driving

Deep reinforcement learning has been successful in solving common autonomous driving tasks such as lane-keeping by simply using pixel data from the front view camera as input. However, raw pixel data contains a very high-dimensional observation that affects the learning quality of the agent due to the complexity imposed by a 'realistic' urban environment. Ergo, we investigate how compressing the raw pixel data from high-dimensional state to low-dimensional latent space offline using a variational autoencoder can significantly improve the training of a deep reinforcement learning agent. We evaluated our method on a simulated autonomous vehicle in car learning to act and compared our results with many baselines including deep deterministic policy gradient, proximal policy optimization, and soft actorcritic. The result shows that the method greatly accelerates the training time and there was a remarkable improvement in the quality of the deep reinforcement learning agent.

Download Full-text

Controlling Agents by Constrained Policy Updates

SYSTEM THEORY, CONTROL AND COMPUTING JOURNAL ◽

10.52846/stccj.2021.1.2.24 ◽

2021 ◽

Vol 1 (2) ◽

pp. 33-39

Author(s):

Mónika Farsang ◽

Luca Szegletes

Keyword(s):

Gradient Methods ◽

Poor Performance ◽

High Dimensional ◽

Complex Behavior ◽

Clear Trend ◽

Learned Behavior ◽

Optimal Behavior ◽

Policy Gradient ◽

Low Dimensional ◽

Policy Optimization

Learning the optimal behavior is the ultimate goal in reinforcement learning. This can be achieved by many different approaches, the most successful of them are policy gradient methods. However, they can suffer from undesirably large updates of policies, leading to poor performance. In recent years there has been a clear trend toward designing more reliable algorithms. This paper addresses to examine different restriction strategies applied to the widely used Proximal Policy Optimization (PPO-Clip) technique. We also question whether the analyzed methods are able to adapt not only to low-dimensional tasks but also to complex, high-dimensional problems in control and robotic domains. The analysis of the learned behavior shows that these methods can lead to better performance compared to the original PPO-Clip algorithm, moreover, they are also able to achieve complex behavior and policies in high-dimensional environments.

Download Full-text

An Efficiency Enhancing Methodology for Multiple Autonomous Vehicles in an Urban Network Adopting Deep Reinforcement Learning

Applied Sciences ◽

10.3390/app11041514 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1514 ◽

Cited By ~ 2

Author(s):

Quang-Duy Tran ◽

Sang-Hoon Bae

Keyword(s):

Reinforcement Learning ◽

Traffic Congestion ◽

Autonomous Vehicles ◽

Penetration Rate ◽

Autonomous Vehicle ◽

Effective Means ◽

Urban Network ◽

Learning Agents ◽

Policy Optimization ◽

The Impact

To reduce the impact of congestion, it is necessary to improve our overall understanding of the influence of the autonomous vehicle. Recently, deep reinforcement learning has become an effective means of solving complex control tasks. Accordingly, we show an advanced deep reinforcement learning that investigates how the leading autonomous vehicles affect the urban network under a mixed-traffic environment. We also suggest a set of hyperparameters for achieving better performance. Firstly, we feed a set of hyperparameters into our deep reinforcement learning agents. Secondly, we investigate the leading autonomous vehicle experiment in the urban network with different autonomous vehicle penetration rates. Thirdly, the advantage of leading autonomous vehicles is evaluated using entire manual vehicle and leading manual vehicle experiments. Finally, the proximal policy optimization with a clipped objective is compared to the proximal policy optimization with an adaptive Kullback–Leibler penalty to verify the superiority of the proposed hyperparameter. We demonstrate that full automation traffic increased the average speed 1.27 times greater compared with the entire manual vehicle experiment. Our proposed method becomes significantly more effective at a higher autonomous vehicle penetration rate. Furthermore, the leading autonomous vehicles could help to mitigate traffic congestion.

Download Full-text

PQROM: To optimize software defined network QoS-aware routing with proximal policy optimization

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-211787 ◽

2021 ◽

pp. 1-10

Author(s):

Wei Zhou ◽

Xing Jiang ◽

Bingli Guo (Member, IEEE) ◽

Lingyu Meng

Keyword(s):

Software Defined Network ◽

Training Time ◽

Good Convergence ◽

Reward Function ◽

Routing Optimization ◽

Policy Gradient ◽

Discrete Action ◽

Policy Optimization ◽

Optimization Mechanism ◽

Network Pattern

Currently, Quality-of-Service (QoS)-aware routing is one of the crucial challenges in Software Defined Network (SDN). The QoS performances, e.g. latency, packet loss ratio and throughput, must be optimized to improve the performance of network. Traditional static routing algorithms based on Open Shortest Path First (OSPF) could not adapt to traffic fluctuation, which may cause severe network congestion and service degradation. Central intelligence of SDN controller and recent breakthroughs of Deep Reinforcement Learning (DRL) pose a promising solution to tackle this challenge. Thus, we propose an on-policy DRL mechanism, namely the PPO-based (Proximal Policy Optimization) QoS-aware Routing Optimization Mechanism (PQROM), to achieve a general and re-customizable routing optimization. PQROM can dynamically update the routing calculation by adjusting the reward function according to different optimization objectives, and it is independent of any specific network pattern. Additionally, as a black-box one-step optimization, PQROM is qualified for both continuous and discrete action space with high-dimensional input and output. The OMNeT ++ simulation experiment results show that PQROM not only has good convergence, but also has better stability compared with OSPF, less training time and simpler hyper-parameters adjustment than Deep Deterministic Policy Gradient (DDPG) and less hardware consumption than Asynchronous Advantage Actor-Critic (A3C).

Download Full-text

Deep reinforcement learning based control for Autonomous Vehicles in CARLA

Multimedia Tools and Applications ◽

10.1007/s11042-021-11437-3 ◽

2022 ◽

Author(s):

Óscar Pérez-Gil ◽

Rafael Barea ◽

Elena López-Guillén ◽

Luis M. Bergasa ◽

Carlos Gómez-Huélamo ◽

...

Keyword(s):

Reinforcement Learning ◽

Autonomous Vehicles ◽

Autonomous Vehicle ◽

Vehicle Control ◽

Data Sources ◽

Simulation Environment ◽

Urban Simulation ◽

Policy Gradient ◽

Almost All ◽

Control Layer

AbstractNowadays, Artificial Intelligence (AI) is growing by leaps and bounds in almost all fields of technology, and Autonomous Vehicles (AV) research is one more of them. This paper proposes the using of algorithms based on Deep Learning (DL) in the control layer of an autonomous vehicle. More specifically, Deep Reinforcement Learning (DRL) algorithms such as Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) are implemented in order to compare results between them. The aim of this work is to obtain a trained model, applying a DRL algorithm, able of sending control commands to the vehicle to navigate properly and efficiently following a determined route. In addition, for each of the algorithms, several agents are presented as a solution, so that each of these agents uses different data sources to achieve the vehicle control commands. For this purpose, an open-source simulator such as CARLA is used, providing to the system with the ability to perform a multitude of tests without any risk into an hyper-realistic urban simulation environment, something that is unthinkable in the real world. The results obtained show that both DQN and DDPG reach the goal, but DDPG obtains a better performance. DDPG perfoms trajectories very similar to classic controller as LQR. In both cases RMSE is lower than 0.1m following trajectories with a range 180-700m. To conclude, some conclusions and future works are commented.

Download Full-text

Research on the Multiagent Joint Proximal Policy Optimization Algorithm Controlling Cooperative Fixed-Wing UAV Obstacle Avoidance

Sensors ◽

10.3390/s20164546 ◽

2020 ◽

Vol 20 (16) ◽

pp. 4546

Author(s):

Weiwei Zhao ◽

Hairong Chu ◽

Xikui Miao ◽

Lihong Guo ◽

Honghai Shen ◽

...

Keyword(s):

Reinforcement Learning ◽

Attitude Control ◽

Cooperative Control ◽

Learning Algorithm ◽

State Equations ◽

Learning Agent ◽

Environmental Adaptability ◽

Decentralized Execution ◽

Policy Optimization ◽

Multi Uav

Multiple unmanned aerial vehicle (UAV) collaboration has great potential. To increase the intelligence and environmental adaptability of multi-UAV control, we study the application of deep reinforcement learning algorithms in the field of multi-UAV cooperative control. Aiming at the problem of a non-stationary environment caused by the change of learning agent strategy in reinforcement learning in a multi-agent environment, the paper presents an improved multiagent reinforcement learning algorithm—the multiagent joint proximal policy optimization (MAJPPO) algorithm with the centralized learning and decentralized execution. This algorithm uses the moving window averaging method to make each agent obtain a centralized state value function, so that the agents can achieve better collaboration. The improved algorithm enhances the collaboration and increases the sum of reward values obtained by the multiagent system. To evaluate the performance of the algorithm, we use the MAJPPO algorithm to complete the task of multi-UAV formation and the crossing of multiple-obstacle environments. To simplify the control complexity of the UAV, we use the six-degree of freedom and 12-state equations of the dynamics model of the UAV with an attitude control loop. The experimental results show that the MAJPPO algorithm has better performance and better environmental adaptability.

Download Full-text

Proximal Policy Optimization Through a Deep Reinforcement Learning Framework for Multiple Autonomous Vehicles at a Non-Signalized Intersection

Applied Sciences ◽

10.3390/app10165722 ◽

2020 ◽

Vol 10 (16) ◽

pp. 5722 ◽

Cited By ~ 1

Author(s):

Duy Quang Tran ◽

Sang-Hoon Bae

Keyword(s):

Reinforcement Learning ◽

Autonomous Vehicles ◽

Autonomous Vehicle ◽

Signalized Intersection ◽

Continuous Control ◽

Simulation Performance ◽

Perceptron Algorithm ◽

Learning Framework ◽

Positive Effects ◽

Policy Optimization

Advanced deep reinforcement learning shows promise as an approach to addressing continuous control tasks, especially in mixed-autonomy traffic. In this study, we present a deep reinforcement-learning-based model that considers the effectiveness of leading autonomous vehicles in mixed-autonomy traffic at a non-signalized intersection. This model integrates the Flow framework, the simulation of urban mobility simulator, and a reinforcement learning library. We also propose a set of proximal policy optimization hyperparameters to obtain reliable simulation performance. First, the leading autonomous vehicles at the non-signalized intersection are considered with varying autonomous vehicle penetration rates that range from 10% to 100% in 10% increments. Second, the proximal policy optimization hyperparameters are input into the multiple perceptron algorithm for the leading autonomous vehicle experiment. Finally, the superiority of the proposed model is evaluated using all human-driven vehicle and leading human-driven vehicle experiments. We demonstrate that full-autonomy traffic can improve the average speed and delay time by 1.38 times and 2.55 times, respectively, compared with all human-driven vehicle experiments. Our proposed method generates more positive effects when the autonomous vehicle penetration rate increases. Additionally, the leading autonomous vehicle experiment can be used to dissipate the stop-and-go waves at a non-signalized intersection.

Download Full-text

Quadrotor Motion Control Using Deep Reinforcement Learning

Journal of Unmanned Vehicle Systems ◽

10.1139/juvs-2021-0010 ◽

2021 ◽

Author(s):

Zifei Jiang ◽

Alan F. Lynch

Keyword(s):

Reinforcement Learning ◽

Neural Nets ◽

Neural Net ◽

Reward Function ◽

Model Free ◽

Policy Gradient ◽

Aerial Vehicle ◽

Stochastic Controller ◽

Policy Optimization ◽

Gradient Approach

We present a deep neural net-based controller trained by a model-free reinforcement learning (RL) algorithm to achieve hover stabilization for a quadrotor unmanned aerial vehicle (UAV). With RL, two neural nets are trained. One neural net is used as a stochastic controller which gives the distribution of control inputs. The other maps the UAV state to a scalar which estimates the reward of the controller. A proximal policy optimization (PPO) method, which is an actor-critic policy gradient approach, is used to train the neural nets. Simulation results show that the trained controller achieves a comparable level of performance to a manually-tuned PID controller, despite not depending on any model information. The paper considers different choices of reward function and their influence on controller performance.

Download Full-text

A Pivot-Based Routine for Improved Parent-Finding in Hybrid MDS

Information Visualization ◽

10.1057/palgrave.ivs.9500069 ◽

2004 ◽

Vol 3 (2) ◽

pp. 109-122 ◽

Cited By ~ 18

Author(s):

Alistair Morrison ◽

Matthew Chalmers

Keyword(s):

Negative Impact ◽

Dimensional Space ◽

High Dimensional ◽

Spring Model ◽

Constant Number ◽

Data Set ◽

Original Algorithm ◽

Bottle Neck ◽

Low Dimensional

The problem of exploring or visualising data of high dimensionality is central to many tools for information visualisation. Through representing a data set in terms of inter-object proximities, multidimensional scaling may be employed to generate a configuration of objects in low-dimensional space in such a way as to preserve high-dimensional relationships. An algorithm is presented here for a heuristic hybrid model for the generation of such configurations. Building on a model introduced in 2002, the algorithm functions by means of sampling, spring model and interpolation phases. The most computationally complex stage of the original algorithm involved the execution of a series of nearest-neighbour searches. In this paper, we describe how the complexity of this phase has been reduced by treating all high-dimensional relationships as a set of discretised distances to a constant number of randomly selected items: pivots. In improving this computational bottle-neck, the algorithmic complexity is reduced from O( N√N) to O( N5/4). As well as documenting this improvement, the paper describes evaluation with a data set of 108,000 13-dimensional items and a set of 23,141 17-dimensional items. Results illustrate that the reduction in complexity is reflected in significantly improved run times and that no negative impact is made upon the quality of layout produced.

Download Full-text

Object Affordance Driven Inverse Reinforcement Learning Through Conceptual Abstraction and Advice

Paladyn Journal of Behavioral Robotics ◽

10.1515/pjbr-2018-0021 ◽

2018 ◽

Vol 9 (1) ◽

pp. 277-294 ◽

Cited By ~ 1

Author(s):

Rupam Bhattacharyya ◽

Shyamanta M. Hazarika

Keyword(s):

Reinforcement Learning ◽

High Dimensional ◽

Inverse Reinforcement Learning ◽

Intent Recognition ◽

Reward Function ◽

Object Affordances ◽

Learning Agent ◽

Markov Decision ◽

Observed Behaviour ◽

Object Affordance

Abstract Within human Intent Recognition (IR), a popular approach to learning from demonstration is Inverse Reinforcement Learning (IRL). IRL extracts an unknown reward function from samples of observed behaviour. Traditional IRL systems require large datasets to recover the underlying reward function. Object affordances have been used for IR. Existing literature on recognizing intents through object affordances fall short of utilizing its true potential. In this paper, we seek to develop an IRL system which drives human intent recognition along with the capability to handle high dimensional demonstrations exploiting the capability of object affordances. An architecture for recognizing human intent is presented which consists of an extended Maximum Likelihood Inverse Reinforcement Learning agent. Inclusion of Symbolic Conceptual Abstraction Engine (SCAE) along with an advisor allows the agent to work on Conceptually Abstracted Markov Decision Process. The agent recovers object affordance based reward function from high dimensional demonstrations. This function drives a Human Intent Recognizer through identification of probable intents. Performance of the resulting system on the standard CAD-120 dataset shows encouraging result.

Download Full-text

Predicting the Future: Advantages of Semilocal Units

Neural Computation ◽

10.1162/neco.1991.3.4.566 ◽

1991 ◽

Vol 3 (4) ◽

pp. 566-578 ◽

Cited By ~ 59

Author(s):

Eric Hartman ◽

James D. Keeler

Keyword(s):

Gradient Descent ◽

High Dimensional ◽

Activation Functions ◽

Learning Problem ◽

Rbf Networks ◽

Present Evidence ◽

Training Time ◽

The Difference ◽

Low Dimensional ◽

Slow Learning

In investigating gaussian radial basis function (RBF) networks for their ability to model nonlinear time series, we have found that while RBF networks are much faster than standard sigmoid unit backpropagation for low-dimensional problems, their advantages diminish in high-dimensional input spaces. This is particularly troublesome if the input space contains irrelevant variables. We suggest that this limitation is due to the localized nature of RBFs. To gain the advantages of the highly nonlocal sigmoids and the speed advantages of RBFs, we propose a particular class of semilocal activation functions that is a natural interpolation between these two families. We present evidence that networks using these gaussian bar units avoid the slow learning problem of sigmoid unit networks, and, very importantly, are more accurate than RBF networks in the presence of irrelevant inputs. On the Mackey-Glass and Coupled Lattice Map problems, the speedup over sigmoid networks is so dramatic that the difference in training time between RBF and gaussian bar networks is minor. Gaussian bar architectures that superpose composed gaussians (gaussians-of-gaussians) to approximate the unknown function have the best performance. We postulate that an interesing behavior displayed by gaussian bar functions under gradient descent dynamics, which we call automatic connection pruning, is an important factor in the success of this representation.

Download Full-text