Multiagent Simulation On Hide and Seek Games Using Policy Gradient Trust Region Policy Optimization

Author(s):  
Hani'ah Wafa ◽  
Judhi Santoso
2021 ◽  
pp. 1-10
Author(s):  
Wei Zhou ◽  
Xing Jiang ◽  
Bingli Guo (Member, IEEE) ◽  
Lingyu Meng

Quality-of-Service (QoS)-aware routing is currently one of the crucial challenges in Software-Defined Networking (SDN). QoS metrics such as latency, packet loss ratio, and throughput must be optimized to improve network performance. Traditional static routing algorithms based on Open Shortest Path First (OSPF) cannot adapt to traffic fluctuations, which may cause severe network congestion and service degradation. The central intelligence of the SDN controller and recent breakthroughs in Deep Reinforcement Learning (DRL) offer a promising way to tackle this challenge. We therefore propose an on-policy DRL mechanism, the Proximal Policy Optimization (PPO)-based QoS-aware Routing Optimization Mechanism (PQROM), to achieve general and re-customizable routing optimization. PQROM can dynamically update the routing calculation by adjusting the reward function according to different optimization objectives, and it is independent of any specific network pattern. Additionally, as a black-box one-step optimization, PQROM handles both continuous and discrete action spaces with high-dimensional inputs and outputs. OMNeT++ simulation results show that PQROM not only converges well, but also offers better stability than OSPF, less training time and simpler hyper-parameter tuning than Deep Deterministic Policy Gradient (DDPG), and lower hardware consumption than Asynchronous Advantage Actor-Critic (A3C).
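A minimal sketch of the kind of re-weightable QoS reward the abstract describes, where latency, loss, and throughput terms can be re-prioritized without touching the rest of the training loop. The weights, normalizers, and function name below are illustrative assumptions, not the authors' formulation.

```python
def qos_reward(latency_ms, loss_ratio, throughput_mbps,
               w_latency=0.4, w_loss=0.3, w_throughput=0.3,
               latency_ref=100.0, throughput_ref=1000.0):
    """Combine normalized QoS metrics into a single scalar reward (illustrative)."""
    r_latency = 1.0 - min(latency_ms / latency_ref, 1.0)        # lower latency is better
    r_loss = 1.0 - min(loss_ratio, 1.0)                          # lower loss is better
    r_throughput = min(throughput_mbps / throughput_ref, 1.0)    # higher throughput is better
    return w_latency * r_latency + w_loss * r_loss + w_throughput * r_throughput
```

Re-customizing the optimization objective then amounts to changing the weights, e.g. a latency-critical service might use w_latency=0.8, w_loss=0.1, w_throughput=0.1.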


2021 ◽  
pp. 1-18
Author(s):  
R.U. Hameed ◽  
A. Maqsood ◽  
A.J. Hashmi ◽  
M.T. Saeed ◽  
R. Riaz

Abstract: This paper discusses the use of deep reinforcement learning algorithms to obtain optimal paths for an aircraft to avoid or minimise radar detection and tracking. A modular approach is adopted to formulate the problem, comprising the aircraft kinematics model, the aircraft radar cross-section model, and the radar tracking model. A virtual environment is designed for single- and multiple-radar cases to obtain optimal paths. The optimal trajectories are generated through deep reinforcement learning. Specifically, three algorithms, namely deep deterministic policy gradient, trust region policy optimisation, and proximal policy optimisation, are used to find optimal paths for five test cases, and they are compared on the basis of six performance indicators. The investigation demonstrates the value of these reinforcement learning algorithms for optimal path planning. The results indicate that the proximal policy optimisation approach generally produced better optimal paths.
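As an illustration only, a per-step reward of the general shape the abstract implies would trade progress toward the goal against radar detection risk. The detection proxy, weights, and argument names below are assumptions, not the paper's models.

```python
import math

def step_reward(aircraft_pos, goal_pos, radars, rcs, w_path=1.0, w_detect=5.0):
    """Penalize distance to the goal plus a simple range/RCS-based detection proxy."""
    dist_to_goal = math.dist(aircraft_pos, goal_pos)
    detect_risk = 0.0
    for radar_pos in radars:
        r = max(math.dist(aircraft_pos, radar_pos), 1e-6)
        detect_risk += rcs / r ** 4   # radar-equation-style 1/R^4 falloff (illustrative)
    return -w_path * dist_to_goal - w_detect * detect_risk
```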


Author(s):  
Pei-Hua Huang ◽  
Osamu Hasegawa

This study presents an aerial robotics application of deep reinforcement learning that applies an asynchronous learning framework and trust region policy optimization to a simulated quad-rotor helicopter (quadcopter) environment. In particular, we optimized a control policy asynchronously through interaction with concurrent instances of the environment. The control system was benchmarked and extended with examples to tackle continuous state-action tasks for the quadcopter: hovering control and balancing an inverted pole. Performing these maneuvers required continuous actions for sensitive control of small acceleration changes of the quadcopter, thereby maximizing the scalar reward of the defined tasks. The simulation results demonstrated improved learning speed and reliability on both tasks.
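A minimal sketch of the asynchronous pattern the abstract describes: several concurrent environment instances collect rollouts before a single trust-region policy update. The environment factory and policy loader names are assumptions, not the authors' implementation.

```python
from multiprocessing import Pool

def collect_rollout(seed):
    """Run one episode in an independent environment copy with the current policy."""
    env = make_quadcopter_env(seed)      # assumed environment factory
    policy = load_current_policy()       # assumed shared/serialized policy snapshot
    obs, done, trajectory = env.reset(), False, []
    while not done:
        action = policy.act(obs)         # continuous attitude/thrust commands
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory

if __name__ == "__main__":
    with Pool(processes=8) as pool:                  # 8 concurrent environment workers
        batch = pool.map(collect_rollout, range(8))
    # trpo_update(policy, batch)  # one synchronous trust-region step on the pooled data
```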


Author(s):  
Hanbo Zhang ◽  
Site Bai ◽  
Xuguang Lan ◽  
David Hsu ◽  
Nanning Zheng

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL-divergence constraint on the trust region. QKL reduces variance in KL-divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
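A hedged sketch of a quadratic KL surrogate of the general form the abstract's QKL suggests, computed from per-action log-probabilities; this is an illustrative estimator under assumed notation, not necessarily the paper's exact derivation.

```python
import torch

def quadratic_kl(logp_old: torch.Tensor, logp_new: torch.Tensor) -> torch.Tensor:
    """0.5 * E[(log pi_new - log pi_old)^2] over actions sampled from pi_old."""
    return 0.5 * (logp_new - logp_old).pow(2).mean()
```

In a TRPO-style update, such a surrogate would stand in for the exact KL term in the trust-region constraint, e.g. quadratic_kl(logp_old, logp_new) <= delta.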


2019 ◽  
Vol 2 (5) ◽  
Author(s):  
Yuankai Wu ◽  
Huachun Tan ◽  
Jiankun Peng ◽  
Bin Ran

Car-following (CF) models are an appealing research area because they fundamentally describe the longitudinal interactions of vehicles on the road and contribute significantly to an understanding of traffic flow. There is an emerging trend of using data-driven methods to build CF models. One challenge for data-driven CF models is achieving optimal longitudinal driving behavior, because many bad driving behaviors are learned from human drivers when training in a supervised manner. In this study, a CF model for electric vehicles (EVs) is built using the deep reinforcement learning (DRL) technique trust region policy optimization (TRPO). The proposed CF model can learn optimal driving behavior by itself in simulation. Experiments on following a standard driving cycle show that the DRL-based model outperforms the traditional CF model in terms of electricity consumption.
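An illustrative sketch (assumed names, units, and weights, not the paper's code) of the kind of observation and reward a TRPO-trained car-following agent for an EV might use, trading gap-keeping against electricity consumption.

```python
def cf_state(gap_m, ego_speed_mps, lead_speed_mps):
    """Observation for the following agent: spacing, own speed, relative speed."""
    return (gap_m, ego_speed_mps, lead_speed_mps - ego_speed_mps)

def cf_reward(gap_m, energy_wh_per_step, desired_gap_m=30.0,
              w_gap=1.0, w_energy=0.05):
    """Penalize deviation from a desired gap and per-step electricity use."""
    return -w_gap * abs(gap_m - desired_gap_m) - w_energy * energy_wh_per_step
```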


Author(s):  
Zifei Jiang ◽  
Alan F. Lynch

We present a deep neural net-based controller trained by a model-free reinforcement learning (RL) algorithm to achieve hover stabilization for a quadrotor unmanned aerial vehicle (UAV). With RL, two neural nets are trained. One neural net is used as a stochastic controller that gives the distribution over control inputs. The other maps the UAV state to a scalar that estimates the reward of the controller. A proximal policy optimization (PPO) method, which is an actor-critic policy gradient approach, is used to train the neural nets. Simulation results show that the trained controller achieves a level of performance comparable to a manually tuned PID controller, despite not depending on any model information. The paper considers different choices of reward function and their influence on controller performance.
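A minimal sketch of the two networks the abstract describes, under assumed sizes: a stochastic actor that outputs a Gaussian distribution over rotor commands, and a critic that maps the UAV state to a scalar estimate. These are not the authors' exact architectures.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=12, action_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())  # distribution over controls

class Critic(nn.Module):
    def __init__(self, state_dim=12, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)   # scalar estimate per state
```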


Author(s):  
Hima Keerthi Sagiraju ◽  
Shashi Mogalla

Designing trading strategies that maximize profit by tracking and responding to dynamic stock market variations is a complex task. This paper proposes to use a multilayer perceptron (a type of artificial neural network (ANN)) to deploy deep reinforcement learning strategies that learn to predict and analyze stock market products with the aim of maximizing profit. We trained a deep reinforcement learning agent using four algorithms: proximal policy optimization (PPO), deep Q-learning (DQN), deep deterministic policy gradient (DDPG), and advantage actor-critic (A2C). The proposed system, comprising these algorithms, is tested using real-time stock data for two products: the Dow Jones index (DJIA) and Qualcomm shares. The performance of the agent under each algorithm was evaluated, compared, and analyzed using the Sharpe ratio, Sortino ratio, skew, and kurtosis, leading to the most effective algorithm being chosen. Based on these metrics, the algorithm that maximizes profit for the respective financial product was determined. We also extended the same approach to assess the predictive performance of the algorithms when trading under highly volatile scenarios, such as the coronavirus disease 2019 (COVID-19) pandemic.
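A sketch of the evaluation side only: computing the Sharpe and Sortino ratios, skew, and kurtosis from a series of periodic returns. The annualization factor and risk-free rate are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def evaluate_returns(returns, risk_free=0.0, periods_per_year=252):
    """Summarize a return series with the four metrics used for comparison."""
    r = np.asarray(returns, dtype=float) - risk_free
    sharpe = np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)
    downside = r[r < 0]
    downside_std = downside.std(ddof=1) if downside.size > 1 else np.nan
    sortino = np.sqrt(periods_per_year) * r.mean() / downside_std
    return {"sharpe": sharpe,
            "sortino": sortino,
            "skew": stats.skew(r),
            "kurtosis": stats.kurtosis(r)}
```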


Author(s):  
Emmanuel Ifeanyi Iroegbu ◽  
Devaraj Madhavi

Deep reinforcement learning has been successful in solving common autonomous driving tasks such as lane-keeping by simply using pixel data from the front-view camera as input. However, raw pixel data constitutes a very high-dimensional observation that degrades the learning quality of the agent due to the complexity imposed by a 'realistic' urban environment. We therefore investigate how compressing the raw pixel data from a high-dimensional state to a low-dimensional latent space offline, using a variational autoencoder, can significantly improve the training of a deep reinforcement learning agent. We evaluated our method on a simulated autonomous vehicle in CARLA (Car Learning to Act) and compared our results with several baselines, including deep deterministic policy gradient, proximal policy optimization, and soft actor-critic. The results show that the method greatly accelerates training and yields a marked improvement in the quality of the deep reinforcement learning agent.
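A hedged sketch of the kind of convolutional VAE encoder the abstract describes, trained offline to compress front-camera frames into a low-dimensional latent vector that then serves as the RL agent's observation. Layer sizes and the input resolution are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # e.g. 3x96x96 input (assumed)
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Flatten())
        self.mu = nn.LazyLinear(latent_dim)        # mean of q(z|x)
        self.logvar = nn.LazyLinear(latent_dim)    # log-variance of q(z|x)

    def forward(self, frame):
        h = self.conv(frame)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return z, mu, logvar

# At RL training time only the (frozen) encoder would be used, e.g.:
#   latent_obs, _, _ = encoder(frame)   # low-dimensional state fed to DDPG/PPO/SAC
```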


2021 ◽  
Author(s):  
Daniel Bennett ◽  
Yael Niv ◽  
Angela Langdon

Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
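A toy sketch of the distinction the review draws: a policy-gradient (REINFORCE) learner on a two-armed bandit updates action preferences directly from reward, with no intermediate value estimates. The reward probabilities and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_reward = [0.3, 0.7]          # assumed reward probability of each action
theta = np.zeros(2)            # action preferences (policy parameters)
alpha = 0.1                    # learning rate

for _ in range(2000):
    policy = np.exp(theta) / np.exp(theta).sum()   # softmax over preferences
    a = rng.choice(2, p=policy)
    r = float(rng.random() < p_reward[a])
    grad = -policy                                  # d log pi(a) / d theta for softmax
    grad[a] += 1.0
    theta += alpha * r * grad                       # REINFORCE update: no values learned

print(np.exp(theta) / np.exp(theta).sum())  # policy typically concentrates on the better action
```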

