Multiagent Simulation On Hide and Seek Games Using Policy Gradient Trust Region Policy Optimization

Author(s):  
Hani'ah Wafa ◽  
Judhi Santoso
2021 ◽  
pp. 1-10
Author(s):  
Wei Zhou ◽  
Xing Jiang ◽  
Bingli Guo (Member, IEEE) ◽  
Lingyu Meng

Quality-of-Service (QoS)-aware routing is currently one of the crucial challenges in Software-Defined Networking (SDN). QoS metrics such as latency, packet loss ratio, and throughput must be optimized to improve network performance. Traditional static routing algorithms based on Open Shortest Path First (OSPF) cannot adapt to traffic fluctuations, which may cause severe network congestion and service degradation. The central intelligence of the SDN controller and recent breakthroughs in Deep Reinforcement Learning (DRL) offer a promising way to tackle this challenge. We therefore propose an on-policy DRL mechanism, the Proximal Policy Optimization (PPO)-based QoS-aware Routing Optimization Mechanism (PQROM), to achieve general and re-customizable routing optimization. PQROM can dynamically update the routing calculation by adjusting the reward function according to different optimization objectives, and it is independent of any specific network pattern. Additionally, as a black-box one-step optimization, PQROM handles both continuous and discrete action spaces with high-dimensional inputs and outputs. OMNeT++ simulation results show that PQROM not only converges well, but also offers better stability than OSPF, less training time and simpler hyper-parameter tuning than Deep Deterministic Policy Gradient (DDPG), and lower hardware consumption than Asynchronous Advantage Actor-Critic (A3C).
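A minimal sketch of the kind of re-weightable QoS reward the abstract describes, where latency, loss, and throughput terms can be re-prioritized without touching the rest of the training loop. The weights, normalizers, and function name below are illustrative assumptions, not the authors' formulation.

```python
def qos_reward(latency_ms, loss_ratio, throughput_mbps,
               w_latency=0.4, w_loss=0.3, w_throughput=0.3,
               latency_ref=100.0, throughput_ref=1000.0):
    """Combine normalized QoS metrics into a single scalar reward (illustrative)."""
    r_latency = 1.0 - min(latency_ms / latency_ref, 1.0)        # lower latency is better
    r_loss = 1.0 - min(loss_ratio, 1.0)                          # lower loss is better
    r_throughput = min(throughput_mbps / throughput_ref, 1.0)    # higher throughput is better
    return w_latency * r_latency + w_loss * r_loss + w_throughput * r_throughput
```

Re-customizing the optimization objective then amounts to changing the weights, e.g. a latency-critical service might use w_latency=0.8, w_loss=0.1, w_throughput=0.1.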


2021 ◽  
pp. 1-18
Author(s):  
R.U. Hameed ◽  
A. Maqsood ◽  
A.J. Hashmi ◽  
M.T. Saeed ◽  
R. Riaz

Abstract: This paper discusses the use of deep reinforcement learning algorithms to obtain optimal paths for an aircraft to avoid or minimise radar detection and tracking. A modular approach is adopted to formulate the problem, comprising the aircraft kinematics model, the aircraft radar cross-section model, and the radar tracking model. A virtual environment is designed for single- and multiple-radar cases to obtain optimal paths. The optimal trajectories are generated through deep reinforcement learning. Specifically, three algorithms, namely deep deterministic policy gradient, trust region policy optimisation, and proximal policy optimisation, are used to find optimal paths for five test cases, and they are compared on the basis of six performance indicators. The investigation demonstrates the value of these reinforcement learning algorithms for optimal path planning. The results indicate that the proximal policy optimisation approach generally produced better optimal paths.
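As an illustration only, a per-step reward of the general shape the abstract implies would trade progress toward the goal against radar detection risk. The detection proxy, weights, and argument names below are assumptions, not the paper's models.

```python
import math

def step_reward(aircraft_pos, goal_pos, radars, rcs, w_path=1.0, w_detect=5.0):
    """Penalize distance to the goal plus a simple range/RCS-based detection proxy."""
    dist_to_goal = math.dist(aircraft_pos, goal_pos)
    detect_risk = 0.0
    for radar_pos in radars:
        r = max(math.dist(aircraft_pos, radar_pos), 1e-6)
        detect_risk += rcs / r ** 4   # radar-equation-style 1/R^4 falloff (illustrative)
    return -w_path * dist_to_goal - w_detect * detect_risk
```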


Author(s):  
Pei-Hua Huang ◽  
Osamu Hasegawa

This study presents an aerial robotics application of deep reinforcement learning that applies an asynchronous learning framework and trust region policy optimization to a simulated quad-rotor helicopter (quadcopter) environment. In particular, we optimized a control policy asynchronously through interaction with concurrent instances of the environment. The control system was benchmarked and extended with examples to tackle continuous state-action tasks for the quadcopter: hovering control and balancing an inverted pole. Performing these maneuvers required continuous actions for sensitive control of small acceleration changes of the quadcopter, thereby maximizing the scalar reward of the defined tasks. The simulation results demonstrated improved learning speed and reliability on both tasks.
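A minimal sketch of the asynchronous pattern the abstract describes: several concurrent environment instances collect rollouts before a single trust-region policy update. The environment factory and policy loader names are assumptions, not the authors' implementation.

```python
from multiprocessing import Pool

def collect_rollout(seed):
    """Run one episode in an independent environment copy with the current policy."""
    env = make_quadcopter_env(seed)      # assumed environment factory
    policy = load_current_policy()       # assumed shared/serialized policy snapshot
    obs, done, trajectory = env.reset(), False, []
    while not done:
        action = policy.act(obs)         # continuous attitude/thrust commands
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory

if __name__ == "__main__":
    with Pool(processes=8) as pool:                  # 8 concurrent environment workers
        batch = pool.map(collect_rollout, range(8))
    # trpo_update(policy, batch)  # one synchronous trust-region step on the pooled data
```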


Author(s):  
Hanbo Zhang ◽  
Site Bai ◽  
Xuguang Lan ◽  
David Hsu ◽  
Nanning Zheng

Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL-divergence constraint on the trust region. QKL reduces variance in KL-divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
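A hedged sketch of a quadratic KL surrogate of the general form the abstract's QKL suggests, computed from per-action log-probabilities; this is an illustrative estimator under assumed notation, not necessarily the paper's exact derivation.

```python
import torch

def quadratic_kl(logp_old: torch.Tensor, logp_new: torch.Tensor) -> torch.Tensor:
    """0.5 * E[(log pi_new - log pi_old)^2] over actions sampled from pi_old."""
    return 0.5 * (logp_new - logp_old).pow(2).mean()
```

In a TRPO-style update, such a surrogate would stand in for the exact KL term in the trust-region constraint, e.g. quadratic_kl(logp_old, logp_new) <= delta.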


2019 ◽  
Vol 2 (5) ◽  
Author(s):  
Yuankai Wu ◽  
Huachun Tan ◽  
Jiankun Peng ◽  
Bin Ran

Car-following (CF) models are an appealing research area because they fundamentally describe the longitudinal interactions of vehicles on the road and contribute significantly to an understanding of traffic flow. There is an emerging trend of using data-driven methods to build CF models. One challenge for data-driven CF models is achieving optimal longitudinal driving behavior, because many bad driving behaviors are learned from human drivers when training in a supervised manner. In this study, a CF model for electric vehicles (EVs) is built using the deep reinforcement learning (DRL) technique trust region policy optimization (TRPO). The proposed CF model can learn optimal driving behavior by itself in simulation. Experiments on following a standard driving cycle show that the DRL-based model outperforms the traditional CF model in terms of electricity consumption.
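An illustrative sketch (assumed names, units, and weights, not the paper's code) of the kind of observation and reward a TRPO-trained car-following agent for an EV might use, trading gap-keeping against electricity consumption.

```python
def cf_state(gap_m, ego_speed_mps, lead_speed_mps):
    """Observation for the following agent: spacing, own speed, relative speed."""
    return (gap_m, ego_speed_mps, lead_speed_mps - ego_speed_mps)

def cf_reward(gap_m, energy_wh_per_step, desired_gap_m=30.0,
              w_gap=1.0, w_energy=0.05):
    """Penalize deviation from a desired gap and per-step electricity use."""
    return -w_gap * abs(gap_m - desired_gap_m) - w_energy * energy_wh_per_step
```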


Author(s):  
Zifei Jiang ◽  
Alan F. Lynch

We present a deep neural net-based controller trained by a model-free reinforcement learning (RL) algorithm to achieve hover stabilization for a quadrotor unmanned aerial vehicle (UAV). With RL, two neural nets are trained. One neural net is used as a stochastic controller that gives the distribution over control inputs. The other maps the UAV state to a scalar that estimates the reward of the controller. A proximal policy optimization (PPO) method, which is an actor-critic policy gradient approach, is used to train the neural nets. Simulation results show that the trained controller achieves a level of performance comparable to a manually tuned PID controller, despite not depending on any model information. The paper considers different choices of reward function and their influence on controller performance.
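A minimal sketch of the two networks the abstract describes, under assumed sizes: a stochastic actor that outputs a Gaussian distribution over rotor commands, and a critic that maps the UAV state to a scalar estimate. These are not the authors' exact architectures.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=12, action_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())  # distribution over controls

class Critic(nn.Module):
    def __init__(self, state_dim=12, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)   # scalar estimate per state
```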


Author(s):  
Hima Keerthi Sagiraju ◽  
Shashi Mogalla

Designing trading strategies that maximize profit by tracking and responding to dynamic stock market variations is a complex task. This paper proposes to use a multilayer perceptron (a type of artificial neural network (ANN)) to deploy deep reinforcement learning strategies that learn to predict and analyze stock market products with the aim of maximizing profit. We trained a deep reinforcement learning agent using four algorithms: proximal policy optimization (PPO), deep Q-learning (DQN), deep deterministic policy gradient (DDPG), and advantage actor-critic (A2C). The proposed system, comprising these algorithms, is tested using real-time stock data for two products: the Dow Jones index (DJIA) and Qualcomm shares. The performance of the agent under each algorithm was evaluated, compared, and analyzed using the Sharpe ratio, Sortino ratio, skew, and kurtosis, leading to the most effective algorithm being chosen. Based on these metrics, the algorithm that maximizes profit for the respective financial product was determined. We also extended the same approach to assess the predictive performance of the algorithms when trading under highly volatile scenarios, such as the coronavirus disease 2019 (COVID-19) pandemic.
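A sketch of the evaluation side only: computing the Sharpe and Sortino ratios, skew, and kurtosis from a series of periodic returns. The annualization factor and risk-free rate are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def evaluate_returns(returns, risk_free=0.0, periods_per_year=252):
    """Summarize a return series with the four metrics used for comparison."""
    r = np.asarray(returns, dtype=float) - risk_free
    sharpe = np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)
    downside = r[r < 0]
    downside_std = downside.std(ddof=1) if downside.size > 1 else np.nan
    sortino = np.sqrt(periods_per_year) * r.mean() / downside_std
    return {"sharpe": sharpe,
            "sortino": sortino,
            "skew": stats.skew(r),
            "kurtosis": stats.kurtosis(r)}
```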


Author(s):  
Emmanuel Ifeanyi Iroegbu ◽  
Devaraj Madhavi

Deep reinforcement learning has been successful in solving common autonomous driving tasks such as lane-keeping by simply using pixel data from the front-view camera as input. However, raw pixel data constitutes a very high-dimensional observation that degrades the learning quality of the agent due to the complexity imposed by a 'realistic' urban environment. We therefore investigate how compressing the raw pixel data from a high-dimensional state to a low-dimensional latent space offline, using a variational autoencoder, can significantly improve the training of a deep reinforcement learning agent. We evaluated our method on a simulated autonomous vehicle in CARLA (Car Learning to Act) and compared our results with several baselines, including deep deterministic policy gradient, proximal policy optimization, and soft actor-critic. The results show that the method greatly accelerates training and yields a marked improvement in the quality of the deep reinforcement learning agent.
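A hedged sketch of the kind of convolutional VAE encoder the abstract describes, trained offline to compress front-camera frames into a low-dimensional latent vector that then serves as the RL agent's observation. Layer sizes and the input resolution are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # e.g. 3x96x96 input (assumed)
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Flatten())
        self.mu = nn.LazyLinear(latent_dim)        # mean of q(z|x)
        self.logvar = nn.LazyLinear(latent_dim)    # log-variance of q(z|x)

    def forward(self, frame):
        h = self.conv(frame)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return z, mu, logvar

# At RL training time only the (frozen) encoder would be used, e.g.:
#   latent_obs, _, _ = encoder(frame)   # low-dimensional state fed to DDPG/PPO/SAC
```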


2021 ◽  
Author(s):  
Daniel Bennett ◽  
Yael Niv ◽  
Angela Langdon

Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a behavioral policy directly, without the intermediate step of value learning. Here we review recent behavioral and neural findings that are more parsimoniously explained by policy-gradient models than by value-based models. We conclude that, despite the ubiquity of 'value' in reinforcement-learning models of decision making, policy-gradient models provide a lightweight and compelling alternative model of operant behavior.
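A toy sketch of the distinction the review draws: a policy-gradient (REINFORCE) learner on a two-armed bandit updates action preferences directly from reward, with no intermediate value estimates. The reward probabilities and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p_reward = [0.3, 0.7]          # assumed reward probability of each action
theta = np.zeros(2)            # action preferences (policy parameters)
alpha = 0.1                    # learning rate

for _ in range(2000):
    policy = np.exp(theta) / np.exp(theta).sum()   # softmax over preferences
    a = rng.choice(2, p=policy)
    r = float(rng.random() < p_reward[a])
    grad = -policy                                  # d log pi(a) / d theta for softmax
    grad[a] += 1.0
    theta += alpha * r * grad                       # REINFORCE update: no values learned

print(np.exp(theta) / np.exp(theta).sum())  # policy typically concentrates on the better action
```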

