Reward-Weighted Regression with Sample Reuse for Direct Policy Search in Reinforcement Learning

Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems. Policy search often requires a large number of samples for obtaining a stable policy update estimator, and this is prohibitive when the sampling cost is expensive. In this letter, we extend an expectation-maximization-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, reward-weighted regression with sample reuse (R[Formula: see text]), is demonstrated through robot learning experiments. (This letter is an extended version of our earlier conference paper: Hachiya, Peters, & Sugiyama, 2009 .)

Download Full-text

Direct Policy Search Reinforcement Learning Based on Variational Bayesian Inference

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2020.p0711 ◽

2020 ◽

Vol 24 (6) ◽

pp. 711-718

Author(s):

Nobuhiko Yamaguchi ◽

Keyword(s):

Reinforcement Learning ◽

Bayesian Inference ◽

Expectation Maximization ◽

Weighted Regression ◽

High Dimensional ◽

Policy Search ◽

Variational Bayesian ◽

Learning Framework ◽

Average Return ◽

Variational Bayesian Inference

Direct policy search is a promising reinforcement learning framework particularly for controlling continuous, high-dimensional systems. Peters et al. proposed reward-weighted regression (RWR) as a direct policy search. The RWR algorithm estimates the policy parameter based on the expectation-maximization (EM) algorithm and is therefore prone to overfitting. In this study, we focus on variational Bayesian inference to avoid overfitting and propose direct policy search reinforcement learning based on variational Bayesian inference (VBRL). The performance of the proposed VBRL is assessed in several experiments involving a mountain car and a ball batting task. These experiments demonstrate that VBRL yields a higher average return and outperforms the RWR.

Download Full-text

A Policy Search Method For Temporal Logic Specified Reinforcement Learning Tasks

2018 Annual American Control Conference (ACC) ◽

10.23919/acc.2018.8431181 ◽

2018 ◽

Cited By ~ 9

Author(s):

Xiao Li ◽

Yao Ma ◽

Calin Belta

Keyword(s):

Reinforcement Learning ◽

Temporal Logic ◽

Search Method ◽

Policy Search ◽

Learning Tasks

Download Full-text

Nonconvex Policy Search Using Variational Inequalities

Neural Computation ◽

10.1162/neco_a_01004 ◽

2017 ◽

Vol 29 (10) ◽

pp. 2800-2824

Author(s):

Yusen Zhan ◽

Haitham Bou Ammar ◽

Matthew E. Taylor

Keyword(s):

Reinforcement Learning ◽

Dynamical Systems ◽

Variational Inequalities ◽

Learning Algorithms ◽

High Dimensional ◽

Control Problems ◽

Policy Search ◽

Optimal Policies ◽

Stochastic Setting ◽

First Time

Policy search is a class of reinforcement learning algorithms for finding optimal policies in control problems with limited feedback. These methods have been shown to be successful in high-dimensional problems such as robotics control. Though successful, current methods can lead to unsafe policy parameters that potentially could damage hardware units. Motivated by such constraints, we propose projection-based methods for safe policies. These methods, however, can handle only convex policy constraints. In this letter, we propose the first safe policy search reinforcement learner capable of operating under nonconvex policy constraints. This is achieved by observing, for the first time, a connection between nonconvex variational inequalities and policy search problems. We provide two algorithms, Mann and two-step iteration, to solve the above problems and prove convergence in the nonconvex stochastic setting. Finally, we demonstrate the performance of the algorithms on six benchmark dynamical systems and show that our new method is capable of outperforming previous methods under a variety of settings.

Download Full-text

HRLB⌃2: A Reinforcement Learning Based Framework for Believable Bots

Applied Sciences ◽

10.3390/app8122453 ◽

2018 ◽

Vol 8 (12) ◽

pp. 2453 ◽

Cited By ~ 5

Author(s):

Christian Arzate Cruz ◽

Jorge Ramirez Uresti

Keyword(s):

Reinforcement Learning ◽

High Dimensional ◽

State Action ◽

Hierarchical Reinforcement Learning ◽

Learning Framework ◽

Novel Approach ◽

The Creation ◽

Action Spaces ◽

Human Player

The creation of believable behaviors for Non-Player Characters (NPCs) is key to improve the players’ experience while playing a game. To achieve this objective, we need to design NPCs that appear to be controlled by a human player. In this paper, we propose a hierarchical reinforcement learning framework for believable bots (HRLB⌃2). This novel approach has been designed so it can overcome two main challenges currently faced in the creation of human-like NPCs. The first difficulty is exploring domains with high-dimensional state–action spaces, while satisfying constraints imposed by traits that characterize human-like behavior. The second problem is generating behavior diversity, by also adapting to the opponent’s playing style. We evaluated the effectiveness of our framework in the domain of the 2D fighting game named Street Fighter IV. The results of our tests demonstrate that our bot behaves in a human-like manner.

Download Full-text

Expectation-maximization Policy Search with Parameter-based Exploration

ACTA AUTOMATICA SINICA ◽

10.3724/sp.j.1004.2012.00038 ◽

2012 ◽

Vol 38 (1) ◽

pp. 38-45

Author(s):

Yu-Hu CHENG ◽

Huan-Ting FENG ◽

Xue-Song WANG

Keyword(s):

Expectation Maximization ◽

Policy Search

Download Full-text

Hierarchical Reinforcement Learning Framework for Secure UAV Communication in the Presence of Multiple UAV Adaptive Eavesdroppers

2020 IEEE 6th International Conference on Computer and Communications (ICCC) ◽

10.1109/iccc51575.2020.9344970 ◽

2020 ◽

Author(s):

Liu Jue ◽

Yang Weiwei

Keyword(s):

Reinforcement Learning ◽

Hierarchical Reinforcement Learning ◽

Learning Framework

Download Full-text

Reinforcement Learning based Evolutionary Metric Filtering for High Dimensional Problems

2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA) ◽

10.1109/icmla51294.2020.00045 ◽

2020 ◽

Author(s):

Bassel Ali ◽

Koichi Moriyama ◽

Masayuki Numao ◽

Ken-ichi Fukui

Keyword(s):

Reinforcement Learning ◽

High Dimensional

Download Full-text

A Reinforcement Learning Framework for Spiking Networks with Dynamic Synapses

Computational Intelligence and Neuroscience ◽

10.1155/2011/869348 ◽

2011 ◽

Vol 2011 ◽

pp. 1-12 ◽

Cited By ~ 3

Author(s):

Karim El-Laithy ◽

Martin Bogdan

Keyword(s):

Reinforcement Learning ◽

Spike Timing ◽

Neural Representation ◽

Model Parameters ◽

Learning Framework ◽

Reference Target ◽

Wide Range ◽

Spiking Network ◽

Dynamic Synapses ◽

Exclusive Or

An integration of both the Hebbian-based and reinforcement learning (RL) rules is presented for dynamic synapses. The proposed framework permits the Hebbian rule to update the hidden synaptic model parameters regulating the synaptic response rather than the synaptic weights. This is performed using both the value and the sign of the temporal difference in the reward signal after each trial. Applying this framework, a spiking network with spike-timing-dependent synapses is tested to learn the exclusive-OR computation on a temporally coded basis. Reward values are calculated with the distance between the output spike train of the network and a reference target one. Results show that the network is able to capture the required dynamics and that the proposed framework can reveal indeed an integrated version of Hebbian and RL. The proposed framework is tractable and less computationally expensive. The framework is applicable to a wide class of synaptic models and is not restricted to the used neural representation. This generality, along with the reported results, supports adopting the introduced approach to benefit from the biologically plausible synaptic models in a wide range of intuitive signal processing.

Download Full-text

Adaptive Reinforcement Learning Framework for NOMA-UAV Networks

IEEE Communications Letters ◽

10.1109/lcomm.2021.3093385 ◽

2021 ◽

pp. 1-1

Author(s):

Syed Khurram Mahmud ◽

Yuanwei Liu ◽

Yue Chen ◽

Kok Keong Chai

Keyword(s):

Reinforcement Learning ◽

Learning Framework

Download Full-text

Risky Driver Recognition with Class Imbalance Data and Automated Machine Learning Framework

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18147534 ◽

2021 ◽

Vol 18 (14) ◽

pp. 7534

Author(s):

Ke Wang ◽

Qingwen Xue ◽

Jian John Lu

Keyword(s):

Machine Learning ◽

High Risk ◽

Loss Function ◽

Class Imbalance ◽

Support Vector ◽

Trajectory Data ◽

Recognition Model ◽

Learning Framework ◽

Sampling Cost ◽

Automated Machine Learning

Identifying high-risk drivers before an accident happens is necessary for traffic accident control and prevention. Due to the class-imbalance nature of driving data, high-risk samples as the minority class are usually ill-treated by standard classification algorithms. Instead of applying preset sampling or cost-sensitive learning, this paper proposes a novel automated machine learning framework that simultaneously and automatically searches for the optimal sampling, cost-sensitive loss function, and probability calibration to handle class-imbalance problem in recognition of risky drivers. The hyperparameters that control sampling ratio and class weight, along with other hyperparameters, are optimized by Bayesian optimization. To demonstrate the performance of the proposed automated learning framework, we establish a risky driver recognition model as a case study, using video-extracted vehicle trajectory data of 2427 private cars on a German highway. Based on rear-end collision risk evaluation, only 4.29% of all drivers are labeled as risky drivers. The inputs of the recognition model are the discrete Fourier transform coefficients of target vehicle’s longitudinal speed, lateral speed, and the gap between the target vehicle and its preceding vehicle. Among 12 sampling methods, 2 cost-sensitive loss functions, and 2 probability calibration methods, the result of automated machine learning is consistent with manual searching but much more computation-efficient. We find that the combination of Support Vector Machine-based Synthetic Minority Oversampling TEchnique (SVMSMOTE) sampling, cost-sensitive cross-entropy loss function, and isotonic regression can significantly improve the recognition ability and reduce the error of predicted probability.

Download Full-text