Model-Based Reinforcement Learning for Infinite-Horizon Discounted Constrained Markov Decision Processes

Author(s): Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai

In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) the GM-CRL algorithm, which has access to a generative model, and (ii) the UC-CRL algorithm, which learns the model using an upper-confidence-style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.
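For reference, the following minimal sketch (ours, not the authors' GM-CRL or UC-CRL algorithms) illustrates the two quantities the CMDP formulation balances: the discounted return to be maximized and the discounted cumulative cost that must stay below a budget, here estimated by querying an assumed generative model. The names generative_model and policy are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: Monte Carlo estimates of the discounted return and the discounted
# cumulative safety cost of a fixed policy, using a generative model that can be queried
# at arbitrary (state, action) pairs. Not the authors' code; names and shapes are assumed.
def evaluate_policy(generative_model, policy, s0, gamma=0.99, horizon=200, n_rollouts=500):
    returns, costs = [], []
    for _ in range(n_rollouts):
        s, ret, cost = s0, 0.0, 0.0
        for t in range(horizon):
            a = policy(s)
            s, r, c = generative_model(s, a)   # sampled next state, reward, safety cost
            ret += (gamma ** t) * r
            cost += (gamma ** t) * c
        returns.append(ret)
        costs.append(cost)
    return np.mean(returns), np.mean(costs)

# The CMDP then asks: maximize the first estimate subject to the second staying
# below a prescribed cost budget.
```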

2016, Vol. 138 (6)
Author(s): Thai Duong, Duong Nguyen-Huu, Thinh Nguyen

The Markov decision process (MDP) is a well-known framework for devising optimal decision-making strategies under uncertainty. Typically, the decision maker assumes a stationary environment characterized by a time-invariant transition probability matrix. However, in many real-world scenarios this assumption is not justified, so the optimal strategy might not provide the expected performance. In this paper, we study the performance of the classic value iteration algorithm for solving an MDP problem under nonstationary environments. Specifically, the nonstationary environment is modeled as a sequence of time-variant transition probability matrices governed by an adiabatic evolution inspired by quantum mechanics. We characterize the performance of the value iteration algorithm subject to the rate of change of the underlying environment. The performance is measured in terms of the convergence rate to the optimal average reward. We show two examples of queuing systems that make use of our analysis framework.
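The setting can be pictured with a small sketch (ours, under assumed array shapes and a toy drift schedule, not the paper's analysis): value iteration keeps performing Bellman backups while the transition matrix slowly interpolates between two regimes.

```python
import numpy as np

# Illustrative only: P has shape (A, S, S), R has shape (S, A); P0, P1, and the linear
# interpolation schedule are assumptions standing in for an adiabatically changing model.
def bellman_backup(V, P, R, gamma=0.95):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum('asj,j->sa', P, V)
    return Q.max(axis=1), Q.argmax(axis=1)

def track_nonstationary_mdp(P0, P1, R, T=500, gamma=0.95):
    V = np.zeros(R.shape[0])
    for t in range(T):
        alpha = t / (T - 1)                    # slow, monotone drift from P0 to P1
        P_t = (1.0 - alpha) * P0 + alpha * P1
        V, policy = bellman_backup(V, P_t, R, gamma)
    return V, policy
```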


2021, Vol. 53 (1), pp. 91-97
Author(s): Olga N. Vybornova, Aleksander N. Ryzhikov
We analyzed the urgency of the task of creating a more efficient (compared to analogues) means of automated vulnerability search based on modern technologies. We have shown the similarity of the vulnerability identification process to a Markov decision process and justified the feasibility of using reinforcement learning technology for solving this problem. Since the analysis of web application security is currently the highest priority and in demand, within the framework of this work the application of the mathematical apparatus of reinforcement learning to this subject area is considered. The mathematical model is presented, and the specifics of the training and testing processes for the problem of automated vulnerability search in web applications are described. Based on an analysis of the OWASP Testing Guide, an action space and a set of environment states are identified. The characteristics of the software implementation of the proposed model are described: Q-learning is implemented in the Python programming language, and a neural network was created to implement the learning policy using the TensorFlow library. We demonstrated the results of the reinforcement learning agent on a real web application, as well as their comparison with the report of the Acunetix Vulnerability Scanner. The findings indicate that the proposed solution is promising.
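As an illustration of the kind of implementation described (a sketch under our own assumptions, not the authors' code), a small TensorFlow network can approximate Q-values over a discretized action space derived from the OWASP Testing Guide; the state encoding, feature and action counts, and environment interface below are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Hypothetical DQN-style setup: a neural network approximates Q(state, action) over a
# finite set of testing actions. Sizes and the state encoding are assumptions.
N_STATE_FEATURES = 16      # assumed length of the encoded web-application state
N_ACTIONS = 10             # assumed number of testing actions (e.g., payload families)

q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(N_STATE_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS),          # one Q-value per action
])
q_net.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

def q_learning_step(state, action, reward, next_state, done, gamma=0.9):
    # Standard one-step Q-learning target: r + gamma * max_a' Q(s', a')
    q_values = q_net.predict(state[None, :], verbose=0)[0]
    next_q = q_net.predict(next_state[None, :], verbose=0)[0]
    target = reward if done else reward + gamma * np.max(next_q)
    q_values[action] = target
    q_net.fit(state[None, :], q_values[None, :], verbose=0)
```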


2021
Author(s): Xinglong Zhang, Yaoqian Peng, Biao Luo, Wei Pan, Xin Xu, ...

Recently, barrier function-based safe reinforcement learning (RL) with the actor-critic structure for continuous control tasks has received increasing attention. It is still challenging to learn a near-optimal control policy with safety and convergence guarantees. Also, few works have addressed safe RL algorithm design under time-varying safety constraints. This paper proposes a model-based safe RL algorithm for optimal control of nonlinear systems with time-varying state and control constraints. In the proposed approach, we construct a novel barrier-based control policy structure that can guarantee control safety. A multi-step policy evaluation mechanism is proposed to predict the policy's safety risk under time-varying safety constraints and guide the policy to update safely. Theoretical results on stability and robustness are proven, and the convergence of the actor-critic learning algorithm is analyzed. The proposed algorithm outperforms several state-of-the-art RL algorithms in the simulated Safety Gym environment. Furthermore, the approach is applied to the integrated path following and collision avoidance problem for two real-world intelligent vehicles. A differential-drive vehicle and an Ackermann-drive one are used to verify the offline deployment performance and the online learning performance, respectively. Our approach shows an impressive sim-to-real transfer capability and a satisfactory online control performance in the experiment.
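To make the multi-step safety idea concrete, here is a minimal sketch (ours, with an assumed model, barrier function, and horizon; not the paper's algorithm): candidate actions are rolled forward through a model and a time-varying barrier h(x, t) >= 0 is evaluated along the predicted trajectory before a policy update is accepted.

```python
import numpy as np

# Illustrative sketch: roll a nominal model forward and report the worst-case value of a
# time-varying barrier function along the prediction. dynamics, h, and dt are assumptions.
def predicted_safety_risk(x0, actions, dynamics, h, t0, dt=0.1):
    """Return the worst-case barrier value over a short prediction horizon."""
    x, worst = np.array(x0, dtype=float), np.inf
    for k, u in enumerate(actions):
        x = dynamics(x, u)                              # one-step model prediction
        worst = min(worst, h(x, t0 + (k + 1) * dt))     # h >= 0 means "inside the safe set"
    return worst                                        # negative value flags a predicted violation

# Toy example: keep a point robot inside a moving corridor |x[0] - c(t)| <= 1
corridor_center = lambda t: 0.5 * np.sin(t)
h = lambda x, t: 1.0 - abs(x[0] - corridor_center(t))
dynamics = lambda x, u: x + 0.1 * np.asarray(u)
risk = predicted_safety_risk([0.0, 0.0], [[0.1, 0.0]] * 5, dynamics, h, t0=0.0)
```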


2020, Vol. 13 (4), pp. 78
Author(s): Nico Zengeler, Uwe Handmann

We present a deep reinforcement learning framework for automatic trading of contracts for difference (CfD) on indices at high frequency. We show that reinforcement learning agents with recurrent long short-term memory (LSTM) networks can learn from recent market history and outperform the market. Usually, such approaches depend on low latency; in a real-world example, we show that an increased model size may compensate for higher latency. As the noisy nature of economic trends complicates predictions, especially in speculative assets, our approach does not predict prices but instead uses a reinforcement learning agent to learn an overall lucrative trading policy. To this end, we simulate a virtual market environment based on historical trading data. Our environment provides a partially observable Markov decision process (POMDP) to reinforcement learners and allows the training of various strategies.
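A minimal, hypothetical sketch (not the authors' architecture) of such a recurrent policy: an LSTM reads a window of recent market observations and emits preferences over a small discrete action set. The window length, feature count, and action set below are assumptions.

```python
import tensorflow as tf

# Assumed dimensions for illustration only.
WINDOW, N_FEATURES, N_ACTIONS = 64, 8, 3   # actions: e.g., go long, go short, stay flat

policy_net = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(WINDOW, N_FEATURES)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="softmax"),   # action probabilities
])

# Under partial observability the agent acts on the observation window alone; hidden
# market state (e.g., other participants' orders) is never revealed, which is what makes
# the simulated environment a POMDP rather than a fully observed MDP.
```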


Author(s): Maximilian Ororbia, Gordon P. Warn

Abstract This paper presents a framework that mathematically models optimal design synthesis as a Markov Decision Process (MDP) that is solved with reinforcement learning. In this context, the states correspond to specific design configurations, the actions correspond to the available alterations modeled after generative design grammars, and the immediate rewards are constructed to reflect the improvement in the altered configuration's performance with respect to the design objective. Since in the context of optimal design synthesis the immediate rewards are in general not known at the onset of the process, reinforcement learning is employed to efficiently solve the MDP. The goal of the reinforcement learning agent is to maximize the cumulative reward and hence synthesize the best-performing, or optimal, design. The framework is demonstrated for the optimization of planar trusses with binary cross-sectional areas, and its utility is investigated with four numerical examples, each with a unique combination of domain, constraint, and external force(s), considering both linear-elastic and elastic-plastic material behaviors. The design solutions obtained with the framework are also compared with other methods in order to demonstrate its efficiency and accuracy.
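The reward construction can be pictured with a short sketch (hypothetical names and metric; not the authors' exact formulation): an alteration drawn from the design grammar is applied to the current configuration and rewarded by the change it produces in the design objective, so that maximizing the cumulative reward drives the agent toward the best-performing design.

```python
# Hypothetical illustration of the MDP ingredients described above. `alteration` is one
# generative-grammar rule, `evaluate` scores a configuration against the design objective
# (e.g., structural performance under the stated constraints); both are assumed callables.
def immediate_reward(config, performance, alteration, evaluate):
    """Reward an alteration by the improvement it yields in the design objective."""
    new_config = alteration(config)
    new_performance = evaluate(new_config)
    return new_performance - performance, new_config, new_performance
```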


2017, Vol. 28 (3), pp. 753-758
Author(s): Rushikesh Kamalapurkar, Lindsey Andrews, Patrick Walters, Warren E. Dixon

2018, Vol. 9 (1), pp. 277-294
Author(s): Rupam Bhattacharyya, Shyamanta M. Hazarika

Abstract Within human Intent Recognition (IR), a popular approach to learning from demonstration is Inverse Reinforcement Learning (IRL). IRL extracts an unknown reward function from samples of observed behaviour. Traditional IRL systems require large datasets to recover the underlying reward function. Object affordances have been used for IR, but the existing literature on recognizing intents through object affordances falls short of utilizing their true potential. In this paper, we seek to develop an IRL system that drives human intent recognition and can handle high-dimensional demonstrations by exploiting object affordances. An architecture for recognizing human intent is presented which consists of an extended Maximum Likelihood Inverse Reinforcement Learning agent. Inclusion of a Symbolic Conceptual Abstraction Engine (SCAE) along with an advisor allows the agent to work on a Conceptually Abstracted Markov Decision Process. The agent recovers an object-affordance-based reward function from high-dimensional demonstrations. This function drives a Human Intent Recognizer through identification of probable intents. Performance of the resulting system on the standard CAD-120 dataset shows encouraging results.
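The reward-recovery step can be summarized with a schematic sketch (ours, simplified; maximum-likelihood and maximum-entropy style IRL share this gradient structure): the reward is taken to be linear in affordance-based features, and its weights are moved toward making the demonstrated behaviour likely. The helper expected_feature_counts, which would come from a planner that is (soft-)optimal for the current reward, is an assumption.

```python
import numpy as np

# Schematic only: not the paper's extended ML-IRL agent.
def reward(theta, phi):
    """Linear reward r(s, a) = theta . phi(s, a), with phi encoding object affordances."""
    return float(np.dot(theta, phi))

def irl_weight_update(theta, demo_feature_counts, expected_feature_counts, lr=0.1):
    # Likelihood-style gradient: features seen in demonstrations minus features the
    # current reward's policy would visit; the update vanishes when the two match.
    grad = demo_feature_counts - expected_feature_counts(theta)
    return theta + lr * grad
```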


Author(s): Giorgio Eduardo Montanari, Marco Doretti, Maria Francesca Marino

Abstract In this paper, an ordinal multilevel latent Markov model based on separate random effects is proposed. In detail, two distinct second-level discrete effects are considered in the model, one affecting the initial probability vector and the other affecting the transition probability matrix of the first-level ordinal latent Markov process. To model these separate effects, we consider a bi-dimensional mixture specification that makes it possible to avoid unverifiable assumptions on the random effect distribution and to derive a two-way clustering of second-level units. Starting from a general model where the two random effects are dependent, we also obtain the independence model as a special case. The proposal is applied to data on the physical health status of a sample of elderly residents grouped into nursing homes. A simulation study assessing the performance of the proposal is also included.
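In schematic notation (ours, not the authors'), the two separate effects enter the latent chain roughly as follows, with u_i^(1) and u_i^(2) the second-level discrete random effects for cluster i, and h, k indexing latent states of first-level unit j:

```latex
% Schematic sketch only; the exact parameterization is in the paper.
\Pr\bigl(U_{ij1} = h \mid u_i^{(1)}\bigr) = \pi_{h}\bigl(u_i^{(1)}\bigr), \qquad
\Pr\bigl(U_{ijt} = k \mid U_{ij,t-1} = h,\; u_i^{(2)}\bigr) = \pi_{hk}\bigl(u_i^{(2)}\bigr), \quad t > 1,
```

where (u_i^(1), u_i^(2)) follows a bi-dimensional discrete mixture distribution; the independence model corresponds to the special case in which this joint distribution factorizes.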

