Efficient PAC Reinforcement Learning in Regular Decision Processes

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/279 ◽

2021 ◽

Author(s):

Alessandro Ronca ◽

Giuseppe De Giacomo

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Polynomial Time ◽

Optimal Policy ◽

Decision Process ◽

Transition Function ◽

Decision Processes ◽

Reward Function ◽

Markov Decision ◽

Reward Functions

Recently, regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice, both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and that it reasonably captures the difficulty of a regular decision process.
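
To illustrate what a reward function that depends on the history "regularly" might look like, here is a minimal sketch of a finite transducer (a Mealy machine) that emits a reward at each step. The states, symbols and reward values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Tuple

State = str
Symbol = str

# Hypothetical two-state transducer: the state remembers whether a "key"
# observation has occurred anywhere earlier in the history.
TRANSITIONS: Dict[Tuple[State, Symbol], State] = {
    ("no_key", "key"): "has_key",
    ("no_key", "door"): "no_key",
    ("no_key", "other"): "no_key",
    ("has_key", "key"): "has_key",
    ("has_key", "door"): "has_key",
    ("has_key", "other"): "has_key",
}

# Reward emitted on (state, symbol); any pair not listed yields 0.0.
REWARDS: Dict[Tuple[State, Symbol], float] = {
    ("has_key", "door"): 1.0,  # the door pays off only after a key was seen
}


def run_transducer(history: List[Symbol], start: State = "no_key") -> List[float]:
    """Return the reward emitted at each step of the given history."""
    state = start
    rewards: List[float] = []
    for symbol in history:
        rewards.append(REWARDS.get((state, symbol), 0.0))
        state = TRANSITIONS[(state, symbol)]
    return rewards
```

The reward for "door" is non-Markov with respect to the current observation alone, yet it is determined by a finite amount of memory about the history, which is the sense in which the dependence is regular.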


FLOW SHOP SCHEDULING WITH REINFORCEMENT LEARNING

Asia Pacific Journal of Operational Research ◽

10.1142/s0217595913500140 ◽

2013 ◽

Vol 30 (05) ◽

pp. 1350014 ◽

Cited By ~ 2

Author(s):

ZHICONG ZHANG ◽

WEIPING WANG ◽

SHOUYAN ZHONG ◽

KAISHUN HU

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Large Scale ◽

Flow Shop ◽

Flow Shop Scheduling ◽

Scheduling Problems ◽

Shop Scheduling ◽

Reward Function ◽

Markov Decision

Reinforcement learning (RL) is a state or action value based machine learning method which solves large-scale multi-stage decision problems such as Markov Decision Process (MDP) and Semi-Markov Decision Process (SMDP) problems. We minimize the makespan of flow shop scheduling problems with an RL algorithm. We convert flow shop scheduling problems into SMDPs by constructing elaborate state features, actions and the reward function. Minimizing the accumulated reward is equivalent to minimizing the schedule objective function. We apply the on-line TD(λ) algorithm with linear gradient-descent function approximation to solve the SMDPs. To examine the performance of the proposed RL algorithm, computational experiments are conducted on benchmark problems in comparison with other scheduling methods. The experimental results support the efficiency of the proposed algorithm and illustrate that the RL approach is a promising computational approach for flow shop scheduling problems, worthy of further investigation.
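
As a rough illustration of on-line TD(λ) with a linear value function, the sketch below performs one episode of updates using accumulating eligibility traces. It is a simplified textbook form for state values in an ordinary MDP, not the authors' SMDP formulation; the feature vectors, step size and all names are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# One transition: (features of s, reward, features of s', episode finished?)
Transition = Tuple[Sequence[float], float, Sequence[float], bool]


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear value estimate v(s) = w . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def td_lambda_episode(
    weights: Sequence[float],
    episode: List[Transition],
    alpha: float = 0.1,   # step size (assumed value)
    gamma: float = 1.0,   # discount factor
    lam: float = 0.8,     # trace decay parameter lambda
) -> List[float]:
    """Run on-line TD(lambda) over one episode and return the new weights."""
    w = list(weights)
    trace = [0.0] * len(w)  # eligibility trace, reset at episode start
    for phi, reward, phi_next, done in episode:
        v = dot(w, phi)
        v_next = 0.0 if done else dot(w, phi_next)
        delta = reward + gamma * v_next - v  # TD error
        # Accumulating trace: decay, then add the gradient of v(s), i.e. phi(s).
        trace = [gamma * lam * e + x for e, x in zip(trace, phi)]
        w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w
```

With λ = 1 and γ = 1, a reward received at the end of an episode is credited to every state visited earlier, which is the behaviour the two-step usage below exercises: `td_lambda_episode([0.0, 0.0], [([1.0, 0.0], 0.0, [0.0, 1.0], False), ([0.0, 1.0], 1.0, [0.0, 0.0], True)], alpha=0.5, lam=1.0)`.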


Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/919 ◽

2019 ◽

Author(s):

Thiago D. Simão

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Learning Algorithms ◽

Transition Function ◽

High Confidence ◽

Policy Improvement ◽

Markov Decision ◽

Improvement Method ◽

Better Than

Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy π is already in execution and the experiences with the environment were recorded in a batch D, an RL algorithm can use D to compute a new policy π'. However, the policy computed by traditional RL algorithms might have worse performance compared to π. Our goal is to develop safe RL algorithms, where the agent has a high confidence that the performance of π' is better than the performance of π given D. To develop sample-efficient and safe RL algorithms we combine ideas from exploration strategies in RL with a safe policy improvement method.


Adaptive control of M/M/1 queues—continuous-time Markov decision process approach

Journal of Applied Probability ◽

10.1017/s0021900200023512 ◽

1983 ◽

Vol 20 (02) ◽

pp. 368-379

Author(s):

Lam Yeh ◽

L. C. Thomas

Keyword(s):

Adaptive Control ◽

Markov Decision Process ◽

Markov Decision Processes ◽

Optimal Policy ◽

Continuous Time ◽

Decision Process ◽

Process Approach ◽

Decision Processes ◽

Markov Decision ◽

Discounted Costs

By considering continuous-time Markov decision processes where decisions can be made at any time, we show in the case of M/M/1 queues with discounted costs that there exists a monotone optimal policy among all the regular policies.


Adaptive control of M/M/1 queues—continuous-time Markov decision process approach

Journal of Applied Probability ◽

10.2307/3213809 ◽

1983 ◽

Vol 20 (2) ◽

pp. 368-379 ◽

Cited By ~ 6

Author(s):

Lam Yeh ◽

L. C. Thomas

Keyword(s):

Adaptive Control ◽

Markov Decision Process ◽

Markov Decision Processes ◽

Optimal Policy ◽

Continuous Time ◽

Decision Process ◽

Process Approach ◽

Decision Processes ◽

Markov Decision ◽

Discounted Costs


A Moreau-Yosida regularization for Markov decision processes

Proyecciones (Antofagasta) ◽

10.22199/issn.0717-6279-2021-01-0008 ◽

2020 ◽

Vol 40 (1) ◽

pp. 117-137

Author(s):

R. Israel Ortega-Gutiérrez ◽

H. Cruz-Suárez

Keyword(s):

Markov Decision Process ◽

Markov Decision Processes ◽

Optimal Policy ◽

Decision Process ◽

Value Function ◽

Decision Processes ◽

Original Process ◽

Optimal Value ◽

Markov Decision ◽

Yosida Regularization

This paper addresses a class of sequential optimization problems known as Markov decision processes. These processes are considered on Euclidean state and action spaces with the total expected discounted cost as the objective function. The main goal of the paper is to provide conditions that guarantee an adequate Moreau-Yosida regularization for Markov decision processes (named the original process). In this way, a new Markov decision process is established that conforms to the Markov control model of the original process except for the cost function, which is induced via the Moreau-Yosida regularization. Compared to the original process, this new discounted Markov decision process has richer properties, such as differentiability of its optimal value function, strict convexity of the value function, and uniqueness of the optimal policy; moreover, the optimal value function and the optimal policy of both processes are the same. To complement the theory presented, an example is provided.
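
For orientation, the standard Moreau-Yosida regularization (Moreau envelope) of a function can be written as follows. The abstract does not say exactly how the paper applies it to the cost function of the control model, so the symbols f, λ and the choice of the Euclidean norm below are the generic textbook ones, stated here as an assumption.

```latex
% Moreau-Yosida regularization of f : R^n -> (-infty, +infty] with parameter lambda > 0
f_{\lambda}(x) \;=\; \inf_{y \in \mathbb{R}^{n}}
  \left\{ f(y) + \frac{1}{2\lambda}\,\lVert x - y \rVert^{2} \right\}

% For a proper, lower semicontinuous, convex f:
%   f_lambda is convex and continuously differentiable,
%   inf_x f_lambda(x) = inf_x f(x),
%   and f_lambda has the same set of minimizers as f.
```

These textbook properties match the kind of result described above: the regularized problem is smoother while preserving the optimal value and the optimizer.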


Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/353 ◽

2017 ◽

Cited By ~ 7

Author(s):

Sanmit Narvekar ◽

Jivko Sinapov ◽

Peter Stone

Keyword(s):

Reinforcement Learning ◽

Transfer Learning ◽

Markov Decision Process ◽

Curriculum Design ◽

Optimal Policy ◽

Decision Process ◽

Improve Performance ◽

Task Sequencing ◽

Action Capabilities ◽

Markov Decision

Transfer learning is a method where an agent reuses knowledge learned in a source task to improve learning on a target task. Recent work has shown that transfer learning can be extended to the idea of curriculum learning, where the agent incrementally accumulates knowledge over a sequence of tasks (i.e. a curriculum). In most existing work, such curricula have been constructed manually. Furthermore, they are fixed ahead of time, and do not adapt to the progress or abilities of the agent. In this paper, we formulate the design of a curriculum as a Markov Decision Process, which directly models the accumulation of knowledge as an agent interacts with tasks, and propose a method that approximates an execution of an optimal policy in this MDP to produce an agent-specific curriculum. We use our approach to automatically sequence tasks for 3 agents with varying sensing and action capabilities in an experimental domain, and show that our method produces curricula customized for each agent that improve performance relative to learning from scratch or using a different agent's curriculum.


Inverse reinforcement learning in contextual MDPs

Machine Learning ◽

10.1007/s10994-021-05984-x ◽

2021 ◽

Author(s):

Stav Belogolovsky ◽

Philip Korsunsky ◽

Shie Mannor ◽

Chen Tessler ◽

Tom Zahavy

Keyword(s):

Reinforcement Learning ◽

Optimization Problem ◽

Decision Processes ◽

Inverse Reinforcement Learning ◽

Convex Optimization Problem ◽

Reward Function ◽

Dynamic Treatment Regime ◽

Markov Decision ◽

Dynamic Treatment ◽

Recorded Data

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.


An IoT based Smart Irrigation Management System using Reinforcement Learning modeled through a Markov Decision Process

10.1109/ds-rt52167.2021.9576130 ◽

2021 ◽

Author(s):

Luis Miguel Samaniego Campoverde ◽

Mauro Tropea ◽

Floriano De Rango

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Management System ◽

Irrigation Management ◽

Markov Decision


Hydrocarbon Field Re-Development as Markov Decision Process

10.2118/206041-ms ◽

2021 ◽

Author(s):

Martin Sieberer ◽

Torsten Clemens

Keyword(s):

Markov Decision Process ◽

Decision Process ◽

Risk Attitude ◽

Transition Function ◽

Monetary Value ◽

Economic Parameters ◽

Markov Decision ◽

Model Ensembles ◽

Hydrocarbon Field ◽

Expected Monetary Value

Hydrocarbon field (re-)development requires that a multitude of decisions are made under uncertainty. These decisions include the type and size of surface facilities and the location, configuration and number of wells, but also which data to acquire. Both types of decisions, which development to choose and which data to acquire, are strongly coupled. The aim of appraisal is to maximize value while minimizing data acquisition costs. These decisions have to be made under uncertainty owing to the inherent uncertainty of the subsurface but also of other costs and economic parameters. Conventional Value Of Information (VOI) evaluations can be used to determine how much can be spent to acquire data. However, VOI is very challenging to calculate for complex sequences of decisions with various costs and including the risk attitude of the decision maker. We use a fully observable Markov Decision Process (MDP) to determine the policy for the sequence and type of measurements and decisions to make. A fully observable MDP is characterised by the states (here: description of the system at a certain point in time), actions (here: measurements and development scenario), transition function (probabilities of transitioning from one state to the next), and rewards (costs for measurements, Expected Monetary Value (EMV) of development options). Solving the MDP gives the optimum policy, the sequence of the decisions, the Probability Of Maturation (POM) of a project, the Expected Monetary Value (EMV), the expected loss, the expected appraisal costs, and the Probability of Economic Success (PES). These key performance indicators can then be used to select, from a portfolio of projects, the ones generating the highest expected reward for the company. Combining the production forecasts from numerical model ensembles with probabilistic capital and operating expenditures and economic parameters allows for quantitative decision making under uncertainty.
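
The four ingredients listed above (states, actions, transition function, rewards) can be made concrete with a toy example solved by value iteration. This is only a sketch: the state names, probabilities and monetary values are invented for illustration and do not come from the paper, and the paper may solve its MDP by a different method.

```python
from typing import Dict, List, Tuple

# mdp[state][action] = list of (probability, next_state, reward)
MDP = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

# Hypothetical appraisal problem: develop now, or pay for a measurement first.
EXAMPLE: MDP = {
    "unknown": {
        "develop": [(0.5, "done", 10.0), (0.5, "done", -8.0)],
        "measure": [(0.5, "known_good", -1.0), (0.5, "known_bad", -1.0)],
    },
    "known_good": {"develop": [(1.0, "done", 10.0)], "abandon": [(1.0, "done", 0.0)]},
    "known_bad": {"develop": [(1.0, "done", -8.0)], "abandon": [(1.0, "done", 0.0)]},
    "done": {},  # terminal state: no actions
}


def q_value(outcomes: List[Tuple[float, str, float]], values: Dict[str, float],
            gamma: float) -> float:
    """Expected reward plus discounted value of the successor state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(mdp: MDP, gamma: float = 1.0, tol: float = 1e-9,
                    max_iter: int = 1000):
    """Return (state values, greedy policy) for a finite MDP."""
    values = {s: 0.0 for s in mdp}
    for _ in range(max_iter):
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:  # terminal states keep value 0
                continue
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    policy = {
        s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for s, actions in mdp.items() if actions
    }
    return values, policy
```

Calling `value_iteration(EXAMPLE)` on these made-up numbers shows the coupling between data acquisition and development: developing immediately has a lower expected value than paying for the measurement and then developing only in the favourable case.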


A Multi-Step Reinforcement Learning Algorithm

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.44-47.3611 ◽

2010 ◽

Vol 44-47 ◽

pp. 3611-3615 ◽

Cited By ~ 1

Author(s):

Zhi Cong Zhang ◽

Kai Shun Hu ◽

Hui Yu Huang ◽

Shuai Li ◽

Shao Yong Zhao

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Large Scale ◽

Learning Algorithm ◽

Machine Learning Method ◽

Learning Method ◽

K Value ◽

Markov Decision ◽

Action Value

Reinforcement learning (RL) is a state or action value based machine learning method which approximately solves large-scale Markov Decision Process (MDP) or Semi-Markov Decision Process (SMDP) problems. A multi-step RL algorithm called Sarsa(λ,k) is proposed, which is a compromise between Sarsa and Sarsa(λ). It is equivalent to Sarsa if k is 1 and equivalent to Sarsa(λ) if k is infinite. Sarsa(λ,k) adjusts its performance through the choice of k. Two forms of Sarsa(λ,k), forward-view Sarsa(λ,k) and backward-view Sarsa(λ,k), are constructed and proved equivalent in off-line updating.
