Solving Continual Combinatorial Selection via Deep Reinforcement Learning

We consider the Markov Decision Process (MDP) of selecting a subset of items at each step, termed the Select-MDP (S-MDP). The large state and action spaces of S-MDPs make them intractable to solve with typical reinforcement learning (RL) algorithms, especially when the number of items is huge. In this paper, we present a deep RL algorithm that addresses this issue through two key ideas. First, we convert the original S-MDP into an Iterative Select-MDP (IS-MDP), which is equivalent to the S-MDP in terms of optimal actions. The IS-MDP decomposes a joint action of selecting K items simultaneously into K iterative selections, which reduces the number of actions at the expense of an exponential increase in the number of states. Second, we overcome this state space explosion by exploiting a special symmetry in IS-MDPs with novel weight-shared Q-networks, which provably maintain sufficient expressive power. Various experiments demonstrate that our approach works well even when the item space is large and that it scales to environments with item spaces different from those used in training.
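The decomposition of one joint K-item action into K single-item selections can be illustrated with a minimal sketch. This is not the authors' implementation: `iterative_select`, `QFunction`, and `toy_q` are hypothetical names, and `toy_q` is a hand-written stand-in for a learned Q-network. Picking the best candidate at each sub-step matches the optimal joint action only if the scoring function is the optimal Q-function of the iterative problem, which a toy function does not guarantee.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scoring function type: given the items already selected and a
# candidate item, return an estimated value for adding that candidate.
QFunction = Callable[[Tuple[int, ...], int], float]


def iterative_select(items: Sequence[int], k: int, q: QFunction) -> List[int]:
    """Pick k items one at a time instead of choosing a k-subset jointly."""
    selected: List[int] = []
    remaining = list(items)
    for _ in range(k):
        # Each sub-step chooses among at most len(items) candidates,
        # rather than among all C(N, k) subsets at once.
        best = max(remaining, key=lambda item: q(tuple(selected), item))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_q(selected: Tuple[int, ...], candidate: int) -> float:
    # Toy stand-in for a learned Q-network (assumption): prefer large items,
    # but penalize candidates adjacent to an already selected item.
    penalty = sum(1.0 for s in selected if abs(s - candidate) == 1)
    return candidate - 10.0 * penalty


chosen = iterative_select(items=[1, 2, 3, 4, 5, 6], k=3, q=toy_q)
print(chosen)
```

The intermediate state here is the tuple of items selected so far, which is why the number of states grows as the number of actions per step shrinks.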


Planning without state space explosion: Petri net to Markov decision process

International Transactions in Operational Research ◽

10.1111/j.1475-3995.2009.00674.x ◽

2009 ◽

Vol 16 (2) ◽

pp. 243-255 ◽

Cited By ~ 2

Author(s):

Sanjeev Naguleswaran ◽

Langford B. White

Keyword(s):

State Space ◽

Markov Decision Process ◽

Petri Net ◽

Decision Process ◽

State Space Explosion ◽

Markov Decision


An IoT based Smart Irrigation Management System using Reinforcement Learning modeled through a Markov Decision Process

10.1109/ds-rt52167.2021.9576130 ◽

2021 ◽

Author(s):

Luis Miguel Samaniego Campoverde ◽

Mauro Tropea ◽

Floriano De Rango

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Management System ◽

Irrigation Management ◽

Markov Decision


A Multi-Step Reinforcement Learning Algorithm

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.44-47.3611 ◽

2010 ◽

Vol 44-47 ◽

pp. 3611-3615 ◽

Cited By ~ 1

Author(s):

Zhi Cong Zhang ◽

Kai Shun Hu ◽

Hui Yu Huang ◽

Shuai Li ◽

Shao Yong Zhao

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Large Scale ◽

Learning Algorithm ◽

Machine Learning Method ◽

Learning Method ◽

K Value ◽

Markov Decision ◽

Action Value

Reinforcement learning (RL) is a machine learning method based on state or action values that approximately solves large-scale Markov Decision Processes (MDPs) or Semi-Markov Decision Processes (SMDPs). A multi-step RL algorithm called Sarsa(λ,k) is proposed as a compromise between Sarsa and Sarsa(λ). It is equivalent to Sarsa if k is 1 and to Sarsa(λ) if k is infinite. Sarsa(λ,k) adjusts its performance through the value of k. Two forms of Sarsa(λ,k), forward-view Sarsa(λ,k) and backward-view Sarsa(λ,k), are constructed and proved equivalent in off-line updating.
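The abstract does not give the update rule, so the following is only one plausible forward-view reading: a λ-return truncated after k steps. The function names and the truncation formula are assumptions, chosen because they reproduce the two limiting cases stated above (the one-step Sarsa target at k = 1, and the usual λ-return as k grows).

```python
from typing import List


def n_step_return(rewards: List[float], q_values: List[float],
                  t: int, n: int, gamma: float) -> float:
    """n-step Sarsa target from time t.

    Assumes q_values[i] = Q(s_i, a_i) and len(q_values) == len(rewards) + 1,
    with the last entry 0.0 for the terminal state.
    """
    horizon = min(n, len(rewards) - t)  # do not run past the episode end
    g = sum(gamma ** i * rewards[t + i] for i in range(horizon))
    return g + gamma ** horizon * q_values[t + horizon]


def truncated_lambda_return(rewards: List[float], q_values: List[float],
                            t: int, lam: float, k: int, gamma: float) -> float:
    """Assumed form: lambda-weighted mix of 1..k-step returns, cut off at k."""
    mix = sum(lam ** (n - 1) * n_step_return(rewards, q_values, t, n, gamma)
              for n in range(1, k))
    tail = lam ** (k - 1) * n_step_return(rewards, q_values, t, k, gamma)
    return (1.0 - lam) * mix + tail


rewards = [1.0, 0.0, 2.0]
q_values = [0.5, 1.0, 1.5, 0.0]  # last entry: terminal state
one_step = truncated_lambda_return(rewards, q_values, t=0, lam=0.8, k=1, gamma=0.9)
long_horizon = truncated_lambda_return(rewards, q_values, t=0, lam=0.8, k=50, gamma=0.9)
```

With k = 1 the mixture is empty and the target reduces to r_0 + γ Q(s_1, a_1). Once k exceeds the remaining episode length, increasing k further no longer changes the target.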


Cooperative retransmissions using Markov decision process with reinforcement learning

2009 IEEE 20th International Symposium on Personal, Indoor and Mobile Radio Communications ◽

10.1109/pimrc.2009.5450098 ◽

2009 ◽

Cited By ~ 1

Author(s):

Ghasem Naddafzadeh Shirazi ◽

Peng-Yong Kong ◽

Chen-Khong Tham

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Markov Decision


A convergent recursive least squares approximate policy iteration algorithm for multi-dimensional Markov decision process with continuous state and action spaces

2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning ◽

10.1109/adprl.2009.4927527 ◽

2009 ◽

Cited By ~ 7

Author(s):

Jun Ma ◽

Warren B. Powell

Keyword(s):

Least Squares ◽

Markov Decision Process ◽

Decision Process ◽

Recursive Least Squares ◽

Iteration Algorithm ◽

Continuous State ◽

Markov Decision ◽

Approximate Policy Iteration ◽

Policy Iteration Algorithm ◽

Action Spaces


Continuous-time Markov decision process with average reward: Using reinforcement learning method

2015 34th Chinese Control Conference (CCC) ◽

10.1109/chicc.2015.7260117 ◽

2015 ◽

Author(s):

Shengde Jia ◽

Lincheng Shen ◽

Hongtao Xue

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Continuous Time ◽

Decision Process ◽

Learning Method ◽

Average Reward ◽

Markov Decision


Blackwell optimal policies in a Markov decision process with a Borel state space

Mathematical Methods of Operations Research ◽

10.1007/bf01432969 ◽

1994 ◽

Vol 40 (3) ◽

pp. 253-288 ◽

Cited By ~ 8

Author(s):

A. A. Yushkevich

Keyword(s):

State Space ◽

Markov Decision Process ◽

Decision Process ◽

Borel State Space ◽

Optimal Policies ◽

Markov Decision


Universal Reinforcement Learning Algorithms: Survey and Experiments

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/194 ◽

2017 ◽

Author(s):

John Aslanides ◽

Jan Leike ◽

Marcus Hutter

Keyword(s):

Reinforcement Learning ◽

Open Source ◽

Markov Decision Process ◽

Decision Process ◽

Empirical Investigation ◽

State Of The Art ◽

Learning Algorithms ◽

Markov Decision ◽

Reference Implementation ◽

Partially Observable

Many state-of-the-art reinforcement learning (RL) algorithms typically assume that the environment is an ergodic Markov Decision Process (MDP). In contrast, the field of universal reinforcement learning (URL) is concerned with algorithms that make as few assumptions as possible about the environment. The universal Bayesian agent AIXI and a family of related URL algorithms have been developed in this setting. While numerous theoretical optimality results have been proven for these agents, there has been no empirical investigation of their behavior to date. We present a short and accessible survey of these URL algorithms under a unified notation and framework, along with results of some experiments that qualitatively illustrate some properties of the resulting policies, and their relative performance on partially-observable gridworld environments. We also present an open-source reference implementation of the algorithms which we hope will facilitate further understanding of, and experimentation with, these ideas.


Design Synthesis through a Markov Decision Process and Reinforcement Learning Framework

Journal of Computing and Information Science in Engineering ◽

10.1115/1.4051598 ◽

2021 ◽

pp. 1-19

Author(s):

Maximilian Ororbia ◽

Gordon P. Warn

Keyword(s):

Reinforcement Learning ◽

Optimal Design ◽

Markov Decision Process ◽

Decision Process ◽

Plastic Material ◽

Cross Sectional ◽

Design Synthesis ◽

Learning Agent ◽

Markov Decision ◽

Elastic Plastic Material

This paper presents a framework that mathematically models optimal design synthesis as a Markov Decision Process that is solved with reinforcement learning. In this context, the states correspond to specific design configurations, the actions correspond to the available alterations modeled after generative design grammars, and the immediate rewards are constructed to be related to the improvement in the altered configuration's performance with respect to the design objective. Since in the context of optimal design synthesis the immediate rewards are in general not known at the onset of the process, reinforcement learning is employed to efficiently solve the MDP. The goal of the reinforcement learning agent is to maximize the cumulative rewards and hence synthesize the best performing or optimal design. The framework is demonstrated for the optimization of planar trusses with binary cross-sectional areas, and its utility is investigated with four numerical examples, each with a unique combination of domain, constraint, and external force(s) considering both linear-elastic and elastic-plastic material behaviors. The design solutions obtained with the framework are also compared with other methods in order to demonstrate its efficiency and accuracy.
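The link between improvement-based rewards and the final design can be illustrated with a small sketch. It assumes the simplest such reward, the performance after an alteration minus the performance before it; the abstract only says the reward is related to the improvement, so this exact form is an assumption. The `Design` encoding and `toy_performance` are hypothetical and do not represent a structural analysis.

```python
from typing import Callable, List, Tuple

Design = Tuple[int, ...]  # hypothetical encoding: one binary choice per member


def improvement_rewards(designs: List[Design],
                        performance: Callable[[Design], int]) -> List[int]:
    """Reward for each alteration = performance after minus performance before."""
    return [performance(after) - performance(before)
            for before, after in zip(designs, designs[1:])]


def toy_performance(design: Design) -> int:
    # Toy objective (assumption): count positions that match a target pattern.
    target = (1, 0, 1, 1)
    return sum(int(a == b) for a, b in zip(design, target))


# A sequence of design configurations, each differing by one alteration.
trajectory = [(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 1, 0)]
rewards = improvement_rewards(trajectory, toy_performance)
print(rewards, sum(rewards))
```

Under this reward the undiscounted sum telescopes to the final performance minus the initial one, so maximizing cumulative reward amounts to maximizing the performance of the design the agent ends on.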


Reinforcement Learning Enables Field-Development Policy Optimization

Journal of Petroleum Technology ◽

10.2118/0921-0046-jpt ◽

2021 ◽

Vol 73 (09) ◽

pp. 46-47

Author(s):

Chris Carpenter

Keyword(s):

Dynamic Programming ◽

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Development Policy ◽

The State ◽

Field Development ◽

Markov Decision ◽

Policy Optimization ◽

Sequential Nature

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 201254, “Reinforcement Learning for Field-Development Policy Optimization,” by Giorgio De Paola, SPE, and Cristina Ibanez-Llano, Repsol, and Jesus Rios, IBM, et al., prepared for the 2020 SPE Annual Technical Conference and Exhibition, originally scheduled to be held in Denver, Colorado, 5–7 October. The paper has not been peer reviewed.

A field-development plan consists of a sequence of decisions. Each action taken affects the reservoir and conditions any future decision. This process, however, is inherently uncertain. The novelty of the approach proposed by the authors in the complete paper is the consideration of the sequential nature of the decisions through the framework of dynamic programming (DP) and reinforcement learning (RL). This methodology allows moving the focus from a static field-development plan optimization to a more dynamic framework that the authors call field-development policy optimization. This synopsis focuses on the methodology, while the complete paper also contains a real-field case of application of the methodology.

Methodology

Deep RL (DRL). RL is considered an important learning paradigm in artificial intelligence (AI) but differs from supervised or unsupervised learning, the most commonly known types currently studied in the field of machine learning. During the last decade, RL has attracted greater attention because of its success in applications related to games and self-driving cars. That success results from its combination with deep-learning architectures, known as DRL, which has allowed RL to scale to previously unsolvable problems and, therefore, solve much larger sequential decision problems. RL, also referred to as stochastic approximate dynamic programming, is a goal-directed sequential-learning-from-interaction paradigm.

The learner or agent is not told what to do but instead has to learn which actions or decisions yield a maximum reward through interaction with an uncertain environment without losing too much reward along the way. Learning from interaction to achieve a goal requires balancing exploration and exploitation of possible actions. Another key characteristic of this type of problem is its sequential nature, where the actions taken by the agent affect the environment itself and, therefore, the subsequent data it receives and the subsequent actions to be taken. Mathematically, such problems are formulated in the framework of the Markov decision process (MDP), which arises primarily in the field of optimal control.

An RL problem consists of two principal parts: the agent, or decision-making engine, and the environment, the interactive world for an agent (in this case, the reservoir). Sequentially, at each timestep, the agent takes an action (e.g., changing control rates or deciding a well location) that makes the environment (reservoir) transition from one state to another. Next, the agent receives a reward (e.g., a cash flow) and an observation of the state of the environment (partial or total) before taking the next action. All relevant information informing the agent of the state of the system is assumed to be included in the last state observed by the agent (Markov property). If the agent observes the full environment state once it has acted, the MDP is said to be fully observable; otherwise, a partially observable Markov decision process (POMDP) results. The agent’s objective is to learn a policy mapping from states (MDPs) or histories (POMDPs) to actions such that the agent’s cumulative (discounted) reward in the long run is maximized.
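The agent-environment interaction described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the method of the SPE paper: `ToyEnvironment` is a hypothetical, deterministic, fully observable four-state chain standing in for the reservoir, and `always_up` is a fixed policy rather than a learned one.

```python
from typing import Callable, Tuple


class ToyEnvironment:
    """Hypothetical four-state chain (states 0..3); stands in for the reservoir."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> Tuple[int, float]:
        # The action is +1 or -1; the state is clipped to the range [0, 3].
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0  # reward while in state 3
        return self.state, reward


def run_episode(env: ToyEnvironment, policy: Callable[[int], int],
                steps: int, gamma: float) -> float:
    """Run the act, transition, reward, observe loop; return the discounted return."""
    state = env.state
    discounted_return = 0.0
    for t in range(steps):
        action = policy(state)             # policy sees only the current state
        state, reward = env.step(action)   # environment transitions and pays out
        discounted_return += gamma ** t * reward
    return discounted_return


def always_up(state: int) -> int:
    return 1  # fixed policy: always move toward state 3


total = run_episode(ToyEnvironment(), always_up, steps=5, gamma=0.5)
print(total)
```

Because the policy takes only the current state as input, the sketch corresponds to the fully observable MDP case; a POMDP agent would instead map the history of observations to an action.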
