Apprenticeship Learning via Frank-Wolfe

We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) without an explicit reward function. Instead, we observe an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope – the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent well-known Projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from convex optimization literature and derive tighter convergence bounds on AL. Specifically, we show that a variation of the FW method that is based on taking “away steps” achieves a linear rate of convergence when applied to AL and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations. We also experimentally show that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.

Download Full-text

FLOW SHOP SCHEDULING WITH REINFORCEMENT LEARNING

Asia Pacific Journal of Operational Research ◽

10.1142/s0217595913500140 ◽

2013 ◽

Vol 30 (05) ◽

pp. 1350014 ◽

Cited By ~ 2

Author(s):

ZHICONG ZHANG ◽

WEIPING WANG ◽

SHOUYAN ZHONG ◽

KAISHUN HU

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Large Scale ◽

Flow Shop ◽

Flow Shop Scheduling ◽

Scheduling Problems ◽

Shop Scheduling ◽

Reward Function ◽

Markov Decision

Reinforcement learning (RL) is a state or action value based machine learning method which solves large-scale multi-stage decision problems such as Markov Decision Process (MDP) and Semi-Markov Decision Process (SMDP) problems. We minimize the makespan of flow shop scheduling problems with an RL algorithm. We convert flow shop scheduling problems into SMDPs by constructing elaborate state features, actions and the reward function. Minimizing the accumulated reward is equivalent to minimizing the schedule objective function. We apply on-line TD(λ) algorithm with linear gradient-descent function approximation to solve the SMDPs. To examine the performance of the proposed RL algorithm, computational experiments are conducted on benchmarking problems in comparison with other scheduling methods. The experimental results support the efficiency of the proposed algorithm and illustrate that the RL approach is a promising computational approach for flow shop scheduling problems worthy of further investigation.

Download Full-text

Carrier-borne aircrafts aviation operation automated scheduling using multiplicative weights apprenticeship learning

International Journal of Advanced Robotic Systems ◽

10.1177/1729881419828917 ◽

2019 ◽

Vol 16 (1) ◽

pp. 172988141982891 ◽

Cited By ~ 1

Author(s):

Mao Zheng ◽

Fangqing Yang ◽

Zaopeng Dong ◽

Shuo Xie ◽

Xiumin Chu

Keyword(s):

Artificial Intelligence ◽

Decision Process ◽

Learning Algorithm ◽

Aircraft Carrier ◽

Scheduling Policy ◽

Apprenticeship Learning ◽

Markov Decision ◽

Artificial Intelligence Technology ◽

Aviation Operations ◽

Simulative Model

Efficiency and safety are vital for aviation operations in order to improve the combat capacity of aircraft carrier. In this article, the theory of apprenticeship learning, as a kind of artificial intelligence technology, is applied to constructing the method of automated scheduling. First, with the use of Markov decision process frame, the simulative model of aircrafts launching and recovery was established. Second, the multiplicative weights apprenticeship learning algorithm was applied to creating the optimized scheduling policy. In the situation with an expert to learn from, the learned policy matches quite well with the expert’s demonstration and the total deviations can be limited within 3%. Finally, in the situation without expert’s demonstration, the policy generated by multiplicative weights apprenticeship learning algorithm shows an obvious superiority compared to the three human experts. The results of different operation situations show that the method is highly robust and well functional.

Download Full-text

Efficient PAC Reinforcement Learning in Regular Decision Processes

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/279 ◽

2021 ◽

Author(s):

Alessandro Ronca ◽

Giuseppe De Giacomo

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Polynomial Time ◽

Optimal Policy ◽

Decision Process ◽

Transition Function ◽

Decision Processes ◽

Reward Function ◽

Markov Decision ◽

Reward Functions

Recently regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and it reasonably captures the difficulty of a regular decision process.

Download Full-text

A Markov-Based Multi-Attribute Vertical Handoff Decision Algorithm

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.785-786.1403 ◽

2013 ◽

Vol 785-786 ◽

pp. 1403-1407

Author(s):

Qing Yang Song ◽

Xun Li ◽

Shu Yu Ding ◽

Zhao Long Ning

Keyword(s):

Markov Decision Process ◽

Decision Process ◽

Decision Problem ◽

Vertical Handoff ◽

Decision Algorithm ◽

Reward Function ◽

Total Reward ◽

Call Dropping ◽

Markov Decision ◽

The Impact

Many vertical handoff decision algorithms have not considered the impact of call dropping during the vertical handoff decision process. Besides, most of current multi-attribute vertical handoff algorithms cannot predict users’ specific circumstances dynamically. In this paper, we formulate the vertical handoff decision problem as a Markov decision process, with the objective of maximizing the expected total reward during the handoff procedure. A reward function is formulated to assess the service quality during each connection. The G1 and entropy methods are applied in an iterative way, by which we can work out a stationary deterministic policy. Numerical results demonstrate the superiority of our proposed algorithm compared with the existing methods.

Download Full-text

An Overview of Inverse Reinforcement Learning Techniques

Intelligent Environments 2021 - Ambient Intelligence and Smart Environments ◽

10.3233/aise210097 ◽

2021 ◽

Author(s):

Syed Ihtesham Hussain Shah ◽

Giuseppe De Pietro

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Decision Process ◽

Autonomous Agents ◽

Theoretical Background ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Learning Techniques ◽

Markov Decision ◽

Potential Use

In decision-making problems reward function plays an important role in finding the best policy. Reinforcement Learning (RL) provides a solution for decision-making problems under uncertainty in an Intelligent Environment (IE). However, it is difficult to specify the reward function for RL agents in large and complex problems. To counter these problems an extension of RL problem named Inverse Reinforcement Learning (IRL) is introduced, where reward function is learned from expert demonstrations. IRL is appealing for its potential use to build autonomous agents, capable of modeling others, deprived of compromising in performance of the task. This approach of learning by demonstrations relies on the framework of Markov Decision Process (MDP). This article elaborates original IRL algorithms along with their close variants to mitigate challenges. The purpose of this paper is to highlight an overview and theoretical background of IRL in the field of Machine Learning (ML) and Artificial Intelligence (AI). We presented a brief comparison between different variants of IRL in this article.

Download Full-text

Automatic first-arrival picking method via intelligent Markov optimal decision processes

Journal of Geophysics and Engineering ◽

10.1093/jge/gxab026 ◽

2021 ◽

Vol 18 (3) ◽

pp. 406-417

Author(s):

Fei Luo ◽

Bo Feng ◽

Huazhong Wang

Keyword(s):

Markov Decision Process ◽

Seismic Data ◽

Decision Process ◽

Signal To Noise Ratio ◽

Optimal Decision ◽

Reward Function ◽

Markov Decision ◽

First Arrival ◽

Optimized Model ◽

Feature Attribute

Abstract Picking the first arrival is an important step in seismic processing. The large volume of the seismic data calls for automatic and objective picking. In this paper, we formulate first-arrival picking as an intelligent Markov decision process in the multi-dimensional feature attribute space. By designing a reasonable model, the global optimization is carried out in the reward function space to obtain the path with the largest cumulative reward value, to achieve the purpose of automatically picking up the first arrival. The state-value function contains a distance-related discount factor γ, which enables the Markov decision process to pick up the first-arrival continuity to consider the lateral continuity of the seismic data and avoid the bad trace information in the seismic data. On this basis, the method of this paper further introduces the optimized model that is a fuzzy clustering-based multi-dimensional attribute reward function and structure-based Gaussian stochastic policy, thereby reducing the difficulty of model design, and making the seismic data pick up more accurately and automatically. Testing this approach in the field seismic data reveals its properties and shows it can automatically pick up more reasonable first arrivals and has a certain quality control ability, especially the first-arrival energy is weak (the signal-to-noise ratio is low) or there are adjacent complex waveforms in the shallow layer.

Download Full-text

Controllable Summarization with Constrained Markov Decision Process

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00423 ◽

2021 ◽

Vol 9 ◽

pp. 1213-1232

Author(s):

Hou Pong Chan ◽

Lu Wang ◽

Irwin King

Keyword(s):

Markov Decision Process ◽

Decision Process ◽

Gain Control ◽

Text Summarization ◽

Reward Function ◽

Markov Decision ◽

Constrained Markov Decision Process

Abstract We study controllable text summarization, which allows users to gain control on a particular attribute (e.g., length limit) of the generated summaries. In this work, we propose a novel training framework based on Constrained Markov Decision Process (CMDP), which conveniently includes a reward function along with a set of constraints, to facilitate better summarization control. The reward function encourages the generation to resemble the human-written reference, while the constraints are used to explicitly prevent the generated summaries from violating user-imposed requirements. Our framework can be applied to control important attributes of summarization, including length, covered entities, and abstractiveness, as we devise specific constraints for each of these aspects. Extensive experiments on popular benchmarks show that our CMDP framework helps generate informative summaries while complying with a given attribute’s requirement.1

Download Full-text

A Markov Decision Process Approach for Cost-Benefit Analysis of Infrastructure Resilience Upgrades

SSRN Electronic Journal ◽

10.2139/ssrn.3657479 ◽

2020 ◽

Author(s):

Qianru Zhu ◽

Benjamin D. Leibowicz

Keyword(s):

Markov Decision Process ◽

Decision Process ◽

Cost Benefit Analysis ◽

Cost Benefit ◽

Process Approach ◽

Benefit Analysis ◽

Markov Decision ◽

Infrastructure Resilience

Download Full-text

Inverse reinforcement learning in contextual MDPs

Machine Learning ◽

10.1007/s10994-021-05984-x ◽

2021 ◽

Author(s):

Stav Belogolovsky ◽

Philip Korsunsky ◽

Shie Mannor ◽

Chen Tessler ◽

Tom Zahavy

Keyword(s):

Reinforcement Learning ◽

Optimization Problem ◽

Decision Processes ◽

Inverse Reinforcement Learning ◽

Convex Optimization Problem ◽

Reward Function ◽

Dynamic Treatment Regime ◽

Markov Decision ◽

Dynamic Treatment ◽

Recorded Data

AbstractWe consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differential convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.

Download Full-text

A Markov Decision Process Workflow for Automating Interior Design

KSCE Journal of Civil Engineering ◽

10.1007/s12205-021-1272-6 ◽

2021 ◽

Author(s):

Ebrahim Karan ◽

Sadegh Asgari ◽

Abbas Rashidi

Keyword(s):

Markov Decision Process ◽

Interior Design ◽

Decision Process ◽

Markov Decision

Download Full-text