Model-Free IRL Using Maximum Likelihood Estimation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013951 ◽

2019 ◽

Vol 33 ◽

pp. 3951-3958

Author(s):

Vinamra Jain ◽

Prashant Doshi ◽

Bikramjit Banerjee

Keyword(s):

Reinforcement Learning ◽

Maximum Likelihood ◽

Likelihood Estimation ◽

Transition Function ◽

Inverse Reinforcement Learning ◽

Q Learning ◽

Reward Function ◽

Model Free ◽

Model Free Approach ◽

Q Function

The problem of learning an expert’s unknown reward function using a limited number of demonstrations recorded from the expert’s behavior is investigated in the area of inverse reinforcement learning (IRL). To gain traction in this challenging and underconstrained problem, IRL methods predominantly represent the reward function of the expert as a linear combination of known features. Most of the existing IRL algorithms either assume the availability of a transition function or provide a complex and inefficient approach to learn it. In this paper, we present a model-free approach to IRL, which casts IRL in the maximum likelihood framework. We present modifications of the model-free Q-learning that replace its maximization to allow computing the gradient of the Q-function. We use gradient ascent to update the feature weights to maximize the likelihood of expert’s trajectories. We demonstrate on two problem domains that our approach improves the likelihood compared to previous methods.

Download Full-text

Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems

Machine Learning ◽

10.1007/s10994-020-05939-8 ◽

2021 ◽

Author(s):

Amarildo Likmeta ◽

Alberto Maria Metelli ◽

Giorgia Ramponi ◽

Andrea Tirinzoni ◽

Matteo Giuliani ◽

...

Keyword(s):

Reinforcement Learning ◽

Real World ◽

Real Life ◽

User Preferences ◽

Inverse Reinforcement Learning ◽

Water Release ◽

Reward Function ◽

Model Free ◽

Conflicting Objectives ◽

Multiple Experts

AbstractIn real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understand how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and we present three application scenarios: (1) the high-level decision-making problem in the highway driving scenario, and (2) inferring the user preferences in a social network (Twitter), and (3) the management of the water release in the Como Lake. For each of these scenarios, we provide formalization, experiments and a discussion to interpret the obtained results.

Download Full-text

On the use of the policy gradient and Hessian in inverse reinforcement learning

Intelligenza Artificiale ◽

10.3233/ia-180011 ◽

2020 ◽

Vol 14 (1) ◽

pp. 117-150

Author(s):

Alberto Maria Metelli ◽

Matteo Pirotta ◽

Marcello Restelli

Keyword(s):

Reinforcement Learning ◽

Sequential Decision ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Model Free ◽

Learning Speed ◽

Policy Gradient ◽

Continuous Domains ◽

Learning Policies ◽

Finite Domains

Reinforcement Learning (RL) is an effective approach to solve sequential decision making problems when the environment is equipped with a reward function to evaluate the agent’s actions. However, there are several domains in which a reward function is not available and difficult to estimate. When samples of expert agents are available, Inverse Reinforcement Learning (IRL) allows recovering a reward function that explains the demonstrated behavior. Most of the classic IRL methods, in addition to expert’s demonstrations, require sampling the environment to evaluate each reward function, that, in turn, is built starting from a set of engineered features. This paper is about a novel model-free IRL approach that does not require to specify a function space where to search for the expert’s reward function. Leveraging on the fact that the policy gradient needs to be zero for an optimal policy, the algorithm generates an approximation space for the reward function, in which a reward is singled out employing a second-order criterion. After introducing our approach for finite domains, we extend it to continuous ones. The empirical results, on both finite and continuous domains, show that the reward function recovered by our algorithm allows learning policies that outperform those obtained with the true reward function, in terms of learning speed.

Download Full-text

Better Than Maximum Likelihood Estimation of Model-Based And Model-Free Learning Style

10.21203/rs.3.rs-880233/v1 ◽

2021 ◽

Author(s):

Sadjad Yazdani ◽

Abdol-Hossein Vahabie ◽

Babak Nadjar Araabi ◽

Majid Nili Ahmadabadi

Keyword(s):

Reinforcement Learning ◽

Maximum Likelihood ◽

Learning Style ◽

Human Behavior ◽

Nearest Neighbor ◽

Estimation Error ◽

Likelihood Estimation ◽

Estimation Methods ◽

Model Based ◽

Model Free

Abstract Various decision-making systems work together to shape human behavior. Habitual and goal-directed systems are the two most important ones that are studied by reinforcement learning (RL), using model-free and model-based learning methods, respectively. Human behavior resembles the weighted combination of these two systems. Such a combination is modeled by the weighted sum of action values of the model-based and model-free systems. The weighting parameter has been mostly extracted by "maximum likelihood" or "maximum a-posteriori" estimation methods. In this study, we show these two well-known methods bring many challenges, and their respective extracted values are less reliable, especially in the case of limited sample size or at the proximity of extremes values. We propose that using k‑nearest neighbor, as a free format estimate, can improve the estimation error. k-nn uses global information extracted from the behavior such as stay probability, along with fitted values. The proposed method is examined by simulated experiments, where obtained results indicate the advantage of our method in reducing both bias and variance of the error. Investigation of the human behavior data from previous studies shows that the proposed method results in more statistically robust estimates in predicting other behavioral indices such as the number of gaze directions toward each target or symptoms of some psychiatric disorders. In brief, the proposed method increases the reliability of the estimated parameters and enhances the applicability of reinforcement learning paradigms in clinical trials.

Download Full-text

Better than maximum likelihood estimation of model-based and model-free learning style

10.1101/296335 ◽

2018 ◽

Author(s):

Sadjad Yazdani ◽

Abdol-Hossein Vahabie ◽

Babak Nadjar Araabi ◽

Majid Nili Ahmadabadi

Keyword(s):

Reinforcement Learning ◽

Maximum Likelihood ◽

Learning Style ◽

Human Behavior ◽

Nearest Neighbor ◽

Estimation Error ◽

Likelihood Estimation ◽

K Nearest Neighbor ◽

Model Based ◽

Model Free

AbstractMultiple decision making systems work together to shape the final choices in human behavior. Habitual and goal-directed systems are the two most important systems that are studied in the reinforcement learning (RL) literature by model-free and model-based learning methods. Human behavior resembles the weighted combination of these systems and such a combination is modeled by weighted summation of action’s value from the model based and model free systems. Extraction of this weighted parameter, which is important for many applications and computational modeling, has been mostly based on the maximum likelihood or maximum a posteriori methods. We show these methods bring many challenges and their respective extracted values are less reliable especially in the proximity of extremes values. We propose that using a free format learning method (k-nearest neighbor) which uses more information besides the fitted values e.g. global information like stay probability instead of trial by trial information can ameliorate the estimation error. The proposed method is examined by simulation and results show the advantage of the proposed method. In addition, investigation of the human behavior data from previous researchers proved the proposed method to result in more statistically robust results in predicting other behavioral indices such as the number of gaze directions toward each target. In brief, the proposed method increases the reliability of the estimated parameters and enhances the applicability of reinforcement learning paradigms in clinical trials.

Download Full-text

Inverse Reinforcement Learning for Team Sports: Valuing Actions and Players

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/464 ◽

2020 ◽

Author(s):

Yudong Luo ◽

Oliver Schulte ◽

Pascal Poupart

Keyword(s):

Reinforcement Learning ◽

Single Agent ◽

Empirical Evaluation ◽

Team Sports ◽

Ice Hockey ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Player Ranking ◽

The Impact ◽

Q Function

A major task of sports analytics is to rank players based on the impact of their actions. Recent methods have applied reinforcement learning (RL) to assess the value of actions from a learned action value or Q-function. A fundamental challenge for estimating action values is that explicit reward signals (goals) are very sparse in many team sports, such as ice hockey and soccer. This paper combines Q-function learning with inverse reinforcement learning (IRL) to provide a novel player ranking method. We treat professional play as expert demonstrations for learning an implicit reward function. Our method alternates single-agent IRL to learn a reward function for multiple agents; we provide a theoretical justification for this procedure. Knowledge transfer is used to combine learned rewards and observed rewards from goals. Empirical evaluation, based on 4.5M play-by-play events in the National Hockey League (NHL), indicates that player ranking using the learned rewards achieves high correlations with standard success measures and temporal consistency throughout a season.

Download Full-text

Inverse reinforcement learning in contextual MDPs

Machine Learning ◽

10.1007/s10994-021-05984-x ◽

2021 ◽

Author(s):

Stav Belogolovsky ◽

Philip Korsunsky ◽

Shie Mannor ◽

Chen Tessler ◽

Tom Zahavy

Keyword(s):

Reinforcement Learning ◽

Optimization Problem ◽

Decision Processes ◽

Inverse Reinforcement Learning ◽

Convex Optimization Problem ◽

Reward Function ◽

Dynamic Treatment Regime ◽

Markov Decision ◽

Dynamic Treatment ◽

Recorded Data

AbstractWe consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differential convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.

Download Full-text

Integrating Production Planning with Truck-Dispatching Decisions through Reinforcement Learning While Managing Uncertainty

Minerals ◽

10.3390/min11060587 ◽

2021 ◽

Vol 11 (6) ◽

pp. 587

Author(s):

Joao Pedro de Carvalho ◽

Roussos Dimitrakopoulos

Keyword(s):

Reinforcement Learning ◽

Discrete Event ◽

Mining Operations ◽

Fixed Sequence ◽

Q Learning ◽

Reward Function ◽

Copper Gold ◽

Mining Complex ◽

Learning Reinforcement ◽

Operational Plan

This paper presents a new truck dispatching policy approach that is adaptive given different mining complex configurations in order to deliver supply material extracted by the shovels to the processors. The method aims to improve adherence to the operational plan and fleet utilization in a mining complex context. Several sources of operational uncertainty arising from the loading, hauling and dumping activities can influence the dispatching strategy. Given a fixed sequence of extraction of the mining blocks provided by the short-term plan, a discrete event simulator model emulates the interaction arising from these mining operations. The continuous repetition of this simulator and a reward function, associating a score value to each dispatching decision, generate sample experiences to train a deep Q-learning reinforcement learning model. The model learns from past dispatching experience, such that when a new task is required, a well-informed decision can be quickly taken. The approach is tested at a copper–gold mining complex, characterized by uncertainties in equipment performance and geological attributes, and the results show improvements in terms of production targets, metal production, and fleet management.

Download Full-text

Deep Inverse Reinforcement Learning for Reward Function Identification in Bidding Models

IEEE Transactions on Power Systems ◽

10.1109/tpwrs.2021.3076296 ◽

2021 ◽

pp. 1-1

Author(s):

Hongye Guo ◽

Qixin Chen ◽

Qing Xia ◽

Chongqing Kang

Keyword(s):

Reinforcement Learning ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Function Identification ◽

Bidding Models

Download Full-text

Preceding vehicle following algorithm with human driving characteristics

Proceedings of the Institution of Mechanical Engineers Part D Journal of Automobile Engineering ◽

10.1177/0954407020981546 ◽

2021 ◽

pp. 095440702098154

Author(s):

Feng Pan ◽

Hong Bao

Keyword(s):

Reinforcement Learning ◽

Weight Vector ◽

Gradient Algorithm ◽

Inner Product ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Human Driver ◽

Policy Gradient ◽

Preceding Vehicle ◽

Action Spaces

This paper proposes a new approach of using reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We refer to the ideal of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into reward vectors, and the reward function was defined as the inner product of the reward vector and weights. Driving data of human drivers was collected and analyzed to obtain the true reward function. The RL model was trained with the deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model could continuously approach that of a human driver. After dozens of rounds of training, we selected the policy with the nearest value vector to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance for the task of an agent following the preceding vehicle safely and smoothly.

Download Full-text

Model-Free Deep Inverse Reinforcement Learning by Logistic Regression

Neural Processing Letters ◽

10.1007/s11063-017-9702-7 ◽

2017 ◽

Vol 47 (3) ◽

pp. 891-905 ◽

Cited By ~ 4

Author(s):

Eiji Uchibe

Keyword(s):

Logistic Regression ◽

Reinforcement Learning ◽

Inverse Reinforcement Learning ◽

Model Free

Download Full-text