Approximate Inverse Reinforcement Learning from Vision-based Imitation Learning

High Quality ◽

Overall Performance

The performance of adversarial dialogue generation models relies on the quality of the reward signal produced by the discriminator. The reward signal from a poor discriminator can be very sparse and unstable, which may lead the generator to fall into a local optimum or to produce nonsense replies. To alleviate the first problem, we first extend a recently proposed adversarial dialogue generation method to an adversarial imitation learning solution. Then, in the framework of adversarial inverse reinforcement learning, we propose a new reward model for dialogue generation that can provide a more accurate and precise reward signal for generator training. We evaluate the performance of the resulting model with automatic metrics and human evaluations in two annotation settings. Our experimental results demonstrate that our model can generate more high-quality responses and achieve higher overall performance than the state-of-the-art.

Generative Adversarial Imitation Learning for Steering an Unmanned Surface Vehicle

Proceedings of the Northern Lights Deep Learning Workshop ◽

10.7557/18.5147 ◽

2020 ◽

Vol 1 ◽

pp. 6

Author(s):

Alexandra Vedeler ◽

Narada Warakagoda

Keyword(s):

Reinforcement Learning ◽

Obstacle Avoidance ◽

Complex Dynamics ◽

Imitation Learning ◽

Radar Sensor ◽

Single Action ◽

Reward Function ◽

End To End ◽

Insight Into

The task of obstacle avoidance using maritime vessels, such as Unmanned Surface Vehicles (USV), has traditionally been solved using specialized modules that are designed and optimized separately. However, this approach requires a deep insight into the environment, the vessel, and their complex dynamics. We propose an alternative method using Imitation Learning (IL) through Deep Reinforcement Learning (RL) and Deep Inverse Reinforcement Learning (IRL) and present a system that learns an end-to-end steering model capable of mapping radar-like images directly to steering actions in an obstacle avoidance scenario. The USV used in the work is equipped with a radar sensor, and we studied the problem of generating a single action parameter, heading. We apply an IL algorithm known as generative adversarial imitation learning (GAIL) to develop an end-to-end steering model for a scenario where avoidance of an obstacle is the goal. The performance of the system was studied for different design choices and compared to that of a system that is based on pure RL. The results indicate that the IL system is able to grasp the concept of the task and that it is in many ways on par with the RL system. We deem this to be promising for future use in tasks that are not as easily described by a reward function.

Inverse Reinforcement Learning and Imitation Learning

Machine Learning in Finance ◽

10.1007/978-3-030-41068-1_11 ◽

2020 ◽

pp. 419-517

Author(s):

Matthew F. Dixon ◽

Igor Halperin ◽

Paul Bilokon

Keyword(s):

Reinforcement Learning ◽

Imitation Learning ◽

Robot Manipulation Learning Using Generative Adversarial Imitation Learning

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/678 ◽

2021 ◽

Author(s):

Mohamed Khalil Jabri

Keyword(s):

Reinforcement Learning ◽

Imitation Learning ◽

Generative Adversarial Networks ◽

Temporal Consistency ◽

Complex Environments ◽

Learning Problem ◽

Robot Manipulation ◽

Adversarial Networks

Imitation learning allows learning complex behaviors given demonstrations. However, early approaches based on either Behavior Cloning or Inverse Reinforcement Learning scaled poorly to complex environments. A more promising approach, Generative Adversarial Imitation Learning, tackles the imitation learning problem by drawing a connection with Generative Adversarial Networks. In this work, we advocate the use of this class of methods and investigate possible extensions by endowing them with global temporal consistency, in particular through a contrastive-learning-based approach.

Generative Imitation Learning using Forward and Inverse Reinforcement Learning

Journal of the Robotics Society of Japan ◽

10.7210/jrsj.39.617 ◽

2021 ◽

Vol 39 (7) ◽

pp. 617-620

Author(s):

Eiji Uchibe

Keyword(s):

Reinforcement Learning ◽

Imitation Learning ◽

Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning

IEEE Transactions on Neural Networks and Learning Systems ◽

10.1109/tnnls.2016.2543000 ◽

2017 ◽

Vol 28 (8) ◽

pp. 1814-1826 ◽

Cited By ~ 18

Author(s):

Bilal Piot ◽

Matthieu Geist ◽

Olivier Pietquin

Keyword(s):

Reinforcement Learning ◽

Imitation Learning ◽

Ensemble Inverse Reinforcement Learning from Semi-Expert Agents

IEEJ Transactions on Electronics Information and Systems ◽

10.1541/ieejeiss.137.667 ◽

2017 ◽

Vol 137 (4) ◽

pp. 667-673

Author(s):

Shinji Tomita ◽

Fumiya Hamatsu ◽

Tomoki Hamagami

Keyword(s):

Reinforcement Learning ◽

Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/891 ◽

2019 ◽

Author(s):

Ritesh Noothigattu ◽

Djallel Bouneffouf ◽

Nicholas Mattei ◽

Rachita Chandra ◽

Piyush Madan ◽

...

Keyword(s):

Reinforcement Learning ◽

Ethical Values ◽

Large Role ◽

Learning To Learn ◽

Time Step ◽

Novel Approach

Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent on which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.

Inverse reinforcement learning in contextual MDPs

Machine Learning ◽

10.1007/s10994-021-05984-x ◽

2021 ◽

Author(s):

Stav Belogolovsky ◽

Philip Korsunsky ◽

Shie Mannor ◽

Chen Tessler ◽

Tom Zahavy

Keyword(s):

Reinforcement Learning ◽

Optimization Problem ◽

Decision Processes ◽

Convex Optimization Problem ◽

Reward Function ◽

Dynamic Treatment Regime ◽

Markov Decision ◽

Dynamic Treatment ◽

Recorded Data

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differential convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.

Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems

Machine Learning ◽

10.1007/s10994-020-05939-8 ◽

2021 ◽

Author(s):

Amarildo Likmeta ◽

Alberto Maria Metelli ◽

Giorgia Ramponi ◽

Andrea Tirinzoni ◽

Matteo Giuliani ◽

...

Keyword(s):

Reinforcement Learning ◽

Real World ◽

Real Life ◽

User Preferences ◽

Water Release ◽

Reward Function ◽

Model Free ◽

Conflicting Objectives ◽

Multiple Experts

In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging, as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and we present three application scenarios: (1) the high-level decision-making problem in the highway driving scenario, (2) inferring the user preferences in a social network (Twitter), and (3) the management of the water release in Lake Como. For each of these scenarios, we provide formalization, experiments and a discussion to interpret the obtained results.