Estimation of Reward Function and Policy of Silkmoth Odor Source Searching by Inverse Reinforcement Learning

AbstractWe consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differential convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.

Download Full-text

Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems

Machine Learning ◽

10.1007/s10994-020-05939-8 ◽

2021 ◽

Author(s):

Amarildo Likmeta ◽

Alberto Maria Metelli ◽

Giorgia Ramponi ◽

Andrea Tirinzoni ◽

Matteo Giuliani ◽

...

Keyword(s):

Reinforcement Learning ◽

Real World ◽

Real Life ◽

User Preferences ◽

Inverse Reinforcement Learning ◽

Water Release ◽

Reward Function ◽

Model Free ◽

Conflicting Objectives ◽

Multiple Experts

AbstractIn real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understand how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and we present three application scenarios: (1) the high-level decision-making problem in the highway driving scenario, and (2) inferring the user preferences in a social network (Twitter), and (3) the management of the water release in the Como Lake. For each of these scenarios, we provide formalization, experiments and a discussion to interpret the obtained results.

Download Full-text

Deep Inverse Reinforcement Learning for Reward Function Identification in Bidding Models

IEEE Transactions on Power Systems ◽

10.1109/tpwrs.2021.3076296 ◽

2021 ◽

pp. 1-1

Author(s):

Hongye Guo ◽

Qixin Chen ◽

Qing Xia ◽

Chongqing Kang

Keyword(s):

Reinforcement Learning ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Function Identification ◽

Bidding Models

Download Full-text

Preceding vehicle following algorithm with human driving characteristics

Proceedings of the Institution of Mechanical Engineers Part D Journal of Automobile Engineering ◽

10.1177/0954407020981546 ◽

2021 ◽

pp. 095440702098154

Author(s):

Feng Pan ◽

Hong Bao

Keyword(s):

Reinforcement Learning ◽

Weight Vector ◽

Gradient Algorithm ◽

Inner Product ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Human Driver ◽

Policy Gradient ◽

Preceding Vehicle ◽

Action Spaces

This paper proposes a new approach of using reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We refer to the ideal of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into reward vectors, and the reward function was defined as the inner product of the reward vector and weights. Driving data of human drivers was collected and analyzed to obtain the true reward function. The RL model was trained with the deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model could continuously approach that of a human driver. After dozens of rounds of training, we selected the policy with the nearest value vector to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance for the task of an agent following the preceding vehicle safely and smoothly.

Download Full-text

Safety-aware Adversarial Inverse Reinforcement Learning (S-AIRL) for Highway Autonomous Driving

Journal of Autonomous Vehicles and Systems ◽

10.1115/1.4053427 ◽

2022 ◽

pp. 1-14

Author(s):

Fangjian Li ◽

John R Wagner ◽

Yue Wang

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Risky Behaviors ◽

Autonomous Driving ◽

Inverse Reinforcement Learning ◽

Safety Issues ◽

Reward Function ◽

Sampling Process ◽

Safety Awareness ◽

Driving Scenario

Abstract Inverse reinforcement learning (IRL) has been successfully applied in many robotics and autonomous driving studies without the need for hand-tuning a reward function. However, it suffers from safety issues. Compared to the reinforcement learning (RL) algorithms, IRL is even more vulnerable to unsafe situations as it can only infer the importance of safety based on expert demonstrations. In this paper, we propose a safety-aware adversarial inverse reinforcement learning algorithm (S-AIRL). First, the control barrier function (CBF) is used to guide the training of a safety critic, which leverages the knowledge of system dynamics in the sampling process without training an additional guiding policy. The trained safety critic is then integrated into the discriminator to help discern the generated data and expert demonstrations from the standpoint of safety. Finally, to further improve the safety awareness, a regulator is introduced in the loss function of the discriminator training to prevent the recovered reward function from assigning high rewards to the risky behaviors. We tested our S-AIRL in the highway autonomous driving scenario. Comparing to the original AIRL algorithm, with the same level of imitation learning (IL) performance, the proposed S-AIRL can reduce the collision rate by 32.6%.

Download Full-text

Deep inverse reinforcement learning for structural evolution of small molecules

Briefings in Bioinformatics ◽

10.1093/bib/bbaa364 ◽

2020 ◽

Author(s):

Brighter Agyemang ◽

Wei-Ping Wu ◽

Daniel Addo ◽

Michael Y Kpiebaareh ◽

Ebenezer Nanor ◽

...

Keyword(s):

Reinforcement Learning ◽

High Throughput Screening ◽

Structural Evolution ◽

Search Space ◽

New Drugs ◽

Inverse Reinforcement Learning ◽

Generative Adversarial Network ◽

Entropy Maximization ◽

Reward Function ◽

Adversarial Network

Abstract The size and quality of chemical libraries to the drug discovery pipeline are crucial for developing new drugs or repurposing existing drugs. Existing techniques such as combinatorial organic synthesis and high-throughput screening usually make the process extraordinarily tough and complicated since the search space of synthetically feasible drugs is exorbitantly huge. While reinforcement learning has been mostly exploited in the literature for generating novel compounds, the requirement of designing a reward function that succinctly represents the learning objective could prove daunting in certain complex domains. Generative adversarial network-based methods also mostly discard the discriminator after training and could be hard to train. In this study, we propose a framework for training a compound generator and learn a transferable reward function based on the entropy maximization inverse reinforcement learning (IRL) paradigm. We show from our experiments that the IRL route offers a rational alternative for generating chemical compounds in domains where reward function engineering may be less appealing or impossible while data exhibiting the desired objective is readily available.

Download Full-text

Machine Teaching for Inverse Reinforcement Learning: Algorithms and Applications

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33017749 ◽

2019 ◽

Vol 33 ◽

pp. 7749-7758

Author(s):

Daniel S. Brown ◽

Scott Niekum

Keyword(s):

Reinforcement Learning ◽

Set Cover ◽

Sequential Decision ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Set Cover Problem ◽

Efficient Approximation Algorithm ◽

Minimum Number ◽

Teaching Problem ◽

Novel Applications

Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization. However, despite much recent interest in IRL, little work has been done to understand the minimum set of demonstrations needed to teach a specific sequential decisionmaking task. We formalize the problem of finding maximally informative demonstrations for IRL as a machine teaching problem where the goal is to find the minimum number of demonstrations needed to specify the reward equivalence class of the demonstrator. We extend previous work on algorithmic teaching for sequential decision-making tasks by showing a reduction to the set cover problem which enables an efficient approximation algorithm for determining the set of maximallyinformative demonstrations. We apply our proposed machine teaching algorithm to two novel applications: providing a lower bound on the number of queries needed to learn a policy using active IRL and developing a novel IRL algorithm that can learn more efficiently from informative demonstrations than a standard IRL approach.

Download Full-text

Generative Adversarial Immitation Learning for Steering an Unmanned Surface Vehicle

Proceedings of the Northern Lights Deep Learning Workshop ◽

10.7557/18.5147 ◽

2020 ◽

Vol 1 ◽

pp. 6

Author(s):

Alexandra Vedeler ◽

Narada Warakagoda

Keyword(s):

Reinforcement Learning ◽

Obstacle Avoidance ◽

Complex Dynamics ◽

Imitation Learning ◽

Inverse Reinforcement Learning ◽

Radar Sensor ◽

Single Action ◽

Reward Function ◽

End To End ◽

Insight Into

The task of obstacle avoidance using maritime vessels, such as Unmanned Surface Vehicles (USV), has traditionally been solved using specialized modules that are designed and optimized separately. However, this approach requires a deep insight into the environment, the vessel, and their complex dynamics. We propose an alternative method using Imitation Learning (IL) through Deep Reinforcement Learning (RL) and Deep Inverse Reinforcement Learning (IRL) and present a system that learns an end-to-end steering model capable of mapping radar-like images directly to steering actions in an obstacle avoidance scenario. The USV used in the work is equipped with a Radar sensor and we studied the problem of generating a single action parameter, heading. We apply an IL algorithm known as generative adversarial imitation learning (GAIL) to develop an end-to-end steering model for a scenario where avoidance of an obstacle is the goal. The performance of the system was studied for different design choices and compared to that of a system that is based on pure RL. The IL system produces results that indicate it is able to grasp the concept of the task and that in many ways are on par with the RL system. We deem this to be promising for future use in tasks that are not as easily described by a reward function.

Download Full-text

Reward Function Considering Ambiguity Using Inverse Reinforcement Learning and Fuzzy Reasoning

Journal of Japan Society for Fuzzy Theory and Intelligent Informatics ◽

10.3156/jsoft.33.4_827 ◽

2021 ◽

Vol 33 (4) ◽

pp. 827-832

Author(s):

Yuta KATO ◽

Masayoshi KANOH ◽

Tsuyoshi NAKAMURA

Keyword(s):

Reinforcement Learning ◽

Fuzzy Reasoning ◽

Inverse Reinforcement Learning ◽

Reward Function

Download Full-text

Toward Diverse Text Generation with Inverse Reinforcement Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/606 ◽

2018 ◽

Cited By ~ 5

Author(s):

Zhan Shi ◽

Xinchi Chen ◽

Xipeng Qiu ◽

Xuanjing Huang

Keyword(s):

Reinforcement Learning ◽

Generative Models ◽

Training Data ◽

Great Success ◽

Text Generation ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Total Reward ◽

Policy Gradient ◽

Adversarial Models

Text generation is a crucial task in NLP. Recently, several adversarial generative models have been proposed to improve the exposure bias problem in text generation. Though these models gain great success, they still suffer from the problems of reward sparsity and mode collapse. In order to address these two problems, in this paper, we employ inverse reinforcement learning (IRL) for text generation. Specifically, the IRL framework learns a reward function on training data, and then an optimal policy to maximum the expected total reward. Similar to the adversarial models, the reward and policy function in IRL are optimized alternately. Our method has two advantages: (1) the reward function can produce more dense reward signals. (2) the generation policy, trained by ``entropy regularized'' policy gradient, encourages to generate more diversified texts. Experiment results demonstrate that our proposed method can generate higher quality texts than the previous methods.

Download Full-text