Real-World Reinforcement Learning: Observations from Two Successful Cases

Author(s):  
Philipp Back ◽  

Reinforcement Learning (RL) is a machine learning technique that enables artificial agents to learn optimal strategies for sequential decision-making problems. RL has achieved superhuman performance in artificial domains, yet real-world applications remain rare. We explore the drivers of successful RL adoption for solving practical business problems. We rely on publicly available secondary data on two cases: data center cooling at Google and trade order execution at JPMorgan. We perform a thematic analysis using a pre-defined coding framework based on the known challenges to real-world RL by Dulac-Arnold, Mankowitz, & Hester (2019). First, we find that RL works best when the problem dynamics can be simulated. Second, the ability to encode the desired agent behavior as a reward function is critical. Third, safety constraints are often necessary in the context of trial-and-error learning. Our work is amongst the first in Information Systems to discuss the practical business value of the emerging AI subfield of RL.
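The second and third findings (reward encoding and safety constraints) can be illustrated with a small, purely hypothetical sketch; the cooling setpoints, temperature thresholds, and penalty values below are assumptions for illustration, not details of Google's or JPMorgan's actual systems.

```python
# Hypothetical sketch: a shaped reward plus a hard safety constraint for a
# data-center-cooling-style agent. All names and numbers are assumptions.

def cooling_reward(energy_kwh: float, temp_c: float,
                   temp_limit_c: float = 28.0) -> float:
    """Reward = energy savings, with a large penalty when a safety limit is hit."""
    reward = -energy_kwh                  # minimise energy use
    if temp_c > temp_limit_c:             # safety constraint on server temperature
        reward -= 100.0                   # dominate the reward so the agent avoids it
    return reward

def safe_action(proposed_setpoint: float, low: float = 18.0, high: float = 27.0) -> float:
    """Clip trial-and-error actions to a safe operating range before execution."""
    return min(max(proposed_setpoint, low), high)
```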

2021 ◽  
Author(s):  
Amarildo Likmeta ◽  
Alberto Maria Metelli ◽  
Giorgia Ramponi ◽  
Andrea Tirinzoni ◽  
Matteo Giuliani ◽  
...  

In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging, as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and present three application scenarios: (1) the high-level decision-making problem in highway driving, (2) inferring user preferences in a social network (Twitter), and (3) the management of water release in Lake Como. For each of these scenarios, we provide a formalization, experiments, and a discussion to interpret the obtained results.
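As a rough illustration of the batch setting described above, the following sketch recovers linear reward weights from a fixed set of demonstrations only, with no further environment interaction. It is a generic softmax/feature-matching illustration on synthetic data, not the authors' truly batch model-free algorithm.

```python
# Minimal batch IRL sketch: fit linear reward weights to a fixed demonstration
# set; no environment interaction is used. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(0)

# Fixed batch: feature vectors for the K actions available in each state, plus
# the index of the action the expert actually took.
n_states, n_actions, n_features = 200, 3, 4
phi = rng.normal(size=(n_states, n_actions, n_features))   # action features
true_w = np.array([1.0, -0.5, 0.0, 2.0])
expert_a = (phi @ true_w).argmax(axis=1)                    # noiseless expert choices

def nll_grad(w):
    """Gradient of a softmax (max-entropy-style) action likelihood on the batch."""
    scores = phi @ w                                        # (n_states, n_actions)
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    expected_phi = (p[..., None] * phi).sum(axis=1)         # model feature expectation
    expert_phi = phi[np.arange(n_states), expert_a]         # expert feature expectation
    return (expected_phi - expert_phi).mean(axis=0)

w = np.zeros(n_features)
for _ in range(500):                                        # plain gradient descent
    w -= 0.5 * nll_grad(w)

print("recovered reward weights (up to scale):", np.round(w, 2))
```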


Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1292
Author(s):  
Neziha Akalin ◽  
Amy Loutfi

This article surveys reinforcement learning approaches in social robotics. Reinforcement learning is a framework for decision-making problems in which an agent interacts through trial and error with its environment to discover an optimal behavior. Since interaction is a key component in both reinforcement learning and social robotics, it can be a well-suited approach for real-world interactions with physically embodied social robots. The scope of the paper is focused particularly on studies that include social physical robots and real-world human-robot interactions with users. We present a thorough analysis of reinforcement learning approaches in social robotics. In addition to a survey, we categorize existing reinforcement learning approaches based on the method used and the design of the reward mechanisms. Moreover, since communication capability is a prominent feature of social robots, we discuss and group the papers based on the communication medium used for reward formulation. Considering the importance of designing the reward function, we also provide a categorization of the papers based on the nature of the reward. This categorization includes three major themes: interactive reinforcement learning, intrinsically motivated methods, and task performance-driven methods. The paper also discusses the benefits and challenges of reinforcement learning in social robotics, evaluates the surveyed studies on whether they use subjective or algorithmic measures, relates them to real-world reinforcement learning challenges and proposed solutions, and identifies the points that remain to be explored, including approaches that have thus far received less attention. Thus, this paper aims to serve as a starting point for researchers interested in using and applying reinforcement learning methods in this particular research field.
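To make the interactive reinforcement learning theme concrete, here is a minimal sketch in which a tabular agent's reward comes from (simulated) human feedback rather than from the task environment; the toy interaction states and the feedback rule are illustrative assumptions, not drawn from any surveyed system.

```python
# Toy interactive RL: tabular Q-learning where the reward is a stand-in for
# human feedback delivered through a social channel (speech, touch, gaze...).
import random

states = ["greeting", "conversation", "farewell"]
actions = ["speak", "gesture", "wait"]
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def human_feedback(state, action):
    """Simulated social reward: the 'user' approves of two state-action pairs."""
    return 1.0 if (state, action) in {("greeting", "speak"), ("farewell", "gesture")} else 0.0

for episode in range(500):
    s = "greeting"
    for s_next in ["conversation", "farewell", "greeting"]:
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda a_: Q[(s, a_)]))
        r = human_feedback(s, a)                           # interactive reward signal
        best_next = max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
```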


Author(s):  
Daniel S. Brown ◽  
Scott Niekum

Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization. However, despite much recent interest in IRL, little work has been done to understand the minimum set of demonstrations needed to teach a specific sequential decision-making task. We formalize the problem of finding maximally informative demonstrations for IRL as a machine teaching problem where the goal is to find the minimum number of demonstrations needed to specify the reward equivalence class of the demonstrator. We extend previous work on algorithmic teaching for sequential decision-making tasks by showing a reduction to the set cover problem, which enables an efficient approximation algorithm for determining the set of maximally informative demonstrations. We apply our proposed machine teaching algorithm to two novel applications: providing a lower bound on the number of queries needed to learn a policy using active IRL, and developing a novel IRL algorithm that can learn more efficiently from informative demonstrations than a standard IRL approach.
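The set-cover reduction lends itself to the standard greedy approximation. The sketch below shows only that selection step on synthetic data: the reward-space constraints a real algorithm would derive from the demonstrator's reward equivalence class are abstracted into integer ids here.

```python
# Greedy set cover over candidate demonstrations: each demonstration "covers"
# a set of abstract constraint ids; pick demos until all constraints are covered.
# The constraint sets are synthetic placeholders.

def greedy_demo_selection(constraints_by_demo: dict, all_constraints: set) -> list:
    """Repeatedly pick the demonstration covering the most uncovered constraints."""
    uncovered, chosen = set(all_constraints), []
    while uncovered:
        demo = max(constraints_by_demo,
                   key=lambda d: len(constraints_by_demo[d] & uncovered))
        gained = constraints_by_demo[demo] & uncovered
        if not gained:                      # remaining constraints cannot be covered
            break
        chosen.append(demo)
        uncovered -= gained
    return chosen

demos = {"traj_A": {1, 2, 3}, "traj_B": {3, 4}, "traj_C": {4, 5, 6}, "traj_D": {2, 6}}
print(greedy_demo_selection(demos, {1, 2, 3, 4, 5, 6}))     # -> ['traj_A', 'traj_C']
```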


Author(s):  
Yang Gao ◽  
Christian M. Meyer ◽  
Mohsen Mesgar ◽  
Iryna Gurevych

Document summarisation can be formulated as a sequential decision-making problem, which can be solved by Reinforcement Learning (RL) algorithms. The predominant RL paradigm for summarisation learns a cross-input policy, which requires considerable time, data and parameter tuning due to the huge search spaces and the delayed rewards. Learning input-specific RL policies is a more efficient alternative, but so far it has depended on handcrafted rewards, which are difficult to design and yield poor performance. We propose RELIS, a novel RL paradigm that learns a reward function with Learning-to-Rank (L2R) algorithms at training time and uses this reward function to train an input-specific RL policy at test time. We prove that RELIS is guaranteed to generate near-optimal summaries with appropriate L2R and RL algorithms. Empirically, we evaluate our approach on extractive multi-document summarisation. We show that RELIS reduces the training time by two orders of magnitude compared to the state-of-the-art models while performing on par with them.
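The two-stage recipe can be sketched as follows: fit a reward from pairwise preferences with a simple learning-to-rank objective at training time, then optimise an input-specific policy against that learned reward at test time. The features, synthetic data, and greedy selection below are illustrative stand-ins, not the RELIS components.

```python
# Stage 1: pairwise L2R reward on (better, worse) summary feature pairs.
# Stage 2: input-specific optimisation (greedy sentence selection here)
# against the learned reward. All data is synthetic.
import numpy as np

rng = np.random.default_rng(1)

# --- training time: pairwise ranking loss over summary features ---
n_pairs, n_feat = 300, 5
better = rng.normal(0.5, 1.0, size=(n_pairs, n_feat))
worse = rng.normal(0.0, 1.0, size=(n_pairs, n_feat))
w = np.zeros(n_feat)
for _ in range(200):                        # logistic pairwise ranking objective
    margin = (better - worse) @ w
    grad = -((1 / (1 + np.exp(margin)))[:, None] * (better - worse)).mean(axis=0)
    w -= 0.1 * grad

learned_reward = lambda feats: feats @ w

# --- test time: build a summary for one specific input using the learned reward ---
sentences = rng.normal(size=(20, n_feat))   # features of this input's candidate sentences
chosen = []
for _ in range(3):                          # greedily pick a 3-sentence summary
    best = max(range(len(sentences)),
               key=lambda i: learned_reward(sentences[i]) if i not in chosen else -np.inf)
    chosen.append(best)
print("selected sentence indices:", chosen)
```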


Author(s):  
Guanjie Zheng ◽  
Hanyang Liu ◽  
Kai Xu ◽  
Zhenhui Li

Traffic simulators are an essential component in the operation and planning of transportation systems. Conventional traffic simulators usually employ a calibrated physical car-following model to describe vehicles' behaviors and their interactions with the traffic environment. However, there is no universal physical model that can accurately predict the patterns of vehicle behavior in different situations. A fixed physical model tends to be less effective in a complicated environment given the non-stationary nature of traffic dynamics. In this paper, we formulate traffic simulation as an inverse reinforcement learning problem, and propose a parameter-sharing adversarial inverse reinforcement learning model for dynamics-robust simulation learning. Our proposed model is able to imitate a vehicle's trajectories in the real world while simultaneously recovering the reward function that reveals the vehicle's true objective, which is invariant across different dynamics. Extensive experiments on synthetic and real-world datasets show the superior performance of our approach compared to state-of-the-art methods and its robustness to varying traffic dynamics.
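A schematic adversarial-IRL loop in this spirit is sketched below: a discriminator learns to tell expert transitions from simulated ones, and its output serves as a surrogate reward for the simulated driver. Everything in the sketch (features, linear models, update rules) is a toy illustration, not the paper's parameter-sharing architecture.

```python
# Toy adversarial IRL: logistic discriminator (expert=1, simulated=0) whose
# weights act as a surrogate reward direction for a crude "policy" over features.
import numpy as np

rng = np.random.default_rng(2)
n_feat = 4
expert = rng.normal(1.0, 1.0, size=(500, n_feat))        # expert transition features
theta_d = np.zeros(n_feat)                                # discriminator weights
policy_mean = np.zeros(n_feat)                            # simulated driver's feature mean

sigmoid = lambda x: 1 / (1 + np.exp(-x))

for step in range(300):
    fake = rng.normal(policy_mean, 1.0, size=(500, n_feat))
    # discriminator update: one gradient step of logistic regression per source
    for X, y in ((expert, 1.0), (fake, 0.0)):
        p = sigmoid(X @ theta_d)
        theta_d += 0.05 * ((y - p) @ X) / len(X)
    # policy update: move generated features toward what the discriminator rewards
    policy_mean += 0.05 * theta_d                         # grad of log D(x) points along theta_d

print("recovered reward direction ~", np.round(theta_d, 2))
```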


2020 ◽  
Vol 14 (1) ◽  
pp. 117-150
Author(s):  
Alberto Maria Metelli ◽  
Matteo Pirotta ◽  
Marcello Restelli

Reinforcement Learning (RL) is an effective approach to solving sequential decision-making problems when the environment is equipped with a reward function to evaluate the agent's actions. However, there are several domains in which a reward function is not available and is difficult to estimate. When samples of expert agents are available, Inverse Reinforcement Learning (IRL) allows recovering a reward function that explains the demonstrated behavior. Most classic IRL methods, in addition to expert demonstrations, require sampling the environment to evaluate each candidate reward function, which, in turn, is built from a set of engineered features. This paper presents a novel model-free IRL approach that does not require specifying a function space in which to search for the expert's reward function. Leveraging the fact that the policy gradient must be zero for an optimal policy, the algorithm generates an approximation space for the reward function, in which a reward is singled out using a second-order criterion. After introducing our approach for finite domains, we extend it to continuous ones. The empirical results, on both finite and continuous domains, show that the reward function recovered by our algorithm allows learning policies that outperform, in terms of learning speed, those obtained with the true reward function.
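The zero-gradient criterion can be illustrated numerically: when the reward is a linear combination of basis rewards, the policy gradient is linear in the combination weights, so a candidate weight vector is the direction that (nearly) nulls the estimated gradient. The gradient matrix below is synthetic; in a real application it would be estimated from the expert's demonstrations, and the second-order criterion mentioned above is not reproduced here.

```python
# Recover reward-combination weights as the direction that nulls the policy
# gradient, via the smallest right singular vector of the per-reward gradient matrix.
import numpy as np

rng = np.random.default_rng(3)
n_policy_params, n_basis_rewards = 6, 4
true_w = np.array([0.7, 0.0, -0.7, 0.1])
true_w /= np.linalg.norm(true_w)

# J[i, j] = gradient of return under basis reward j w.r.t. policy parameter i.
# Built so that J @ true_w ~ 0, i.e. the expert is optimal for true_w.
A = rng.normal(size=(n_policy_params, n_basis_rewards))
J = A - np.outer(A @ true_w, true_w)          # project out the true direction
J += 0.01 * rng.normal(size=J.shape)          # estimation noise

_, _, vt = np.linalg.svd(J)
w_hat = vt[-1]                                # direction with smallest singular value
print("recovered reward weights:", np.round(w_hat, 2))
print("true weights (up to sign):", np.round(true_w, 2))
```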


2021 ◽  
Vol 11 (1) ◽  
pp. 104-113
Author(s):  
Walead Kaled Seaman ◽  
Sırma Yavuz

Compared with traditional motion planners, deep reinforcement learning (DRL) has been applied more and more widely to achieving sequential behavior control of mobile robots in indoor environments. Two issues of deep learning are addressed here: the inability to generalize to new sets of goals, and data inefficiency, i.e., the model requires many (often costly) trial-and-error loops. In this paper, we address these two issues and apply the proposed model to visual navigation that generalizes to previously unseen goals (target-driven navigation). To tackle the first issue, we propose an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization. To tackle the second issue, we simulate indoor 3D scenes with the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine. Our framework allows agents to take actions and interact with objects, so we can efficiently collect a large number of training samples for sequential decision making within the RL framework. In particular, we use a behavioral cloning approach, which enables the agent to imitate an expert (or mentor) policy without relying on a reward function, improving stability and generalization across targets. Beyond robotics, we note that fields such as healthcare and medical imaging stand to benefit from such end-to-end deep learning systems because of the sheer volume of data being generated.
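A minimal sketch of the target-driven idea follows, with a toy 1-D world and a tabular learner standing in for the visual actor-critic model: the value function takes both the current state and the goal as input, so the same learned parameters generalise across goals.

```python
# Goal-conditioned (target-driven) tabular learning in a 1-D corridor:
# Q is keyed by (state, goal, action), so new goals reuse the same table.
import random

n_cells = 8
actions = [-1, +1]                                   # move left / right
Q = {}                                               # keyed by (state, goal, action)

def q(s, g, a):
    return Q.get((s, g, a), 0.0)

alpha, gamma, epsilon = 0.2, 0.95, 0.1
for episode in range(3000):
    goal = random.randrange(n_cells)                 # a new goal every episode
    s = random.randrange(n_cells)
    for _ in range(20):
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: q(s, goal, a_))
        s_next = min(max(s + a, 0), n_cells - 1)
        r = 1.0 if s_next == goal else -0.05
        target = r + gamma * max(q(s_next, goal, a_) for a_ in actions)
        Q[(s, goal, a)] = q(s, goal, a) + alpha * (target - q(s, goal, a))
        s = s_next
        if s == goal:
            break
```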


2019 ◽  
Vol 109 (3) ◽  
pp. 493-512 ◽  
Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Reinforcement learning methods rely on rewards provided by the environment that are extrinsic to the agent. However, many real-world scenarios involve sparse or delayed rewards. In such cases, the agent can develop its own intrinsic reward function, called curiosity, to explore its environment in the quest for new skills. We propose a novel end-to-end curiosity mechanism for deep reinforcement learning methods that allows an agent to gradually acquire new skills. Our method scales to high-dimensional problems, avoids the need to directly predict the future, and can perform in sequential decision scenarios. We formulate curiosity as the agent's ability to predict its own knowledge about the task. We base the prediction on the idea of skill learning to incentivize the discovery of new skills and to guide exploration towards promising solutions. To further improve the data efficiency and generalization of the agent, we propose to learn a latent representation of the skills. We evaluate on a variety of sparse-reward tasks in MiniGrid, MuJoCo, and Atari games, comparing an agent augmented with our curiosity reward to state-of-the-art learners. The experimental evaluation shows higher performance than reinforcement learning models that learn only by maximizing extrinsic rewards.
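The following sketch shows a generic prediction-based intrinsic reward: the agent keeps a predictor of a quantity describing its own competence, and the predictor's error becomes a curiosity bonus added to the sparse extrinsic reward. This illustrates prediction-based curiosity in general, not the paper's exact knowledge-prediction mechanism.

```python
# Toy curiosity bonus: the agent predicts its own (updated) value estimates,
# and the prediction error is added to a sparse extrinsic reward.
import random

V = [0.0] * 10                 # value estimates over 10 states
pred_V = [0.0] * 10            # the agent's prediction of its own future values
alpha, beta = 0.1, 0.5         # learning rate and curiosity weight

def step(state, extrinsic_reward, next_state):
    td_target = extrinsic_reward + 0.9 * V[next_state]
    new_value = V[state] + alpha * (td_target - V[state])
    curiosity = abs(new_value - pred_V[state])              # how wrong the self-prediction was
    pred_V[state] += alpha * (new_value - pred_V[state])     # update the self-prediction
    V[state] = new_value
    return extrinsic_reward + beta * curiosity               # total training reward

for _ in range(1000):
    s = random.randrange(9)
    r = 1.0 if s == 8 else 0.0                                # sparse extrinsic reward
    step(s, r, s + 1)
```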


Author(s):  
Qianlong Liu ◽  
Baoliang Cui ◽  
Zhongyu Wei ◽  
Baolin Peng ◽  
Haikuan Huang ◽  
...  

Interactive search, where a set of tags is recommended to users together with search results at each turn, is an effective way to guide users to identify their information need. It is a classical sequential decision problem, and a reinforcement learning based agent can be introduced as a solution. The training of the agent can be divided into two stages, i.e., offline and online. Existing reinforcement learning based systems tend to perform the offline training in a supervised way based on historical labeled data, while the online training is performed via reinforcement learning algorithms based on interactions with real users. The mismatch between online and offline training leads to a cold-start problem for the online usage of the agent. To address this issue, we propose to employ a simulator to mimic the environment for the offline training of the agent. Users' profiles are taken into account to build a personalized simulator; in addition, a model-based approach is used to train the simulator, which makes efficient use of the data. Experimental results based on a real-world dataset demonstrate the effectiveness of our agent and personalized simulator.
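Schematically, offline pre-training against a personalised simulator might look like the sketch below, where the simulator is fit to a user profile and stands in for real users while the tag-recommendation agent is trained; the profile model and click behaviour are simplified placeholders, not the paper's model-based simulator.

```python
# Offline pre-training against a toy personalised user simulator: the simulator
# answers tag recommendations with simulated clicks, and the agent estimates
# per-tag click-through before any online interaction with real users.
import random

class PersonalizedSimulator:
    def __init__(self, user_profile):
        self.prefs = user_profile             # e.g. {"shoes": 0.8, "bags": 0.2}

    def click_probability(self, tag):
        return self.prefs.get(tag, 0.05)

    def respond(self, recommended_tags):
        """Return a simulated click (reward 1) or no click (reward 0)."""
        return max(int(random.random() < self.click_probability(t)) for t in recommended_tags)

tags = ["shoes", "bags", "watches"]
sim = PersonalizedSimulator({"shoes": 0.8, "bags": 0.2})
scores = {t: 0.0 for t in tags}
for _ in range(2000):                         # offline training loop against the simulator
    t = random.choice(tags)
    reward = sim.respond([t])
    scores[t] += 0.05 * (reward - scores[t])  # incremental click-through estimate
print(scores)                                 # the agent would later be fine-tuned online
```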


2020 ◽  
Vol 117 (48) ◽  
pp. 30079-30087 ◽  
Author(s):  
André Barreto ◽  
Shaobo Hou ◽  
Diana Borsa ◽  
David Silver ◽  
Doina Precup

The combination of reinforcement learning with deep learning is a promising approach to tackle important sequential decision-making problems that are currently intractable. One obstacle to overcome is the amount of data needed by learning systems of this type. In this article, we propose to address this issue through a divide-and-conquer approach. We argue that complex decision problems can be naturally decomposed into multiple tasks that unfold in sequence or in parallel. By associating each task with a reward function, this problem decomposition can be seamlessly accommodated within the standard reinforcement-learning formalism. The specific way we do so is through a generalization of two fundamental operations in reinforcement learning: policy improvement and policy evaluation. The generalized versions of these operations allow one to leverage the solution of some tasks to speed up the solution of others. If the reward function of a task can be well approximated as a linear combination of the reward functions of tasks previously solved, we can reduce a reinforcement-learning problem to a simpler linear regression. When this is not the case, the agent can still exploit the task solutions by using them to interact with and learn about the environment. Both strategies considerably reduce the amount of data needed to solve a reinforcement-learning problem.
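The two operations can be sketched compactly: first express a new task's reward as a linear combination of previously solved tasks' rewards by least squares, then apply generalised policy improvement, acting with the best of the old policies' value functions re-weighted for the new task. The rewards and value tables below are synthetic toys, illustrating the idea rather than the article's full construction.

```python
# (1) Reduce the new task to linear regression over old tasks' rewards.
# (2) Generalised policy improvement over successor-feature value tables.
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_old_tasks = 100, 3

# Rewards observed on the same (state, action) samples under old and new tasks.
R_old = rng.normal(size=(n_samples, n_old_tasks))
w_true = np.array([0.5, -1.0, 2.0])
r_new = R_old @ w_true + 0.01 * rng.normal(size=n_samples)

# (1) Least-squares weights expressing the new reward in terms of old rewards.
w, *_ = np.linalg.lstsq(R_old, r_new, rcond=None)

# (2) psi[i, s, a] holds policy i's expected discounted task-feature sums;
# re-weighting with w evaluates every old policy on the new task.
n_states, n_actions = 5, 2
psi = rng.normal(size=(n_old_tasks, n_states, n_actions, n_old_tasks))
Q_new = np.einsum("isaf,f->isa", psi, w)            # each old policy on the new task
greedy_action = Q_new.max(axis=0).argmax(axis=1)    # GPI: act with the best old policy
print("weights:", np.round(w, 2), "actions per state:", greedy_action)
```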

