Teaching AI Agents Ethical Values Using Reinforcement Learning and Policy Orchestration

Author(s):  
Ritesh Noothigattu ◽  
Djallel Bouneffouf ◽  
Nicholas Mattei ◽  
Rachita Chandra ◽  
Piyush Madan ◽  
...  

Autonomous cyber-physical agents play an increasingly large role in our lives. To ensure that they behave in ways aligned with the values of society, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations and reinforcement learning to learn to maximize environmental rewards. A contextual bandit-based orchestrator then picks between the two policies: constraint-based and environment reward-based. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent about which policy is being employed at each time step. We test our algorithms using Pac-Man and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.
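As a rough illustration of the orchestration idea in the abstract above, the sketch below uses an epsilon-greedy linear contextual bandit to choose between two policies at each time step. Epsilon-greedy is an assumption made for brevity here, not necessarily the authors' bandit algorithm. The class name PolicyOrchestrator, the arm labels, the two-feature context, and the toy reward in train_demo are all hypothetical.

```python
import random


class PolicyOrchestrator:
    """Epsilon-greedy contextual bandit over two hypothetical policies."""

    ARMS = ("constrained", "reward_maximizing")

    def __init__(self, n_features, epsilon=0.1, lr=0.1, seed=0):
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        # One linear reward model per arm (an assumed, simple choice).
        self.weights = {arm: [0.0] * n_features for arm in self.ARMS}

    def predict(self, arm, context):
        """Estimated reward of following `arm` in this context."""
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def choose(self, context):
        """Return the name of the policy to follow, so the choice can be logged."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.ARMS)
        return max(self.ARMS, key=lambda arm: self.predict(arm, context))

    def update(self, arm, context, reward):
        """One stochastic-gradient step on the squared prediction error."""
        error = reward - self.predict(arm, context)
        self.weights[arm] = [
            w + self.lr * error * x for w, x in zip(self.weights[arm], context)
        ]


def train_demo(steps=500, seed=0):
    """Toy environment: each of two contexts favors a different policy."""
    orchestrator = PolicyOrchestrator(n_features=2, seed=seed)
    env_rng = random.Random(seed + 1)
    log = []  # which policy was used at each step
    for _ in range(steps):
        context = env_rng.choice([[1.0, 0.0], [0.0, 1.0]])
        arm = orchestrator.choose(context)
        best_arm = "constrained" if context[0] == 1.0 else "reward_maximizing"
        reward = 1.0 if arm == best_arm else 0.0
        orchestrator.update(arm, context, reward)
        log.append(arm)
    return orchestrator, log
```

Because choose returns the name of the selected policy, the per-step log records which policy was in control, which mirrors the transparency property the abstract describes.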

2017 ◽  
Vol 137 (4) ◽  
pp. 667-673
Author(s):  
Shinji Tomita ◽  
Fumiya Hamatsu ◽  
Tomoki Hamagami

2018 ◽  
Vol 51 (18) ◽  
pp. 31-36 ◽  
Author(s):  
Yuan Wang ◽  
Kirubakaran Velswamy ◽  
Biao Huang

2021 ◽  
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.


2021 ◽  
Author(s):  
Amarildo Likmeta ◽  
Alberto Maria Metelli ◽  
Giorgia Ramponi ◽  
Andrea Tirinzoni ◽  
Matteo Giuliani ◽  
...  

In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and we present three application scenarios: (1) the high-level decision-making problem in the highway driving scenario, (2) inferring user preferences in a social network (Twitter), and (3) the management of the water release from Lake Como. For each of these scenarios, we provide formalization, experiments and a discussion to interpret the obtained results.


Algorithms ◽  
2021 ◽  
Vol 14 (1) ◽  
pp. 26
Author(s):  
Yiran Xue ◽  
Rui Wu ◽  
Jiafeng Liu ◽  
Xianglong Tang

Existing crowd evacuation guidance systems require the manual design of models and input parameters, incurring a significant workload and a potential for errors. This paper proposes an end-to-end intelligent evacuation guidance method based on deep reinforcement learning, and designs an interactive simulation environment based on the social force model. The agent can automatically learn a scene model and path planning strategy with only scene images as input, and directly output dynamic signage information. To address the curse of dimensionality that the deep Q network (DQN) algorithm faces in crowd evacuation, this paper proposes a combined action-space DQN (CA-DQN) algorithm that groups Q network output layer nodes according to action dimensions, which significantly reduces the network complexity and improves system practicality in complex scenes. The evacuation guidance system is defined as a reinforcement learning agent and implemented by the CA-DQN method, which provides a novel approach to the evacuation guidance problem. The experiments demonstrate that the proposed method is superior to the static guidance method, and on par with the manually designed model method.
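One way to read the grouping of output nodes by action dimension is sketched below. It compares the output-layer size of a flat DQN, which needs one node per joint action, with a grouped layout that needs one node per option in each dimension, and then picks a combined action by taking the best option within each group. This is an interpretation of the abstract, not the paper's implementation; the function names and the example of 6 signs with 4 directions each are hypothetical.

```python
from math import prod


def flat_output_size(options_per_dim):
    """Output nodes if every joint action gets its own Q-value."""
    return prod(options_per_dim)


def grouped_output_size(options_per_dim):
    """Output nodes if each action dimension gets its own group of Q-values."""
    return sum(options_per_dim)


def select_combined_action(q_values, options_per_dim):
    """Pick the highest-valued option inside each group of output nodes."""
    if len(q_values) != sum(options_per_dim):
        raise ValueError("q_values length must equal the grouped output size")
    action = []
    start = 0
    for n_options in options_per_dim:
        group = q_values[start:start + n_options]
        action.append(max(range(n_options), key=group.__getitem__))
        start += n_options
    return tuple(action)


# Hypothetical scene: 6 dynamic signs, each with 4 possible directions.
signs = [4] * 6
sizes = (flat_output_size(signs), grouped_output_size(signs))
```

Under these assumed numbers the flat layout needs 4096 output nodes while the grouped layout needs 24, which is the kind of reduction in network complexity the abstract refers to.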


2021 ◽  
Vol 9 (7) ◽  
pp. 767
Author(s):  
Shin-Pyo Choi ◽  
Jae-Ung Lee ◽  
Jun-Bum Park

The enlargement of ships has increased the relative hull deformation owing to draft changes. Moreover, design changes such as an increased propeller diameter and pitch changes have occurred to compensate for the reduction in the engine revolution and consequent ship speed. In terms of propulsion shaft alignment, as the load of the stern tube support bearing increases, an uneven load distribution occurs between the shaft support bearings, leading to stern accidents. To prevent such accidents and to ensure shaft system stability, a shaft system design technique is required in which the shaft deformation resulting from the hull deformation is considered. Based on the measurement data of a medium-sized oil/chemical tanker, this study presents a novel approach to predicting the shaft deformation following stern hull deformation through inverse analysis using deep reinforcement learning, as opposed to traditional prediction techniques. The main bearing reaction force, which was difficult to reflect in previous studies, was predicted with high accuracy by comparing it with the measured value, and reasonable shaft deformation could be derived according to the hull deformation. The deep reinforcement learning technique in this study is expected to be expandable for predicting the dynamic behavior of the shaft of an operating vessel.

