New Approach in Human-AI Interaction by Reinforcement-Imitation Learning

2021 ◽  
Vol 11 (7) ◽  
pp. 3068
Author(s):  
Neda Navidi ◽  
Rene Landry

Reinforcement Learning (RL) provides effective results with an agent learning from a stand-alone reward function. However, it presents unique challenges when the environment has large state and action spaces, as well as in the determination of rewards. Imitation Learning (IL) offers a promising solution to those challenges by using a teacher. In IL, the learning process can take advantage of human-sourced assistance and/or control over the agent and environment. A human teacher and an agent learner are considered in this study. The teacher takes part in the agent's training towards dealing with the environment, tackling a specific objective, and achieving a predefined goal. This paper proposes a novel approach combining IL with different types of RL methods, namely state-action-reward-state-action (SARSA) and Asynchronous Advantage Actor-Critic (A3C) agents, to overcome the problems of both stand-alone systems. It addresses how to effectively leverage the teacher's feedback, whether direct binary or indirect detailed feedback, so that the agent learner can learn sequential decision-making policies. The results of this study on various OpenAI Gym environments show that the algorithmic method can be used in different combinations and significantly decreases both the human effort and the tedious exploration process.
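A minimal sketch of how a teacher's binary feedback might be folded into a SARSA update as a shaping term, in the spirit of the combination described above. The environment choice, the `teacher_feedback` stub, and all hyperparameters are illustrative assumptions rather than the authors' implementation, and the code assumes the Gymnasium-style reset/step API.

```python
import numpy as np
import gymnasium as gym  # assumption: Gymnasium-style API; the paper used OpenAI Gym environments

def teacher_feedback(state, action):
    """Stub for direct binary teacher feedback (+1 approve / -1 disapprove)."""
    return 0.0  # a real teacher would be a human or a scripted oracle

env = gym.make("FrozenLake-v1")
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps, beta = 0.1, 0.99, 0.1, 0.5  # beta weights the teacher signal

def eps_greedy(s):
    return env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())

for episode in range(500):
    s, _ = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a2 = eps_greedy(s2)
        # Blend the environment reward with the teacher's feedback before the
        # standard SARSA temporal-difference update.
        shaped_r = r + beta * teacher_feedback(s, a)
        Q[s, a] += alpha * (shaped_r + gamma * Q[s2, a2] * (not done) - Q[s, a])
        s, a = s2, a2
```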

Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1742
Author(s):  
Tianhong Dai ◽  
Hengyan Liu ◽  
Anil Anthony Bharath

Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state–action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode during the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in continuous control environments with sparse rewards. In experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving comparable performance to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. With the capability of solving sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems that have continuous action spaces, such as robot guidance and manipulation.
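As a rough illustration of the trajectory selection idea, the sketch below keeps only hindsight-relabelled episodes whose return beats the agent's recent average and returns their state-action pairs for the imitation loss. The `relabel_with_hindsight` hook and the thresholding rule are assumptions for illustration, not the paper's exact algorithm.

```python
from collections import deque
import numpy as np

recent_returns = deque(maxlen=100)  # running record of recent episode returns

def select_episodes(episodes, relabel_with_hindsight):
    """episodes: list of episodes, each a list of (state, action, reward) tuples."""
    selected = []
    for ep in episodes:
        ep = relabel_with_hindsight(ep)          # replace desired goals with achieved ones
        ep_return = sum(r for _, _, r in ep)
        baseline = np.mean(recent_returns) if recent_returns else -np.inf
        recent_returns.append(ep_return)
        if ep_return > baseline:                 # filter uninformative episodes
            selected.extend((s, a) for s, a, _ in ep)
    return selected                              # (state, action) pairs fed to the imitation loss
```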


2018 ◽  
Vol 8 (12) ◽  
pp. 2453 ◽  
Author(s):  
Christian Arzate Cruz ◽  
Jorge Ramirez Uresti

The creation of believable behaviors for Non-Player Characters (NPCs) is key to improving the players' experience while playing a game. To achieve this objective, we need to design NPCs that appear to be controlled by a human player. In this paper, we propose a hierarchical reinforcement learning framework for believable bots (HRLB^2). This novel approach has been designed to overcome two main challenges currently faced in the creation of human-like NPCs. The first difficulty is exploring domains with high-dimensional state–action spaces while satisfying constraints imposed by traits that characterize human-like behavior. The second is generating behavior diversity, by also adapting to the opponent's playing style. We evaluated the effectiveness of our framework in the domain of the 2D fighting game Street Fighter IV. The results of our tests demonstrate that our bot behaves in a human-like manner.
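To make the hierarchical idea concrete, here is a toy two-level policy: a high-level policy picks a behaviour mode adapted to the opponent's observed style, and a low-level sub-policy picks the frame-level action. The mode names, the opponent-style feature, and the per-mode Q-tables are illustrative assumptions, not the HRLB^2 design.

```python
import random

MODES = ["aggressive", "defensive", "zoning"]

def high_level_policy(game_state, opponent_style):
    # Adapt the behaviour mode to the opponent's observed playing style.
    if opponent_style == "rushdown":
        return "defensive"
    if opponent_style == "passive":
        return "aggressive"
    return random.choice(MODES)   # behaviour diversity when the style is unclear

def low_level_policy(game_state, mode, q_tables):
    # Each mode owns its own learned sub-policy (here, a per-mode Q-table
    # mapping game_state -> {action: value}).
    q = q_tables[mode]
    return max(q[game_state], key=q[game_state].get)
```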


Author(s):  
Bo Yang ◽  
Min Liu

Effective collaboration among autonomous unmanned aerial vehicles (UAVs) relies on timely information sharing. However, the time-varying flight environment and the intermittent link connectivity pose great challenges to message delivery. In this paper, we leverage deep reinforcement learning (DRL) to address the UAVs' optimal link discovery and selection problem in uncertain environments. As multi-agent learning efficiency is constrained by the high-dimensional and continuous action spaces, we slice the whole action space into a number of tractable fractions to achieve efficient convergence to optimal policies in continuous domains. Moreover, for the nonstationarity issue that particularly challenges multi-agent DRL with local perceptions, we present a multi-agent mutual sampling method that jointly combines intra-agent and inter-agent state-action information to stabilize and expedite the training procedure. We evaluate the proposed algorithm on the UAVs' continuous network connection task. Results show that the associated UAVs can quickly select the optimal connected links, which significantly facilitates the UAVs' teamwork.
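The action-space slicing can be pictured as follows: each continuous action dimension is partitioned into tractable sub-intervals, a discrete policy first picks a slice, and a continuous policy then acts inside it. The slice count and the `slice_policies` interface below are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def slice_action_space(low, high, n_slices):
    """Partition a continuous action range [low, high] into equal sub-intervals."""
    edges = np.linspace(low, high, n_slices + 1)
    return [(edges[i], edges[i + 1]) for i in range(n_slices)]

def select_action(slice_policies, state):
    # A discrete choice first picks the most promising slice, then the
    # continuous policy attached to that slice acts inside it, so each
    # sub-problem stays low-dimensional.
    slice_id = max(range(len(slice_policies)),
                   key=lambda i: slice_policies[i].value(state))
    return slice_policies[slice_id].act(state)
```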


Author(s):  
Feng Pan ◽  
Hong Bao

This paper proposes a new approach that uses reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We draw on the idea of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into a reward vector, and the reward function was defined as the inner product of the reward vector and a weight vector. Driving data from human drivers were collected and analyzed to obtain the true reward function. The RL model was trained with a deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model could continuously approach that of a human driver. After dozens of rounds of training, we selected the policy whose value vector was closest to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance: the agent followed the preceding vehicle safely and smoothly.
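A minimal sketch of the reward construction described above: the factors to be weighed in vehicle following are vectorized into a feature (reward) vector, and the reward is its inner product with a weight vector. The specific features, the 30 m desired headway, and the weight values are illustrative assumptions.

```python
import numpy as np

def reward_features(gap, rel_speed, ego_accel, jerk):
    # Factors a following policy should trade off: keep a desired gap,
    # match the lead vehicle's speed, and drive smoothly.
    return np.array([
        -abs(gap - 30.0),      # deviation from an assumed 30 m desired headway
        -abs(rel_speed),       # speed difference to the lead vehicle
        -abs(ego_accel),       # comfort: small accelerations
        -abs(jerk),            # comfort: small jerk
    ])

weights = np.array([0.4, 0.3, 0.2, 0.1])   # weight vector tuned toward the human value vector

def reward(gap, rel_speed, ego_accel, jerk):
    # Reward defined as the inner product of the feature vector and the weights.
    return float(weights @ reward_features(gap, rel_speed, ego_accel, jerk))
```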


2021 ◽  
Author(s):  
Dan Ye

To fulfill the various enhanced requirements of next-generation wireless access, 5G cellular networks will drive towards higher energy efficiency, lower latency, and more reliable wireless networking. The key contributions of this work can be summarized as follows: (1) it proposes frame-based max-weight scheduling (FMWS) with reconfiguration delay, which, in combination with a round-robin algorithm, can dynamically control throughput and delay; the frame-based dynamic control (FBDC) policy is applicable to 5G cellular network control systems at the data link layer and provides a new framework for developing throughput-optimal network control policies using state-action frequencies. (2) It proposes a novel MAC-layer approach, Virtual Multichannels Parallelism Carrier Sense Multiple Access (VMCP-CSMA), which can compute a set of TDDM schedules for multiple channels at once rather than computing one schedule at a time and constantly switching or recomputing schedules. (3) It proposes a novel criticality-aware scheduler prioritization within the VMCP-CSMA policy, which can reorder a set of TDDM schedules based on max-weight scheduling with reconfiguration delay according to different application requirements. It can achieve high throughput and low delay with low complexity compared with other scheduling schemes.
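A toy sketch of the frame-based max-weight idea in contribution (1): the schedule is recomputed only once per frame to amortize the reconfiguration delay, the link with the largest backlog-times-rate weight is served for the whole frame, and ties are broken round-robin. The function signature and the frame accounting are illustrative assumptions, not the FMWS specification.

```python
def fmws_schedule(queues, rates, frame_len, reconfig_delay, rr_pointer):
    """queues: per-link backlogs; rates: per-link service rates; rr_pointer: round-robin counter."""
    # Max-weight rule: weight each link by backlog x service rate.
    weights = [q * r for q, r in zip(queues, rates)]
    best = max(weights)
    candidates = [i for i, w in enumerate(weights) if w == best]
    # Round-robin among equally weighted links keeps the schedule fair.
    link = candidates[rr_pointer % len(candidates)]
    # The chosen link is held for a whole frame; only frame_len - reconfig_delay
    # slots actually carry traffic because of the reconfiguration overhead.
    effective_slots = max(frame_len - reconfig_delay, 0)
    return link, effective_slots
```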


Author(s):  
Hong Son Vu ◽  
Kien Truong ◽  
Minh Thuy Le

Massive multiple-input multiple-output (MIMO) systems are considered a promising solution to minimize multiuser interference (MUI) using simple precoding techniques with a massive antenna array at a base station (BS). This paper presents a novel beam division multiple access (BDMA) approach in which the BS transmits signals to multiple users at the same time via different beams, based on hybrid beamforming and user-beam scheduling. By selecting users whose steering vectors are orthogonal to each other, interference between users is significantly mitigated. While the spectral efficiency of the proposed scheme approaches that of fully digital solutions, the multiuser interference is considerably reduced.
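The user-selection step can be sketched as a simple greedy filter: a user is admitted only if its normalized steering vector is nearly orthogonal to those of the users already scheduled on other beams. The correlation threshold and the greedy ordering are illustrative assumptions.

```python
import numpy as np

def select_orthogonal_users(steering_vectors, max_users, corr_threshold=0.1):
    """steering_vectors: (n_users, n_antennas) complex array of per-user steering vectors."""
    selected = []
    for u, a_u in enumerate(steering_vectors):
        a_u = a_u / np.linalg.norm(a_u)
        # Admit the user only if its beam is almost orthogonal to every user
        # already selected (low cross-correlation means low mutual interference).
        if all(abs(np.vdot(a_u, steering_vectors[v] / np.linalg.norm(steering_vectors[v])))
               < corr_threshold for v in selected):
            selected.append(u)
        if len(selected) == max_users:
            break
    return selected
```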


2018 ◽  
Vol 37 (13-14) ◽  
pp. 1632-1672 ◽  
Author(s):  
Sanjiban Choudhury ◽  
Mohak Bhardwaj ◽  
Sankalp Arora ◽  
Ashish Kapoor ◽  
Gireeja Ranade ◽  
...  

Robot planning is the process of selecting a sequence of actions that optimize for a task-specific objective. For instance, the objective for a navigation task would be to find collision-free paths, whereas the objective for an exploration task would be to map unknown areas. The optimal solutions to such tasks are heavily influenced by the implicit structure in the environment, i.e. the configuration of objects in the world. State-of-the-art planning approaches, however, do not exploit this structure, thereby expending valuable effort searching the action space instead of focusing on potentially good actions. In this paper, we address the problem of enabling planners to adapt their search strategies by inferring such good actions in an efficient manner using only the information uncovered by the search up until that time. We formulate this as a problem of sequential decision making under uncertainty where at a given iteration a planning policy must map the state of the search to a planning action. Unfortunately, the training process for such partial-information-based policies is slow to converge and susceptible to poor local minima. Our key insight is that if we could fully observe the underlying world map, we would easily be able to disambiguate between good and bad actions. We hence present a novel data-driven imitation learning framework to efficiently train planning policies by imitating a clairvoyant oracle: an oracle that at train time has full knowledge about the world map and can compute optimal decisions. We leverage the fact that for planning problems, such oracles can be efficiently computed and derive performance guarantees for the learnt policy. We examine two important domains that rely on partial-information-based policies: informative path planning and search-based motion planning. We validate the approach on a spectrum of environments for both problem domains, including experiments on a real UAV, and show that the learnt policy consistently outperforms state-of-the-art algorithms. Our framework is able to train policies that achieve up to [Formula: see text] more reward than state-of-the-art information-gathering heuristics and a [Formula: see text] speedup as compared with A* on search-based planning problems. Our approach paves the way forward for applying data-driven techniques to other such problem domains under the umbrella of robot planning.
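A compact sketch of the clairvoyant-oracle imitation loop, in the DAgger-style spirit the paper builds on: at training time the oracle sees the full world map and labels the optimal planning action, while the learner is trained to predict that action from the partial search state it will actually observe at test time. The `oracle`, `learner`, and search-state interfaces are assumptions for illustration.

```python
def train_by_oracle_imitation(worlds, oracle, learner, n_iters):
    """worlds: training environments with fully known maps (available only at train time)."""
    dataset = []
    for _ in range(n_iters):
        for world in worlds:
            search_state = world.initial_search_state()
            while not search_state.done():
                # The clairvoyant oracle computes the optimal action from the full
                # map; the learner must predict it from partial information only.
                dataset.append((search_state.features(),
                                oracle.best_action(world, search_state)))
                # Roll out with the learner's own action so training visits the
                # states the policy will actually encounter at test time.
                search_state = search_state.step(learner.act(search_state.features()))
        learner.fit(dataset)   # supervised imitation of the oracle's labels
    return learner
```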


Author(s):  
Daniel S. Brown ◽  
Scott Niekum

Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization. However, despite much recent interest in IRL, little work has been done to understand the minimum set of demonstrations needed to teach a specific sequential decision-making task. We formalize the problem of finding maximally informative demonstrations for IRL as a machine teaching problem where the goal is to find the minimum number of demonstrations needed to specify the reward equivalence class of the demonstrator. We extend previous work on algorithmic teaching for sequential decision-making tasks by showing a reduction to the set cover problem which enables an efficient approximation algorithm for determining the set of maximally informative demonstrations. We apply our proposed machine teaching algorithm to two novel applications: providing a lower bound on the number of queries needed to learn a policy using active IRL and developing a novel IRL algorithm that can learn more efficiently from informative demonstrations than a standard IRL approach.
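The set-cover reduction suggests a greedy approximation along these lines: each candidate demonstration covers a set of constraints on the reward weights, and demonstrations are chosen greedily until the demonstrator's reward equivalence class is pinned down. The `constraints_of` extraction step is an illustrative assumption.

```python
def greedy_teaching_set(candidate_demos, constraints_of):
    """candidate_demos: list of demonstrations; constraints_of(demo) -> set of reward constraints."""
    universe = set().union(*(constraints_of(d) for d in candidate_demos))
    covered, chosen = set(), []
    while covered != universe:
        # Pick the demonstration covering the most still-uncovered constraints
        # (the standard greedy set-cover approximation).
        best = max(candidate_demos, key=lambda d: len(constraints_of(d) - covered))
        gain = constraints_of(best) - covered
        if not gain:
            break          # remaining constraints cannot be covered
        chosen.append(best)
        covered |= gain
    return chosen
```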


2001 ◽  
Vol 15 (4) ◽  
pp. 557-564 ◽  
Author(s):  
Rolando Cavazos-Cadena ◽  
Raúl Montes-de-Oca

This article concerns Markov decision chains with finite state and action spaces, where a control policy is graded via the expected total-reward criterion associated with a nonnegative reward function. Within this framework, a classical theorem guarantees the existence of an optimal stationary policy whenever the optimal value function is finite, a result that is obtained via a limit process using the discounted criterion. The objective of this article is to present an alternative approach, based entirely on the properties of the expected total-reward index, to establish such an existence result.
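For reference, the expected total-reward criterion discussed above can be written in standard MDP notation (generic symbols, not necessarily the article's own):

```latex
V(\pi, x) \;=\; \mathbb{E}^{\pi}_{x}\!\left[\sum_{t=0}^{\infty} R(X_t, A_t)\right],
\qquad
V^{*}(x) \;=\; \sup_{\pi} V(\pi, x),
```

with a nonnegative reward function R >= 0; the classical result asserts that an optimal stationary policy exists whenever V*(x) is finite for every state x.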

