An adaptive deep reinforcement learning framework enables curling robots with human-like performance in real-world conditions

2020 ◽  
Vol 5 (46) ◽  
pp. eabb9764
Author(s):  
Dong-Ok Won ◽  
Klaus-Robert Müller ◽  
Seong-Whan Lee

The game of curling can be considered a good test bed for studying the interaction between artificial intelligence systems and the real world. In curling, the environmental characteristics change at every moment, and every throw has an impact on the outcome of the match. Furthermore, there is no time for relearning during a curling match due to the timing rules of the game. Here, we report a curling robot that can achieve human-level performance in the game of curling using an adaptive deep reinforcement learning framework. Our proposed adaptation framework extends standard deep reinforcement learning using temporal features, which learn to compensate for the uncertainties and nonstationarities that are an unavoidable part of curling. Our curling robot, Curly, was able to win three of four official matches against expert human teams [top-ranked women’s curling teams and Korea national wheelchair curling team (reserve team)]. These results indicate that the gap between physics-based simulators and the real world can be narrowed.
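
A rough, hypothetical sketch of the adaptation idea (not the authors' implementation): a throw policy can be conditioned on temporal features, represented here as the deviations of the last few throws, so it compensates for drifting ice conditions without retraining. All names and dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

class AdaptiveThrowPolicy(nn.Module):
    """Policy conditioned on a short history of throw deviations (hypothetical)."""
    def __init__(self, state_dim=32, history_len=4, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + history_len, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),  # e.g. release velocity, angle, spin
        )

    def forward(self, state, recent_deviations):
        # temporal features: observed-minus-planned deviations of recent throws
        x = torch.cat([state, recent_deviations], dim=-1)
        return self.net(x)

policy = AdaptiveThrowPolicy()
action = policy(torch.randn(1, 32), torch.randn(1, 4))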

2021 ◽  
pp. 027836492098785
Author(s):  
Julian Ibarz ◽  
Jie Tan ◽  
Chelsea Finn ◽  
Mrinal Kalakrishnan ◽  
Peter Pastor ◽  
...  

Deep reinforcement learning (RL) has emerged as a promising approach for autonomously acquiring complex behaviors from low-level sensor observations. Although a large portion of deep RL research has focused on applications in video games and simulated control, which do not reflect the constraints of learning in real environments, deep RL has also demonstrated promise in enabling physical robots to learn complex skills in the real world. At the same time, real-world robotics provides an appealing domain for evaluating such algorithms, as it connects directly to how humans learn: as embodied agents in the real world. Learning to perceive and move in the real world presents numerous challenges, some of which are easier to address than others, and some of which are often not considered in RL research that focuses only on simulated domains. In this review article, we present a number of case studies involving robotic deep RL. Building on these case studies, we discuss commonly perceived challenges in deep RL and how they have been addressed in these works. We also provide an overview of other outstanding challenges, many of which are unique to the real-world robotics setting and are not often the focus of mainstream RL research. Our goal is to provide a resource for both roboticists and machine learning researchers who are interested in furthering the progress of deep RL in the real world.


2021 ◽  
Vol 12 (6) ◽  
pp. 1-23
Author(s):  
Shuo Tao ◽  
Jingang Jiang ◽  
Defu Lian ◽  
Kai Zheng ◽  
Enhong Chen

Mobility prediction plays an important role in a wide range of location-based applications and services. However, there are three problems in the existing literature: (1) explicit high-order interactions of spatio-temporal features are not systematically modeled; (2) most existing algorithms place attention mechanisms on top of a recurrent network, so they cannot be fully parallelized and are inferior to self-attention at capturing long-range dependencies; (3) most of the literature does not make good use of long-term historical information and does not effectively model the long-term periodicity of users. To this end, we propose MoveNet and RLMoveNet. MoveNet is a self-attention-based sequential model that predicts each user's next destination based on her most recent visits and historical trajectory. MoveNet first introduces a cross-based learning framework for modeling feature interactions. With self-attention over both the most recent visits and the historical trajectory, MoveNet can capture the user's long-term regularity in a more efficient way. Building on MoveNet, we add a reinforcement learning layer to model long-term periodicity more effectively and name the resulting model RLMoveNet. RLMoveNet treats human mobility prediction as a reinforcement learning problem, using the reinforcement learning layer as a regularizer that drives the model to attend to periodic behavior, which makes the algorithm more effective. We evaluate both models on three real-world mobility datasets. MoveNet outperforms the state-of-the-art mobility predictor by around 10% in terms of accuracy, and simultaneously achieves faster convergence and over 4x training speedup. Moreover, RLMoveNet achieves higher prediction accuracy than MoveNet, which shows that explicitly modeling periodicity from the perspective of reinforcement learning is more effective.
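
As a hedged illustration of the self-attention component only (not the released MoveNet code; the model and parameter names below are assumptions), the sketch scores candidate next locations from a user's recent check-in sequence.

import torch
import torch.nn as nn

class NextLocationModel(nn.Module):
    """Self-attention over a visit sequence, scoring the next location (illustrative)."""
    def __init__(self, num_locations=10000, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(num_locations, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, num_locations)

    def forward(self, visit_ids):
        h = self.encoder(self.embed(visit_ids))  # (batch, seq, dim)
        return self.out(h[:, -1])                # logits over the next location

model = NextLocationModel()
recent_visits = torch.randint(0, 10000, (8, 20))  # batch of 20-step trajectories
logits = model(recent_visits)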


Mathematics ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1479
Author(s):  
Francisco Martinez-Gil ◽  
Miguel Lozano ◽  
Ignacio García-Fernández ◽  
Pau Romero ◽  
Dolors Serra ◽  
...  

Reinforcement learning is one of the most promising machine learning techniques for obtaining intelligent behaviors for embodied agents in simulations. The output of the classic Temporal Difference family of reinforcement learning algorithms takes the form of a value function expressed as a numeric table or a function approximator. The learned behavior is then derived using a greedy policy with respect to this value function. Nevertheless, sometimes the learned policy does not meet expectations, and authoring it is difficult and unsafe because modifying one value or parameter in the learned value function has unpredictable consequences in the space of policies it represents. This makes direct manipulation of the learned value function unsuitable as a method for modifying the derived behaviors. In this paper, we propose the use of Inverse Reinforcement Learning to incorporate real behavior traces into the learning process in order to shape the learned behaviors, thus increasing their trustworthiness (in terms of conformance to reality). To do so, we adapt the Inverse Reinforcement Learning framework to the navigation problem domain. Specifically, we use Soft Q-learning, an algorithm based on the maximum causal entropy principle, with MARL-Ped (a reinforcement learning-based pedestrian simulator) to include information from trajectories of real pedestrians in the process of learning how to navigate inside a virtual 3D space that represents the real environment. A comparison with the behaviors learned using a classic reinforcement learning algorithm (Sarsa(λ)) shows that the Inverse Reinforcement Learning behaviors adjust significantly better to the real trajectories.
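
A minimal, tabular sketch of the Soft Q-learning backbone (simplified; in the actual inverse RL setting the reward itself is learned from the real trajectories, and all names below are ours): the usual max over next-state actions is replaced by the soft, log-sum-exp value.

import numpy as np

def soft_value(q_row, alpha=1.0):
    # V(s) = alpha * log sum_a exp(Q(s, a) / alpha)
    return alpha * np.log(np.sum(np.exp(q_row / alpha)))

def soft_q_update(Q, s, a, r, s_next, lr=0.1, gamma=0.95, alpha=1.0):
    # soft Bellman backup toward r + gamma * V_soft(s')
    target = r + gamma * soft_value(Q[s_next], alpha)
    Q[s, a] += lr * (target - Q[s, a])

Q = np.zeros((100, 4))              # e.g. 100 grid cells, 4 movement actions
soft_q_update(Q, s=3, a=1, r=-0.1, s_next=4)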


2021 ◽  
Vol 18 (1) ◽  
pp. 172988142199262
Author(s):  
Matej Dobrevski ◽  
Danijel Skočaj

Mobile robots that operate in real-world environments need to be able to safely navigate their surroundings. Obstacle avoidance and path planning are crucial capabilities for achieving autonomy of such systems. However, for new or dynamic environments, navigation methods that rely on an explicit map of the environment can be impractical or even impossible to use. We present a new local navigation method for steering the robot to global goals without relying on an explicit map of the environment. The proposed navigation model is trained in a deep reinforcement learning framework based on the Advantage Actor-Critic method and is able to directly translate robot observations into movement commands. We evaluate and compare the proposed navigation method with standard map-based approaches on several navigation scenarios in simulation and demonstrate that our method is able to navigate the robot even without a map, or when the map is corrupted, in situations where the standard approaches fail. We also show that our method can be directly transferred to a real robot.
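
A minimal sketch of such a map-less navigation network (an assumption about the general architecture, not the authors' exact model): a shared body maps a range-scan observation and the relative goal to a policy over motion commands and a state value, as in advantage actor-critic training.

import torch
import torch.nn as nn

class NavActorCritic(nn.Module):
    """Maps (range scan, relative goal) to action logits and a state value (illustrative)."""
    def __init__(self, scan_dim=360, goal_dim=2, n_actions=5):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(scan_dim + goal_dim, 256), nn.ReLU())
        self.actor = nn.Linear(256, n_actions)  # discrete motion commands
        self.critic = nn.Linear(256, 1)         # state-value estimate

    def forward(self, scan, goal):
        h = self.body(torch.cat([scan, goal], dim=-1))
        return self.actor(h), self.critic(h)

model = NavActorCritic()
logits, value = model(torch.randn(1, 360), torch.randn(1, 2))
action = torch.distributions.Categorical(logits=logits).sample()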


Author(s):  
Masumi Ishikawa

Studies on rule extraction using neural networks have exclusively adopted supervised learning, in which correct outputs are always given as training samples. The real world, however, does not always provide correct answers. We advocate the use of learning with an immediate critic, a simple form of reinforcement learning. It uses an immediate binary reinforcement signal indicating whether or not an output is correct. This, of course, makes learning more difficult and time-consuming than supervised learning. Learning with an immediate critic alone, however, is not powerful enough for extracting rules from data, because a distributed representation emerges just as in backpropagation learning. We propose to combine learning with an immediate critic and structural learning with forgetting (SLF) into structural learning with an immediate critic and forgetting (SLCF). The procedure for extracting rules from data by SLCF is similar to that of SLF. Applications of the proposed method to rule extraction from the lenses dataset demonstrate its effectiveness.
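
A very rough, hypothetical sketch of the combination (the update rule below is an assumption for illustration, not the paper's exact formulation): the immediate binary critic scales the weight update, while a constant "forgetting" decay pushes unneeded weights toward zero so that a sparse, rule-like network remains.

import numpy as np

def slcf_update(w, x, y_pred, reward, lr=0.05, forget=1e-3):
    # reinforce (reward = +1) or weaken (reward = -1) the weights behind y_pred
    w += lr * reward * np.outer(y_pred, x)
    # forgetting: a constant pull toward zero, leaving a skeletal network
    w -= forget * np.sign(w)
    return w

w = np.random.randn(3, 8) * 0.1   # 3 output units, 8 input features
x = np.random.rand(8)
w = slcf_update(w, x, y_pred=w @ x, reward=+1.0)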


Author(s):  
Angela Adrian

Because there is so much money involved in virtual worlds these days, there has been an increase in criminal activity in these worlds as well. The gaming community calls people who promote conflict "griefers". Griefers are people who like nothing better than to kill team-mates or obstruct the game's objectives. Griefers scam, cheat and abuse. Recently, they have begun to set up Ponzi schemes. In games that attempt to encourage complex and enduring interactions among thousands of players, "griefing" has evolved from being an isolated nuisance to a social disease, much in the same way crime has become the real world's social disease. Grief is turning into crime. Some consider virtual worlds to be a game and therefore outside the realm of real law and merely subject to the rules of the game. However, some virtual worlds have become increasingly important as a medium of commerce and a means of communication. In most circumstances the law is reluctant to intrude into the rules of the game, but it will do so if necessary (Lastowka & Hunter, 2004). Criminal law applies in virtual worlds as it does in the real world, but not necessarily in the manner that a player would expect or want. The law looks at the real consequences of actions, not the on-screen representations (Kennedy, 2009).


Author(s):  
Satoshi Kurihara ◽ 
Rikio Onai ◽ 
Toshiharu Sugawara

We propose and evaluate an adaptive reinforcement learning system that integrates both exploitation- and exploration-oriented learning (ArLee). Compared to conventional reinforcement learning, ArLee is more robust in a dynamically changing environment and conducts exploration-oriented learning efficiently even in a large-scale environment. It is thus well suited for autonomous systems, such as software agents and mobile robots, that operate in dynamic, large-scale environments like the real world and the Internet. Simulations demonstrate the learning system's basic effectiveness.


Author(s):  
Jing-Cheng Shi ◽  
Yang Yu ◽  
Qing Da ◽  
Shi-Yong Chen ◽  
An-Xiang Zeng

Applying reinforcement learning in physical-world tasks is extremely challenging. It is commonly infeasible to sample a large number of trials, as required by current reinforcement learning methods, in a physical environment. This paper reports our project on using reinforcement learning for better commodity search in Taobao, one of the largest online retail platforms and, at the same time, a physical environment with a high sampling cost. Instead of training reinforcement learning in Taobao directly, we present our environment-building approach: we build Virtual-Taobao, a simulator learned from historical customer behavior data, and then train policies in Virtual-Taobao with no physical sampling costs. To improve the simulation precision, we propose GAN-SD (GAN for Simulating Distributions) for customer feature generation with a better-matched distribution, and MAIL (Multiagent Adversarial Imitation Learning) for generating customer actions that generalize better. To further avoid overfitting to the imperfections of the simulator, we propose the ANC (Action Norm Constraint) strategy to regularize the policy model. In experiments, Virtual-Taobao is trained from hundreds of millions of real Taobao customers' records. Compared with the real Taobao, Virtual-Taobao faithfully recovers important properties of the real environment. We further show, through online A/B tests, that policies trained purely in Virtual-Taobao, at zero physical sampling cost, can achieve significantly better real-world performance than traditional supervised approaches. We hope this work may shed some light on applying reinforcement learning in complex physical environments.
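
As a hedged sketch of the Action Norm Constraint idea only (the paper's exact form may differ; the function name and coefficient below are assumptions), a norm penalty on the actions is simply added to the policy loss so the policy avoids extreme actions in regions where the simulator is imprecise.

import torch

def policy_loss_with_anc(log_probs, advantages, actions, anc_coef=0.01):
    pg_loss = -(log_probs * advantages).mean()   # standard policy-gradient term
    anc_penalty = actions.norm(dim=-1).mean()    # mean action norm (the constraint)
    return pg_loss + anc_coef * anc_penalty

loss = policy_loss_with_anc(torch.randn(32), torch.randn(32), torch.randn(32, 4))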


2020 ◽  
Vol 34 (04) ◽  
pp. 6672-6679 ◽  
Author(s):  
Deheng Ye ◽  
Zhao Liu ◽  
Mingfei Sun ◽  
Bei Shi ◽  
Peilin Zhao ◽  
...  

We study the reinforcement learning problem of complex action control in Multi-player Online Battle Arena (MOBA) 1v1 games. This problem involves far more complicated state and action spaces than those of traditional 1v1 games, such as Go and Atari, which makes it very difficult to find any policy with human-level performance. In this paper, we present a deep reinforcement learning framework that tackles this problem from the perspectives of both system and algorithm. Our system features low coupling and high scalability, which enables efficient exploration at large scale. Our algorithm includes several novel strategies, including control dependency decoupling, an action mask, target attention, and dual-clip PPO, with which our proposed actor-critic network can be trained effectively in our system. Tested on the MOBA game Honor of Kings, the trained AI agents can defeat top professional human players in full 1v1 games.
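
A brief sketch of the dual-clip PPO objective mentioned above (simplified; variable names are ours): the standard clipped surrogate is additionally bounded by c * A when the advantage A is negative, so a single very off-policy sample cannot dominate the update.

import torch

def dual_clip_ppo_loss(ratio, advantage, eps=0.2, c=3.0):
    # standard PPO clipped surrogate
    surrogate = torch.min(ratio * advantage,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
    # dual clip: when A < 0, bound the surrogate from below by c * A
    dual = torch.where(advantage < 0, torch.max(surrogate, c * advantage), surrogate)
    return -dual.mean()

ratio = torch.exp(torch.randn(64) * 0.1)  # pi_new(a|s) / pi_old(a|s)
loss = dual_clip_ppo_loss(ratio, torch.randn(64))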

