Reinforcement learning of motor skills using Policy Search and human corrective advice

2019
Vol 38 (14)
pp. 1560-1580
Author(s):  
Carlos Celemin ◽  
Guilherme Maeda ◽  
Javier Ruiz-del-Solar ◽  
Jan Peters ◽  
Jens Kober

Robot learning problems are limited by physical constraints, which can make learning successful policies for complex motor skills on real systems infeasible. Some reinforcement learning methods, like Policy Search, offer stable convergence toward locally optimal solutions, whereas interactive machine learning and learning-from-demonstration methods allow fast transfer of human knowledge to the agent; however, most such methods require expert demonstrations. In this work, we propose the use of human corrective advice in the action domain for learning motor trajectories. Additionally, we combine this human feedback with reward functions in a Policy Search learning scheme. Using both sources of information speeds up the learning process, since the intuitive knowledge of the human teacher is easily transferred to the agent, while the Policy Search method with the cost/reward function takes over to supervise the process and reduce the influence of occasional wrong human corrections. This interactive approach has been validated for learning movement primitives with simulated arms with several degrees of freedom in via-point reaching movements, and also with real robots in tasks such as “writing characters” and the ball-in-a-cup game. Compared with standard reinforcement learning without human advice, the results show that the proposed method not only converges to higher rewards when learning movement primitives, but also that learning is sped up by a factor of 4–40, depending on the task.
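
As a rough illustration of the idea, the sketch below combines a toy episodic policy search over basis-function weights with COACH-style binary corrective advice that nudges the weights near a via-point. Everything here (the 1-D trajectory, `reward`, `human_advice`, the step sizes) is an illustrative assumption, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_basis, horizon = 10, 100
centers = np.linspace(0, 1, n_basis)
width = 0.5 * (centers[1] - centers[0])

def rollout(w):
    """Generate a 1-D trajectory from basis-function weights w."""
    t = np.linspace(0, 1, horizon)
    phi = np.exp(-((t[:, None] - centers) ** 2) / (2 * width ** 2))
    phi /= phi.sum(axis=1, keepdims=True)
    return phi @ w

def reward(traj, via_point=(0.5, 1.0)):
    """Toy reward: pass close to a via-point at t = 0.5."""
    idx = int(via_point[0] * (horizon - 1))
    return -abs(traj[idx] - via_point[1])

def human_advice(traj):
    """Stand-in for a human's binary correction ('higher'/'lower')."""
    idx = horizon // 2
    return np.sign(1.0 - traj[idx])  # pretend the teacher wants traj[idx] ~ 1

w = np.zeros(n_basis)
sigma, advice_step = 0.1, 0.05
for episode in range(200):
    # Policy search: keep a random perturbation only if it improves the reward.
    eps = rng.normal(0, sigma, n_basis)
    if reward(rollout(w + eps)) > reward(rollout(w)):
        w = w + eps
    # Occasional human correction nudges the weights active near the via-point.
    if episode % 5 == 0:
        traj = rollout(w)
        t_mid = 0.5
        phi_mid = np.exp(-((t_mid - centers) ** 2) / (2 * width ** 2))
        w = w + advice_step * human_advice(traj) * phi_mid / phi_mid.sum()

print("final reward:", reward(rollout(w)))
```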

2018
Vol 27 (04)
pp. 1860005
Author(s):  
Konstantinos Tziortziotis ◽  
Nikolaos Tziortziotis ◽  
Kostas Vlachos ◽  
Konstantinos Blekas

This paper investigates the use of reinforcement learning for the navigation of an over-actuated (i.e., with more control inputs than degrees of freedom) marine platform in an unknown environment. The proposed approach uses an online least-squares policy iteration (LSPI) scheme for value function approximation to estimate the optimal policy, in conjunction with a low-level control system that controls the magnitude of the linear velocity and the orientation of the platform. The primary goal of the proposed scheme is the reduction of consumed energy. To that end, we propose a variable reward function that depends on the energy consumption of the platform. We evaluate our approach in a complex and realistic simulation environment and report results concerning its performance in estimating optimal navigation policies under different environmental disturbances and GPS position measurement noise. The proposed framework is compared, in terms of energy consumption, to a baseline approach based on virtual potential fields. The results show that the marine platform successfully discovers the target point, following a sub-optimal path while maintaining reduced energy consumption.
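
The core of such a scheme is the least-squares (LSTD-Q) policy evaluation step inside policy iteration. The sketch below runs it on a toy chain MDP with an energy-like per-step penalty; the MDP, one-hot features, and constants are illustrative stand-ins for the marine-platform model, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 2, 0.9

def features(s, a):
    """One-hot state-action features."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def step(s, a):
    """Chain MDP: action 1 moves right, action 0 moves left; goal at the end."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else -0.01  # energy-like penalty per step
    return s2, r

# Collect a fixed batch of random transitions.
D = []
for _ in range(2000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    s2, r = step(s, a)
    D.append((s, a, r, s2))

w = np.zeros(n_states * n_actions)
for _ in range(20):  # policy-iteration sweeps
    policy = lambda s: int(np.argmax([features(s, a) @ w for a in range(n_actions)]))
    A = 1e-6 * np.eye(len(w))  # small ridge term for invertibility
    b = np.zeros(len(w))
    for s, a, r, s2 in D:  # LSTD-Q: accumulate and solve A w = b
        phi, phi2 = features(s, a), features(s2, policy(s2))
        A += np.outer(phi, phi - gamma * phi2)
        b += phi * r
    w = np.linalg.solve(A, b)

print("greedy actions per state:", [policy(s) for s in range(n_states)])
```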


Author(s):  
Kai Liang Tan ◽  
Subhadipto Poddar ◽  
Soumik Sarkar ◽  
Anuj Sharma

Abstract Many existing traffic signal controllers are either simple adaptive controllers based on sensors placed around traffic intersections, or controllers optimized by traffic engineers on a fixed schedule. Optimizing traffic controllers is time-consuming and usually requires experienced traffic engineers. Recent research has demonstrated the potential of using deep reinforcement learning (DRL) in this context. However, most studies do not consider realistic settings that could seamlessly transition into deployment. In this paper, we propose a DRL-based adaptive traffic signal control framework that explicitly considers realistic traffic scenarios, sensors, and physical constraints. Within this framework, we also propose a novel reward function that yields significantly improved traffic performance compared with the typical baseline pre-timed and fully-actuated traffic signal controllers. The framework is implemented and validated on a simulation platform emulating real-life traffic scenarios and sensor data streams.
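
The abstract does not spell out the proposed reward function, so the sketch below shows a common illustrative choice for this setting: a negative weighted combination of per-lane queue lengths and stopped delays. The function name, weights, and inputs are assumptions, not the paper's design.

```python
def traffic_reward(queues, delays, w_queue=0.5, w_delay=0.5):
    """Negative weighted cost: fewer queued vehicles and less delay -> higher reward.

    queues: per-lane queue lengths (vehicles), e.g. from stop-bar sensors
    delays: per-lane cumulative stopped delays (seconds) over the last interval
    """
    queue_cost = sum(queues)
    delay_cost = sum(delays)
    return -(w_queue * queue_cost + w_delay * delay_cost)

# Example: four approach lanes observed over the last decision interval.
print(traffic_reward(queues=[3, 0, 5, 2], delays=[12.0, 0.0, 30.5, 8.2]))
```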


Robotica
2022
pp. 1-16
Author(s):  
Peng Zhang ◽  
Junxia Zhang

Abstract To assist patients with lower limb disabilities in normal walking, a new trajectory learning scheme for a lower limb exoskeleton robot based on dynamic movement primitives (DMPs) combined with reinforcement learning (RL) is proposed. The developed exoskeleton robot has six degrees of freedom (DOFs). The hip and knee of each artificial leg provide two electric-powered DOFs for flexion/extension, and two passive DOFs at the ankle allow inversion/eversion and plantarflexion/dorsiflexion. A five-point segmented gait planning strategy is proposed to generate gait trajectories, and the Zero Moment Point stability margin is used to construct a stability criterion that ensures the stability of the human-exoskeleton system. Based on the segmented gait planning strategy, multiple DMP sequences are used to model the generated trajectories. To eliminate the effect of uncertainties in joint space, RL is adopted to learn the trajectories. The experiments demonstrate that the proposed scheme can effectively remove interferences and uncertainties.
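
For reference, a minimal single-joint DMP in the standard Ijspeert formulation is sketched below; the gains, basis parameters, and integration settings are illustrative, not the exoskeleton's tuned values.

```python
import numpy as np

n_basis, alpha_s = 20, 4.0
K, D = 100.0, 2.0 * np.sqrt(100.0)  # critically damped spring-damper gains
c = np.exp(-alpha_s * np.linspace(0, 1, n_basis))   # basis centers in phase s
h = 1.0 / np.diff(c, append=c[-1] * 0.5) ** 2       # basis widths

def forcing(s, w):
    """Nonlinear forcing term shaped by learned weights w."""
    psi = np.exp(-h * (s - c) ** 2)
    return (psi @ w) * s / (psi.sum() + 1e-10)

def run_dmp(w, x0=0.0, g=1.0, tau=1.0, dt=0.001, T=1.0):
    """Integrate the DMP with Euler steps and return the joint trajectory."""
    x, v, s, traj = x0, 0.0, 1.0, []
    for _ in range(int(T / dt)):
        f = forcing(s, w)
        a = (K * (g - x) - D * v + (g - x0) * f) / tau  # transformation system
        v += a * dt
        x += (v / tau) * dt
        s += (-alpha_s * s / tau) * dt                  # canonical system
        traj.append(x)
    return np.asarray(traj)

traj = run_dmp(w=np.zeros(n_basis))  # zero weights -> plain point attractor
print("start %.3f -> end %.3f" % (traj[0], traj[-1]))
```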


2000
Vol 1719 (1)
pp. 165-174
Author(s):  
Peter R. Stopher ◽  
David A. Hensher

Transportation planners increasingly include a stated choice (SC) experiment as part of the armory of empirical sources of information on how individuals respond to current and potential travel contexts. The accumulated experience with SC data has been heavily conditioned on analyst prejudices about the acceptable complexity of the data collection instrument, especially the number of profiles (or treatments) given to each sampled individual (and the number of attributes and alternatives to be processed). It is not uncommon for transport demand modelers to impose stringent limitations on the complexity of an SC experiment. A review of the marketing and transport literature suggests that little is known about the basis for rejecting complex designs or accepting simple designs. Although more complex designs provide the analyst with increasing degrees of freedom in the estimation of models, facilitating nonlinearity in main effects and independent two-way interactions, it is not clear what the overall behavioral gains are in increasing the number of treatments. A complex design is developed as the basis for a stated choice study, producing a fractional factorial of 32 rows. The fraction is then truncated by administering 4, 8, 16, 24, and 32 profiles to a sample of 166 individuals (producing 1,016 treatments) in Australia and New Zealand faced with the decision to fly (or not to fly) between Australia and New Zealand by either Qantas or Ansett under alternative fare regimes. Statistical comparisons of elasticities (an appropriate behavioral basis for comparisons) suggest that the empirical gains within the context of a linear specification of the utility expression associated with each alternative in a discrete choice model may be quite marginal.


The research is topical and has both a theoretical and an applied nature. Theoretical developments in the cost management branch of an economic organization are illustrated by practical examples. The following methods are used in the work: the abstract-logical method, induction and deduction, system and situational approaches, comparative analysis, break-even analysis, and monographic analysis. The following sources of information were used: the literature, the results of experimental investigations carried out at the Verkhnevolzhsky Federal Agrarian Research Centre, and observations carried out at an agricultural organization. A theoretical structural model of cost management was created in a functional way; the relationships between its elements were designated, and the driving forces of its realization were determined. The terms “cost management” and “management accounting” were specified, and the paradigm of the relation to the cost process was described. Approaches to realizing an organization's cost management system on the basis of the flexibility principle were developed. The most important of them are: the application of applied program packages and special computer software, the organization of feedback, accounting for the functional relationship of costs with production results, cost accounting by elements, places of origin, carriers, and centers of responsibility, as well as an interactive approach. Examples showing the possibilities of using some instruments of cost management and management accounting for generating management solutions were described. These instruments are: analysis and planning on the basis of standards of constant and variable costs, and flexible cost management in interrelation with other subsystems of the organization's management system (in this case, technology management). The work is of theoretical and practical significance.


2021
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

Abstract We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
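
The abstract does not give the objective itself, so the sketch below only illustrates the optimization scheme it names: projected subgradient descent on a non-differentiable convex function, with a toy l1 regression loss standing in for the reward-mapping objective.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 5))
b = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

def subgradient(w):
    """A subgradient of f(w) = ||Aw - b||_1 (sign is a valid choice at kinks)."""
    return A.T @ np.sign(A @ w - b)

w = np.zeros(5)
for t in range(1, 2001):
    g = subgradient(w)
    w -= (0.1 / np.sqrt(t)) * g   # diminishing step size
    w = np.clip(w, -10.0, 10.0)   # projection onto a box constraint set

print("recovered weights:", np.round(w, 2))
```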


2021
Author(s):  
Amarildo Likmeta ◽  
Alberto Maria Metelli ◽  
Giorgia Ramponi ◽  
Andrea Tirinzoni ◽  
Matteo Giuliani ◽  
...  

Abstract In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging, as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and present three application scenarios: (1) the high-level decision-making problem in a highway driving scenario, (2) inferring user preferences in a social network (Twitter), and (3) the management of water release in the Como Lake. For each of these scenarios, we provide a formalization, experiments, and a discussion to interpret the obtained results.


Minerals
2021
Vol 11 (6)
pp. 587
Author(s):  
Joao Pedro de Carvalho ◽  
Roussos Dimitrakopoulos

This paper presents a new truck dispatching policy approach that adapts to different mining complex configurations in order to deliver the supply material extracted by the shovels to the processors. The method aims to improve adherence to the operational plan and fleet utilization in a mining complex context. Several sources of operational uncertainty arising from the loading, hauling, and dumping activities can influence the dispatching strategy. Given a fixed sequence of extraction of the mining blocks provided by the short-term plan, a discrete event simulator emulates the interactions arising from these mining operations. Repeatedly running this simulator, together with a reward function that associates a score with each dispatching decision, generates the sample experiences used to train a deep Q-learning reinforcement learning model. The model learns from past dispatching experience, so that when a new task is required, a well-informed decision can be taken quickly. The approach is tested at a copper–gold mining complex, characterized by uncertainties in equipment performance and geological attributes, and the results show improvements in terms of production targets, metal production, and fleet management.
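
A minimal sketch of the deep Q-learning loop such a dispatcher relies on (replay buffer plus target network) follows; the 8-dimensional state, four dispatch actions, and the `dispatch_env_step` placeholder for the discrete event simulator are hypothetical stand-ins.

```python
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)

def dispatch_env_step(state, action):
    """Stand-in for the discrete event mine simulator: returns (next_state, reward)."""
    return torch.randn(state_dim), random.random()

state = torch.randn(state_dim)
for step in range(1000):
    # Epsilon-greedy dispatching decision.
    if random.random() < 0.1:
        action = random.randrange(n_actions)
    else:
        with torch.no_grad():
            action = q_net(state).argmax().item()
    next_state, reward = dispatch_env_step(state, action)
    replay.append((state, action, reward, next_state))
    state = next_state

    if len(replay) >= 64:
        batch = random.sample(replay, 64)
        s = torch.stack([t[0] for t in batch])
        a = torch.tensor([t[1] for t in batch])
        r = torch.tensor([t[2] for t in batch])
        s2 = torch.stack([t[3] for t in batch])
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # bootstrap target from the frozen network
            target = r + gamma * target_net(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q, target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    if step % 100 == 0:  # periodic target-network sync
        target_net.load_state_dict(q_net.state_dict())
```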


Sensors
2021
Vol 21 (11)
pp. 3864
Author(s):  
Tarek Ghoul ◽  
Tarek Sayed

Speed advisories are used on highways to inform vehicles of upcoming changes in traffic conditions and to apply a variable speed limit that reduces traffic conflicts and delays. This study applies a similar concept to intersections, using connected vehicles to provide dynamic speed advisories in real time that guide vehicles toward an optimum speed. Real-time safety evaluation models for signalized intersections that depend on dynamic traffic parameters, such as traffic volume and shock wave characteristics, were used for this purpose. The proposed algorithm incorporates a rule-based approach alongside the Deep Deterministic Policy Gradient (DDPG) reinforcement learning technique to assign ideal speeds for connected vehicles at intersections and improve safety. The system was tested on two intersections using real-world data and yielded an average reduction in traffic conflicts ranging from 9% to 23%. Further analysis showed that the algorithm yields tangible results even at lower market penetration rates (MPRs). The algorithm was also tested on the same intersection under different traffic volume conditions, as well as on another intersection with different physical constraints and characteristics. The proposed algorithm provides a low-cost approach that is not computationally intensive and works towards optimizing safety by reducing rear-end traffic conflicts.
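
The hybrid rule-based/DDPG idea can be pictured as below: a learned actor proposes a continuous speed, and simple rules clamp it to what is feasible for the signal phase. The `ddpg_actor` placeholder, the rule, and the thresholds are all hypothetical, not the paper's actual design.

```python
def ddpg_actor(state):
    """Placeholder for a trained DDPG policy network (returns speed in m/s)."""
    return 12.0

def advise_speed(state, dist_to_stopbar, time_to_green, v_limit=16.7, v_min=2.0):
    v = ddpg_actor(state)
    if time_to_green > 0 and dist_to_stopbar / max(v, 1e-6) < time_to_green:
        # Rule: don't arrive before the green; slow down to arrive on time.
        v = dist_to_stopbar / time_to_green
    return min(max(v, v_min), v_limit)  # clamp to a legal/comfortable range

# Vehicle 150 m from the stop bar, green in 20 s: advise ~7.5 m/s.
print(advise_speed(state=None, dist_to_stopbar=150.0, time_to_green=20.0))
```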

