Reinforcement learning of motor skills using Policy Search and human corrective advice

2019
Vol 38 (14)
pp. 1560-1580
Author(s):  
Carlos Celemin ◽  
Guilherme Maeda ◽  
Javier Ruiz-del-Solar ◽  
Jan Peters ◽  
Jens Kober

Robot learning problems are limited by physical constraints, which can make learning successful policies for complex motor skills on real systems infeasible. Some reinforcement learning methods, like Policy Search, offer stable convergence toward locally optimal solutions, whereas interactive machine learning and learning-from-demonstration methods allow fast transfer of human knowledge to the agent; however, most such methods require expert demonstrations. In this work, we propose the use of human corrective advice in the action domain for learning motor trajectories. Additionally, we combine this human feedback with reward functions in a Policy Search learning scheme. Using both sources of information speeds up the learning process, since the intuitive knowledge of the human teacher is easily transferred to the agent, while the Policy Search method with the cost/reward function takes over to supervise the process and reduce the influence of occasional wrong human corrections. This interactive approach has been validated for learning movement primitives with simulated arms with several degrees of freedom in via-point reaching movements, and also with real robots in tasks such as “writing characters” and the ball-in-a-cup game. Compared with standard reinforcement learning without human advice, the results show that the proposed method not only converges to higher rewards when learning movement primitives, but also that learning is sped up by a factor of 4–40, depending on the task.
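
As a rough illustration of the idea, the sketch below combines a toy episodic policy search over basis-function weights with COACH-style binary corrective advice that nudges the weights near a via-point. Everything here (the 1-D trajectory, `reward`, `human_advice`, the step sizes) is an illustrative assumption, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_basis, horizon = 10, 100
centers = np.linspace(0, 1, n_basis)
width = 0.5 * (centers[1] - centers[0])

def rollout(w):
    """Generate a 1-D trajectory from basis-function weights w."""
    t = np.linspace(0, 1, horizon)
    phi = np.exp(-((t[:, None] - centers) ** 2) / (2 * width ** 2))
    phi /= phi.sum(axis=1, keepdims=True)
    return phi @ w

def reward(traj, via_point=(0.5, 1.0)):
    """Toy reward: pass close to a via-point at t = 0.5."""
    idx = int(via_point[0] * (horizon - 1))
    return -abs(traj[idx] - via_point[1])

def human_advice(traj):
    """Stand-in for a human's binary correction ('higher'/'lower')."""
    idx = horizon // 2
    return np.sign(1.0 - traj[idx])  # pretend the teacher wants traj[idx] ~ 1

w = np.zeros(n_basis)
sigma, advice_step = 0.1, 0.05
for episode in range(200):
    # Policy search: keep a random perturbation only if it improves the reward.
    eps = rng.normal(0, sigma, n_basis)
    if reward(rollout(w + eps)) > reward(rollout(w)):
        w = w + eps
    # Occasional human correction nudges the weights active near the via-point.
    if episode % 5 == 0:
        traj = rollout(w)
        t_mid = 0.5
        phi_mid = np.exp(-((t_mid - centers) ** 2) / (2 * width ** 2))
        w = w + advice_step * human_advice(traj) * phi_mid / phi_mid.sum()

print("final reward:", reward(rollout(w)))
```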

2018
Vol 27 (04)
pp. 1860005
Author(s):  
Konstantinos Tziortziotis ◽  
Nikolaos Tziortziotis ◽  
Kostas Vlachos ◽  
Konstantinos Blekas

This paper investigates the use of reinforcement learning for the navigation of an over-actuated (i.e., with more control inputs than degrees of freedom) marine platform in an unknown environment. The proposed approach uses an online least-squares policy iteration (LSPI) scheme for value function approximation to estimate the optimal policy, in conjunction with a low-level control system that controls the magnitude of the linear velocity and the orientation of the platform. The primary goal of the proposed scheme is the reduction of consumed energy. To that end, we propose a variable reward function that depends on the energy consumption of the platform. We evaluate our approach in a complex and realistic simulation environment and report results concerning its performance in estimating optimal navigation policies under different environmental disturbances and GPS position measurement noise. The proposed framework is compared, in terms of energy consumption, to a baseline approach based on virtual potential fields. The results show that the marine platform successfully discovers the target point, following a sub-optimal path while maintaining reduced energy consumption.
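
The core of such a scheme is the least-squares (LSTD-Q) policy evaluation step inside policy iteration. The sketch below runs it on a toy chain MDP with an energy-like per-step penalty; the MDP, one-hot features, and constants are illustrative stand-ins for the marine-platform model, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 2, 0.9

def features(s, a):
    """One-hot state-action features."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def step(s, a):
    """Chain MDP: action 1 moves right, action 0 moves left; goal at the end."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else -0.01  # energy-like penalty per step
    return s2, r

# Collect a fixed batch of random transitions.
D = []
for _ in range(2000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    s2, r = step(s, a)
    D.append((s, a, r, s2))

w = np.zeros(n_states * n_actions)
for _ in range(20):  # policy-iteration sweeps
    policy = lambda s: int(np.argmax([features(s, a) @ w for a in range(n_actions)]))
    A = 1e-6 * np.eye(len(w))  # small ridge term for invertibility
    b = np.zeros(len(w))
    for s, a, r, s2 in D:  # LSTD-Q: accumulate and solve A w = b
        phi, phi2 = features(s, a), features(s2, policy(s2))
        A += np.outer(phi, phi - gamma * phi2)
        b += phi * r
    w = np.linalg.solve(A, b)

print("greedy actions per state:", [policy(s) for s in range(n_states)])
```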


Author(s):  
Kai Liang Tan ◽  
Subhadipto Poddar ◽  
Soumik Sarkar ◽  
Anuj Sharma

Abstract Many existing traffic signal controllers are either simple adaptive controllers based on sensors placed around traffic intersections, or controllers optimized by traffic engineers on a fixed schedule. Optimizing traffic controllers is time-consuming and usually requires experienced traffic engineers. Recent research has demonstrated the potential of using deep reinforcement learning (DRL) in this context. However, most studies do not consider realistic settings that could seamlessly transition into deployment. In this paper, we propose a DRL-based adaptive traffic signal control framework that explicitly considers realistic traffic scenarios, sensors, and physical constraints. Within this framework, we also propose a novel reward function that yields significantly improved traffic performance compared with the typical baseline pre-timed and fully-actuated traffic signal controllers. The framework is implemented and validated on a simulation platform emulating real-life traffic scenarios and sensor data streams.
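
The abstract does not spell out the proposed reward function, so the sketch below shows a common illustrative choice for this setting: a negative weighted combination of per-lane queue lengths and stopped delays. The function name, weights, and inputs are assumptions, not the paper's design.

```python
def traffic_reward(queues, delays, w_queue=0.5, w_delay=0.5):
    """Negative weighted cost: fewer queued vehicles and less delay -> higher reward.

    queues: per-lane queue lengths (vehicles), e.g. from stop-bar sensors
    delays: per-lane cumulative stopped delays (seconds) over the last interval
    """
    queue_cost = sum(queues)
    delay_cost = sum(delays)
    return -(w_queue * queue_cost + w_delay * delay_cost)

# Example: four approach lanes observed over the last decision interval.
print(traffic_reward(queues=[3, 0, 5, 2], delays=[12.0, 0.0, 30.5, 8.2]))
```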


Robotica
2022
pp. 1-16
Author(s):  
Peng Zhang ◽  
Junxia Zhang

Abstract To assist patients with lower limb disabilities in normal walking, a new trajectory learning scheme for a lower limb exoskeleton robot based on dynamic movement primitives (DMPs) combined with reinforcement learning (RL) is proposed. The developed exoskeleton robot has six degrees of freedom (DOFs). The hip and knee of each artificial leg provide two electric-powered DOFs for flexion/extension, and two passive DOFs at the ankle allow inversion/eversion and plantarflexion/dorsiflexion. A five-point segmented gait planning strategy is proposed to generate gait trajectories, and the Zero Moment Point stability margin is used to construct a stability criterion that ensures the stability of the human-exoskeleton system. Based on the segmented gait planning strategy, multiple DMP sequences are used to model the generated trajectories. To eliminate the effect of uncertainties in joint space, RL is adopted to learn the trajectories. The experiments demonstrate that the proposed scheme can effectively remove interferences and uncertainties.
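
For reference, a minimal single-joint DMP in the standard Ijspeert formulation is sketched below; the gains, basis parameters, and integration settings are illustrative, not the exoskeleton's tuned values.

```python
import numpy as np

n_basis, alpha_s = 20, 4.0
K, D = 100.0, 2.0 * np.sqrt(100.0)  # critically damped spring-damper gains
c = np.exp(-alpha_s * np.linspace(0, 1, n_basis))   # basis centers in phase s
h = 1.0 / np.diff(c, append=c[-1] * 0.5) ** 2       # basis widths

def forcing(s, w):
    """Nonlinear forcing term shaped by learned weights w."""
    psi = np.exp(-h * (s - c) ** 2)
    return (psi @ w) * s / (psi.sum() + 1e-10)

def run_dmp(w, x0=0.0, g=1.0, tau=1.0, dt=0.001, T=1.0):
    """Integrate the DMP with Euler steps and return the joint trajectory."""
    x, v, s, traj = x0, 0.0, 1.0, []
    for _ in range(int(T / dt)):
        f = forcing(s, w)
        a = (K * (g - x) - D * v + (g - x0) * f) / tau  # transformation system
        v += a * dt
        x += (v / tau) * dt
        s += (-alpha_s * s / tau) * dt                  # canonical system
        traj.append(x)
    return np.asarray(traj)

traj = run_dmp(w=np.zeros(n_basis))  # zero weights -> plain point attractor
print("start %.3f -> end %.3f" % (traj[0], traj[-1]))
```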


2000
Vol 1719 (1)
pp. 165-174
Author(s):  
Peter R. Stopher ◽  
David A. Hensher

Transportation planners increasingly include a stated choice (SC) experiment as part of the armory of empirical sources of information on how individuals respond to current and potential travel contexts. The accumulated experience with SC data has been heavily conditioned on analyst prejudices about the acceptable complexity of the data collection instrument, especially the number of profiles (or treatments) given to each sampled individual (and the number of attributes and alternatives to be processed). It is not uncommon for transport demand modelers to impose stringent limitations on the complexity of an SC experiment. A review of the marketing and transport literature suggests that little is known about the basis for rejecting complex designs or accepting simple designs. Although more complex designs provide the analyst with increasing degrees of freedom in the estimation of models, facilitating nonlinearity in main effects and independent two-way interactions, it is not clear what the overall behavioral gains are in increasing the number of treatments. A complex design is developed as the basis for a stated choice study, producing a fractional factorial of 32 rows. The fraction is then truncated by administering 4, 8, 16, 24, and 32 profiles to a sample of 166 individuals (producing 1,016 treatments) in Australia and New Zealand faced with the decision to fly (or not to fly) between Australia and New Zealand by either Qantas or Ansett under alternative fare regimes. Statistical comparisons of elasticities (an appropriate behavioral basis for comparisons) suggest that the empirical gains within the context of a linear specification of the utility expression associated with each alternative in a discrete choice model may be quite marginal.


The research is topical and has both a theoretical and an applied nature. Theoretical developments in the cost management branch of an economic organization are illustrated by practical examples. The following methods are used in the work: the abstract-logical method, induction and deduction, system and situational approaches, comparative analysis, break-even analysis, and monographic analysis. The following sources of information were used: the literature, the results of experimental investigations carried out at the Verkhnevolzhsky Federal Agrarian Research Centre, and observations carried out at an agricultural organization. A theoretical structural model of cost management was created in a functional way; the relationships between its elements were designated, and the driving forces of its realization were determined. The terms “cost management” and “management accounting” were specified, and the paradigm of the relation to the cost process was described. Approaches to realizing an organization's cost management system on the basis of the flexibility principle were developed. The most important of them are: the application of applied program packages and special computer software, the organization of feedback, accounting for the functional relationship of costs with production results, cost accounting by elements, places of origin, carriers, and centers of responsibility, as well as an interactive approach. Examples showing the possibilities of using some instruments of cost management and management accounting for generating management solutions were described. These instruments are: analysis and planning on the basis of standards of constant and variable costs, and flexible cost management in interrelation with other subsystems of the organization's management system (in this case, technology management). The work is of theoretical and practical significance.


2021
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

Abstract We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
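
The abstract does not give the objective itself, so the sketch below only illustrates the optimization scheme it names: projected subgradient descent on a non-differentiable convex function, with a toy l1 regression loss standing in for the reward-mapping objective.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 5))
b = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

def subgradient(w):
    """A subgradient of f(w) = ||Aw - b||_1 (sign is a valid choice at kinks)."""
    return A.T @ np.sign(A @ w - b)

w = np.zeros(5)
for t in range(1, 2001):
    g = subgradient(w)
    w -= (0.1 / np.sqrt(t)) * g   # diminishing step size
    w = np.clip(w, -10.0, 10.0)   # projection onto a box constraint set

print("recovered weights:", np.round(w, 2))
```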


2021
Author(s):  
Amarildo Likmeta ◽  
Alberto Maria Metelli ◽  
Giorgia Ramponi ◽  
Andrea Tirinzoni ◽  
Matteo Giuliani ◽  
...  

Abstract In real-world applications, inferring the intentions of expert agents (e.g., human operators) can be fundamental to understanding how possibly conflicting objectives are managed, helping to interpret the demonstrated behavior. In this paper, we discuss how inverse reinforcement learning (IRL) can be employed to retrieve the reward function implicitly optimized by expert agents acting in real applications. Scaling IRL to real-world cases has proved challenging, as typically only a fixed dataset of demonstrations is available and further interactions with the environment are not allowed. For this reason, we resort to a class of truly batch model-free IRL algorithms and present three application scenarios: (1) the high-level decision-making problem in a highway driving scenario, (2) inferring user preferences in a social network (Twitter), and (3) the management of water release in the Como Lake. For each of these scenarios, we provide a formalization, experiments, and a discussion to interpret the obtained results.


Minerals
2021
Vol 11 (6)
pp. 587
Author(s):  
Joao Pedro de Carvalho ◽  
Roussos Dimitrakopoulos

This paper presents a new truck dispatching policy approach that adapts to different mining complex configurations in order to deliver the supply material extracted by the shovels to the processors. The method aims to improve adherence to the operational plan and fleet utilization in a mining complex context. Several sources of operational uncertainty arising from the loading, hauling, and dumping activities can influence the dispatching strategy. Given a fixed sequence of extraction of the mining blocks provided by the short-term plan, a discrete event simulator emulates the interactions arising from these mining operations. Repeatedly running this simulator, together with a reward function that associates a score with each dispatching decision, generates the sample experiences used to train a deep Q-learning reinforcement learning model. The model learns from past dispatching experience, so that when a new task is required, a well-informed decision can be taken quickly. The approach is tested at a copper–gold mining complex, characterized by uncertainties in equipment performance and geological attributes, and the results show improvements in terms of production targets, metal production, and fleet management.
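
A minimal sketch of the deep Q-learning loop such a dispatcher relies on (replay buffer plus target network) follows; the 8-dimensional state, four dispatch actions, and the `dispatch_env_step` placeholder for the discrete event simulator are hypothetical stand-ins.

```python
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)

def dispatch_env_step(state, action):
    """Stand-in for the discrete event mine simulator: returns (next_state, reward)."""
    return torch.randn(state_dim), random.random()

state = torch.randn(state_dim)
for step in range(1000):
    # Epsilon-greedy dispatching decision.
    if random.random() < 0.1:
        action = random.randrange(n_actions)
    else:
        with torch.no_grad():
            action = q_net(state).argmax().item()
    next_state, reward = dispatch_env_step(state, action)
    replay.append((state, action, reward, next_state))
    state = next_state

    if len(replay) >= 64:
        batch = random.sample(replay, 64)
        s = torch.stack([t[0] for t in batch])
        a = torch.tensor([t[1] for t in batch])
        r = torch.tensor([t[2] for t in batch])
        s2 = torch.stack([t[3] for t in batch])
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # bootstrap target from the frozen network
            target = r + gamma * target_net(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q, target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    if step % 100 == 0:  # periodic target-network sync
        target_net.load_state_dict(q_net.state_dict())
```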


Sensors
2021
Vol 21 (11)
pp. 3864
Author(s):  
Tarek Ghoul ◽  
Tarek Sayed

Speed advisories are used on highways to inform vehicles of upcoming changes in traffic conditions and to apply a variable speed limit that reduces traffic conflicts and delays. This study applies a similar concept to intersections, using connected vehicles to provide dynamic speed advisories in real time that guide vehicles toward an optimum speed. Real-time safety evaluation models for signalized intersections that depend on dynamic traffic parameters, such as traffic volume and shock wave characteristics, were used for this purpose. The proposed algorithm incorporates a rule-based approach alongside the Deep Deterministic Policy Gradient (DDPG) reinforcement learning technique to assign ideal speeds for connected vehicles at intersections and improve safety. The system was tested on two intersections using real-world data and yielded an average reduction in traffic conflicts ranging from 9% to 23%. Further analysis showed that the algorithm yields tangible results even at lower market penetration rates (MPRs). The algorithm was also tested on the same intersection under different traffic volume conditions, as well as on another intersection with different physical constraints and characteristics. The proposed algorithm provides a low-cost approach that is not computationally intensive and works towards optimizing safety by reducing rear-end traffic conflicts.
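
The hybrid rule-based/DDPG idea can be pictured as below: a learned actor proposes a continuous speed, and simple rules clamp it to what is feasible for the signal phase. The `ddpg_actor` placeholder, the rule, and the thresholds are all hypothetical, not the paper's actual design.

```python
def ddpg_actor(state):
    """Placeholder for a trained DDPG policy network (returns speed in m/s)."""
    return 12.0

def advise_speed(state, dist_to_stopbar, time_to_green, v_limit=16.7, v_min=2.0):
    v = ddpg_actor(state)
    if time_to_green > 0 and dist_to_stopbar / max(v, 1e-6) < time_to_green:
        # Rule: don't arrive before the green; slow down to arrive on time.
        v = dist_to_stopbar / time_to_green
    return min(max(v, v_min), v_limit)  # clamp to a legal/comfortable range

# Vehicle 150 m from the stop bar, green in 20 s: advise ~7.5 m/s.
print(advise_speed(state=None, dist_to_stopbar=150.0, time_to_green=20.0))
```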

