Task Transfer by Preference-Based Cost Learning

Author(s):  
Mingxuan Jing ◽  
Xiaojian Ma ◽  
Wenbing Huang ◽  
Fuchun Sun ◽  
Huaping Liu

The goal of task transfer in reinforcement learning is to migrate the action policy of an agent from a source task to a target task. Given their successes on robotic action planning, current methods mostly rely on two requirements: exactly-relevant expert demonstrations or an explicitly-coded cost function on the target task, both of which, however, are inconvenient to obtain in practice. In this paper, we relax these two strong conditions by developing a novel task transfer framework in which expert preference is applied as guidance. In particular, we alternate the following two steps: first, experts apply pre-defined preference rules to select expert demonstrations relevant to the target task; second, based on the selection result, we learn the target cost function and trajectory distribution simultaneously via enhanced Adversarial MaxEnt IRL and generate more trajectories from the learned target distribution for the next round of preference selection. Theoretical analyses of the distribution learning and the convergence of the proposed algorithm are provided. Extensive simulations on several benchmarks have been conducted to further verify the effectiveness of the proposed method.
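
The framework above alternates preference-based selection with cost and distribution learning. Below is a minimal, hypothetical sketch of that alternation in Python; the names (preference_rule, fit_cost, sample_trajectories) and the reduction of the Adversarial MaxEnt IRL step to a weighted feature-matching update are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the alternating scheme: (1) a preference rule selects demonstrations
# relevant to the target task, (2) a cost function / trajectory distribution is
# refit on the selection, (3) new trajectories are generated for the next round.
import numpy as np

rng = np.random.default_rng(0)

def features(traj):
    """Trajectory feature expectation (here: mean of the state vectors)."""
    return traj.mean(axis=0)

def preference_rule(trajs, target_feat):
    """Expert-preference stand-in: keep trajectories closest to target features."""
    dists = [np.linalg.norm(features(t) - target_feat) for t in trajs]
    keep = np.argsort(dists)[: max(1, len(trajs) // 2)]
    return [trajs[i] for i in keep]

def fit_cost(selected, w, lr=0.5):
    """Gradient step on a linear cost, MaxEnt-IRL style: push the model's
    feature expectation toward the selected demonstrations."""
    demo_feat = np.mean([features(t) for t in selected], axis=0)
    model_feat = w  # toy surrogate for the learned distribution's expectation
    return w + lr * (demo_feat - model_feat)

def sample_trajectories(w, n=20, horizon=10):
    """Sample new trajectories around the current model (toy Gaussian rollouts)."""
    return [w + 0.3 * rng.standard_normal((horizon, w.size)) for _ in range(n)]

target_feat = np.array([1.0, -0.5])   # hypothetical target-task signature
w = np.zeros(2)                       # cost / distribution parameters
trajs = sample_trajectories(w)        # initial source-like trajectories

for it in range(10):                  # alternate selection and learning
    selected = preference_rule(trajs, target_feat)
    w = fit_cost(selected, w)
    trajs = sample_trajectories(w)    # generate for the next preference selection

print("learned parameters:", w)       # drifts toward target_feat in this toy setup
```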

Author(s):  
Marko Švaco ◽  
Bojan Jerbić ◽  
Mateo Polančec ◽  
Filip Šuligoj

Author(s):  
Mohammadamin Barekatain ◽  
Ryo Yonetani ◽  
Masashi Hamaya

Transfer reinforcement learning (RL) aims at improving the learning efficiency of an agent by exploiting knowledge from other source agents trained on relevant tasks. However, it remains challenging to transfer knowledge between different environmental dynamics without having access to the source environments. In this work, we explore a new challenge in transfer RL, where only a set of source policies collected under diverse unknown dynamics is available for learning a target task efficiently. To address this problem, we propose MULTI-source POLicy AggRegation (MULTIPOLAR), which comprises two key techniques. We learn to aggregate the actions provided by the source policies adaptively to maximize the target task performance. Meanwhile, we learn an auxiliary network that predicts residuals around the aggregated actions, which ensures the target policy's expressiveness even when some of the source policies perform poorly. We demonstrate the effectiveness of MULTIPOLAR through an extensive experimental evaluation across six simulated environments ranging from classic control problems to challenging robotics simulations, under both continuous and discrete action spaces. The demo videos and code are available on the project webpage: https://omron-sinicx.github.io/multipolar/.
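
As a reading aid, here is a minimal sketch of the target-policy structure MULTIPOLAR describes: the target action is a learned aggregation of the source policies' actions plus a learned residual. The class and variable names are illustrative, both the aggregation and the residual are reduced to single linear maps, and this is not the authors' code (which is linked on the project page above).

```python
# Sketch of a MULTIPOLAR-style target policy: adaptive aggregation of frozen
# source policies plus an auxiliary residual term.
import numpy as np

rng = np.random.default_rng(1)

class MultipolarPolicy:
    def __init__(self, source_policies, state_dim, action_dim):
        self.sources = source_policies                  # K frozen source policies
        K = len(source_policies)
        self.W = np.ones((K, action_dim))               # aggregation weights (learned)
        self.A = 0.01 * rng.standard_normal((state_dim, action_dim))  # residual "net"
        self.b = np.zeros(action_dim)                   # (here: one linear layer)

    def act(self, state):
        src = np.stack([p(state) for p in self.sources])    # (K, action_dim)
        aggregated = (self.W * src).sum(axis=0) / len(self.sources)
        residual = state @ self.A + self.b               # auxiliary residual term
        return aggregated + residual

# Usage with two hypothetical source policies on a 3-D state, 1-D action task;
# in practice W, A, b would be trained against the target-task RL objective.
sources = [lambda s: np.tanh(s[:1]), lambda s: -0.5 * s[:1]]
policy = MultipolarPolicy(sources, state_dim=3, action_dim=1)
print(policy.act(np.array([0.2, -0.1, 0.4])))
```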


2018 ◽  
Vol 30 (7) ◽  
pp. 1983-2004 ◽  
Author(s):  
Yazhou Hu ◽  
Bailu Si

We propose a neural network model for reinforcement learning to control a robotic manipulator with unknown parameters and dead zones. The model is composed of three networks. The state of the robotic manipulator is predicted by the state network of the model, the action policy is learned by the action network, and the performance index of the action policy is estimated by a critic network. The three networks work together to optimize the performance index based on the reinforcement learning control scheme. The convergence of the learning methods is analyzed. Application of the proposed model to a simulated two-link robotic manipulator demonstrates the effectiveness and the stability of the model.
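
A minimal sketch of the three-network structure described above, with each network reduced to a single linear (or tanh) layer for brevity; the names and the toy forward pass are illustrative assumptions, not the authors' learning scheme.

```python
# Sketch of the three cooperating networks: a state network that predicts the
# next manipulator state, an action network that outputs the control action,
# and a critic network that estimates the performance index.
import numpy as np

rng = np.random.default_rng(2)
state_dim, action_dim = 4, 2

W_state = 0.1 * rng.standard_normal((state_dim + action_dim, state_dim))  # state net
W_action = 0.1 * rng.standard_normal((state_dim, action_dim))             # action net
W_critic = 0.1 * rng.standard_normal(state_dim)                           # critic net

def predict_state(s, a):          # state network: predicts the next state
    return np.concatenate([s, a]) @ W_state

def policy(s):                    # action network: maps state to action
    return np.tanh(s @ W_action)

def value(s):                     # critic network: performance-index estimate
    return s @ W_critic

s = rng.standard_normal(state_dim)
a = policy(s)
s_next_pred = predict_state(s, a)
print("predicted next state:", s_next_pred, "critic value:", value(s))
```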


Author(s):  
Zhe Xu ◽  
Ufuk Topcu

Transferring high-level knowledge from a source task to a target task is an effective way to expedite reinforcement learning (RL). For example, propositional logic and first-order logic have been used as representations of such knowledge. We study the transfer of knowledge between tasks in which the timing of the events matters. We call such tasks temporal tasks. We concretize similarity between temporal tasks through a notion of logical transferability, and develop a transfer learning approach between different yet similar temporal tasks. We first propose an inference technique to extract metric interval temporal logic (MITL) formulas in sequential disjunctive normal form from labeled trajectories collected in RL of the two tasks. If logical transferability is identified through this inference, we construct a timed automaton for each sequential conjunctive subformula of the inferred MITL formulas from both tasks. We perform RL on the extended state which includes the locations and clock valuations of the timed automata for the source task. We then establish mappings between the corresponding components (clocks, locations, etc.) of the timed automata from the two tasks, and transfer the extended Q-functions based on the established mappings. Finally, we perform RL on the extended state for the target task, starting with the transferred extended Q-functions. Our implementation results show that, depending on how similar the source task and the target task are, the sampling efficiency for the target task can be improved by up to one order of magnitude by performing RL in the extended state space, and further improved by up to another order of magnitude using the transferred extended Q-functions.
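
A minimal, tabular sketch of the extended-state and Q-function transfer idea described above: the RL state is augmented with a timed-automaton location and clock valuation, and the target task is warm-started from a Q-table mapped through the correspondence between automaton components. The toy automaton, mapping, and update are illustrative assumptions, not the paper's MITL inference pipeline.

```python
# Sketch of extended-state Q-learning with transfer through component mappings.
import random
from collections import defaultdict

random.seed(0)
ACTIONS = [0, 1]

def extended(state, location, clock):
    """Extended state = (environment state, automaton location, clock value)."""
    return (state, location, clock)

# Hypothetical Q-table learned on the source task over its extended states.
Q_source = defaultdict(float)
Q_source[(extended(0, "l0", 0), 1)] = 0.8
Q_source[(extended(1, "l1", 2), 0)] = 0.5

# Mappings between corresponding components (locations, clocks) of the source
# and target timed automata, as established in the paper.
location_map = {"l0": "m0", "l1": "m1"}
clock_map = lambda c: c  # identity mapping between corresponding clocks

# Transfer: initialize the target Q-table through the component mappings.
Q_target = defaultdict(float)
for ((s, loc, c), a), q in Q_source.items():
    Q_target[(extended(s, location_map[loc], clock_map(c)), a)] = q

# Target-task Q-learning then proceeds from this warm start (one toy update).
alpha, gamma = 0.5, 0.9
x, a, reward, x_next = extended(0, "m0", 0), 1, 1.0, extended(1, "m1", 1)
best_next = max(Q_target[(x_next, b)] for b in ACTIONS)
Q_target[(x, a)] += alpha * (reward + gamma * best_next - Q_target[(x, a)])
print(dict(Q_target))
```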


1997 ◽  
Vol 272 (6) ◽  
pp. C2037-C2048 ◽  
Author(s):  
X. Yu ◽  
N. M. Alpert ◽  
E. D. Lewandowski

Measurements of oxidative metabolism in the heart from dynamic ¹³C nuclear magnetic resonance (NMR) spectroscopy rely on ¹³C turnover in the NMR-detectable glutamate pool. A kinetic model was developed for the analysis of isotope turnover to determine tricarboxylic acid cycle flux (VTCA) and the interconversion rate between α-ketoglutarate and glutamate (F1) by fitting the model to NMR data of glutamate enrichment. The results of data fitting are highly reproducible when the noise level is within 10%, making this model applicable to single or grouped experiments. The values for VTCA and F1 were unchanged whether obtained from least-squares fitting of the model to mean experimental enrichment data with standard deviations in the cost function (VTCA = 10.52 μmol·min⁻¹·g dry wt⁻¹, F1 = 10.67 μmol·min⁻¹·g dry wt⁻¹) or to the individual enrichment values for each heart with the NMR noise level in the cost function (VTCA = 10.67 μmol·min⁻¹·g dry wt⁻¹, F1 = 10.18 μmol·min⁻¹·g dry wt⁻¹). Computer simulation and theoretical analysis indicate that glutamate enrichment kinetics are insensitive to the fractional enrichment of acetyl-CoA and changes in small intermediate pools (<1 μmol/g dry wt). Therefore, high-resolution NMR analysis of tissue extracts and biochemical assays for intermediates at low concentrations are unnecessary. However, a high correlation between VTCA and F1 exists, as anticipated from competition for α-ketoglutarate, which indicates the utility of introducing independent experimental constraints into the data fitting for accurate quantification.
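
To illustrate the fitting procedure described above, here is a minimal sketch of least-squares estimation of two rate parameters (standing in for VTCA and F1) in a two-pool isotope turnover model fitted to a glutamate enrichment time course. The pool sizes, rate equations, and synthetic data are illustrative assumptions, not the paper's kinetic model or experimental values.

```python
# Sketch: fit (V_TCA, F1)-like rate parameters of a toy alpha-ketoglutarate <->
# glutamate turnover model to a 13C glutamate-enrichment time course.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

AKG_POOL, GLU_POOL = 0.5, 6.0      # assumed pool sizes (umol/g dry wt)
ACETYL_ENRICH = 1.0                # assumed acetyl-CoA fractional enrichment

def model(t, y, v_tca, f1):
    """d(enrichment)/dt for the alpha-ketoglutarate and glutamate pools."""
    akg, glu = y
    d_akg = (v_tca * (ACETYL_ENRICH - akg) + f1 * (glu - akg)) / AKG_POOL
    d_glu = f1 * (akg - glu) / GLU_POOL
    return [d_akg, d_glu]

def glu_enrichment(params, t):
    v_tca, f1 = params
    sol = solve_ivp(model, (0, t[-1]), [0.0, 0.0], t_eval=t, args=(v_tca, f1))
    return sol.y[1]                # NMR-detectable glutamate enrichment

# Synthetic "measured" time course generated with known parameters plus noise.
t_obs = np.linspace(0, 30, 16)                       # minutes
true = (10.5, 10.7)                                  # umol/min/g dry wt
rng = np.random.default_rng(3)
data = glu_enrichment(true, t_obs) + 0.02 * rng.standard_normal(t_obs.size)

# Least-squares fit of the two rate parameters to the enrichment data.
fit = least_squares(lambda p: glu_enrichment(p, t_obs) - data,
                    x0=[5.0, 5.0], bounds=(0.0, 50.0))
print("estimated rates:", fit.x)
```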

