A Survey of Multi-Task Deep Reinforcement Learning

Nelson Vithayathil Varghese; Qusay H. Mahmoud

doi:10.3390/electronics9091363

Electronics ◽

10.3390/electronics9091363 ◽

2020 ◽

Vol 9 (9) ◽

pp. 1363

Author(s):

Nelson Vithayathil Varghese ◽

Qusay H. Mahmoud

Keyword(s):

Deep Learning ◽

Reinforcement Learning ◽

Intelligent Agents ◽

Learning Algorithms ◽

Representation Learning ◽

Vital Role ◽

Learning Agents ◽

Model Free ◽

Limited Applicability ◽

Rich Data

Driven by recent advances in artificial intelligence research, deep learning has emerged as a promising representation learning technique across all classes of machine learning, especially within reinforcement learning. This direction has given rise to a new domain, deep reinforcement learning, which combines the representation learning power of deep learning with existing reinforcement learning methods. The inception of deep reinforcement learning has played a vital role in optimizing the performance of reinforcement learning-based intelligent agents that use model-free approaches. Although these methods improved agent performance to a great extent, they were mainly limited to systems whose reinforcement learning algorithms focused on learning a single task. At the same time, this approach was found to be relatively data-inefficient, particularly when reinforcement learning agents needed to interact with more complex, data-rich environments. This is primarily due to the limited applicability of deep reinforcement learning algorithms across related tasks from the same environment. The objective of this paper is to survey the research challenges associated with multi-tasking in deep reinforcement learning and to present the state-of-the-art approaches by comparing and contrasting recent solutions, namely DISTRAL (DIStill & TRAnsfer Learning), IMPALA (Importance Weighted Actor-Learner Architecture) and PopArt, which aim to address core challenges such as scalability, the distraction dilemma, partial observability, catastrophic forgetting and negative knowledge transfer.

Experimental evaluation of model-free reinforcement learning algorithms for continuous HVAC control

Applied Energy ◽

10.1016/j.apenergy.2021.117164 ◽

2021 ◽

Vol 298 ◽

pp. 117164

Author(s):

Marco Biemann ◽

Fabian Scheller ◽

Xiufeng Liu ◽

Lizhen Huang

Keyword(s):

Reinforcement Learning ◽

Experimental Evaluation ◽

Learning Algorithms ◽

Model Free ◽

HVAC Control

Deterministic Value-Policy Gradients

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5732 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3316-3323

Author(s):

Qingpeng Cai ◽

Ling Pan ◽

Pingzhong Tang

Keyword(s):

Reinforcement Learning ◽

State Of The Art ◽

Learning Algorithms ◽

Infinite Horizon ◽

Gradient Algorithm ◽

Continuous Control ◽

Model Bias ◽

Model Free ◽

Policy Gradient ◽

Analytical Gradients

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is too myopic compared with an infinite horizon. We first give a theoretical guarantee of the existence of the value gradients in the infinite-horizon setting. Based on this theoretical guarantee, we propose a class of deterministic value gradient algorithms (DVG) with infinite horizon, in which the number of rollout steps of the analytical gradients through the learned model trades off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
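
The role of the rollout length can be illustrated with a minimal sketch. It is not the DVG or DVPG algorithm from the paper: it estimates a value rather than a gradient, and the function `k_step_value`, the toy dynamics, the reward, and the policy are all hypothetical choices made for illustration. It only shows the general pattern of rolling a learned model forward for `k` steps and then bootstrapping with a value estimate, so that a larger `k` relies more on the model and a smaller `k` relies more on the value function.

```python
def k_step_value(state, policy, model, reward_fn, value_fn, k, gamma=0.99):
    """Roll a (learned) model forward k steps, then bootstrap with a value estimate."""
    total, discount = 0.0, 1.0
    for _ in range(k):
        action = policy(state)
        total += discount * reward_fn(state, action)
        state = model(state, action)  # model error would accumulate with each step
        discount *= gamma
    return total + discount * value_fn(state)


# Hypothetical one-dimensional toy problem (all four functions are assumptions).
policy = lambda s: -0.5 * s        # deterministic policy
model = lambda s, a: s + a         # stand-in for a learned transition model
reward_fn = lambda s, a: -(s ** 2)  # penalise distance from the origin
value_fn = lambda s: 0.0           # deliberately crude value estimate

estimates = [
    k_step_value(1.0, policy, model, reward_fn, value_fn, k, gamma=1.0)
    for k in range(4)
]
print(estimates)
```

With the crude value estimate used here, the result changes as `k` grows because more of the return is computed through the model instead of being read from `value_fn`.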

Model-Free Deep Reinforcement Learning—Algorithms and Applications

Reinforcement Learning Algorithms: Analysis and Applications - Studies in Computational Intelligence ◽

10.1007/978-3-030-41188-6_10 ◽

2021 ◽

pp. 109-121

Author(s):

Fabian Otto

Keyword(s):

Reinforcement Learning ◽

Learning Algorithms ◽

Model Free

Episodic Control as Meta-Reinforcement Learning

10.1101/360537 ◽

2018 ◽

Cited By ~ 3

Author(s):

S Ritter ◽

JX Wang ◽

Z Kurth-Nelson ◽

M Botvinick

Keyword(s):

Reinforcement Learning ◽

Episodic Memory ◽

Learning Strategies ◽

Learning Algorithms ◽

Memory System ◽

Generic Model ◽

Model Based ◽

Model Free

Recent research has placed episodic reinforcement learning (RL) alongside model-free and model-based RL on the list of processes centrally involved in human reward-based learning. In the present work, we extend the unified account of model-free and model-based RL developed by Wang et al. (2018) to further integrate episodic learning. In this account, a generic model-free “meta-learner” learns to deploy and coordinate among all of these learning algorithms. The meta-learner learns through brief encounters with many novel tasks, so that it learns to learn about new tasks. We show that when equipped with an episodic memory system inspired by theories of reinstatement and gating, the meta-learner learns to use the episodic and model-based learning algorithms observed in humans in a task designed to dissociate among the influences of various learning strategies. We discuss implications and predictions of the model.

The Machine-Learning Approach of Reinforcement Learning

How the Brain Makes Decisions ◽

10.1093/oso/9780198824367.003.0016 ◽

2020 ◽

pp. 105-109

Author(s):

Thomas Boraud

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Decision Making ◽

Reinforcement Learning ◽

Learning Algorithms ◽

Model Free ◽

Machine Learning Approach ◽

Markov Decision ◽

Decision Making Processes ◽

Alternative Approaches

This chapter assesses alternative approaches to reinforcement learning that have been developed in machine learning. The initial goal of this branch of artificial intelligence, which appeared in the middle of the twentieth century, was to develop and implement algorithms that allow a machine to learn. Originally, these machines were computers or more or less autonomous robotic automata. As artificial intelligence has developed and cross-fertilized with neuroscience, it has begun to be used to model the learning and decision-making processes of biological agents, broadening the meaning of the word ‘machine’. Theoreticians of this discipline define several categories of learning, but this chapter deals only with those related to reinforcement learning. To understand how these algorithms work, it is first necessary to explain the Markov chain and the Markov decision-making process. The chapter then goes on to examine model-free reinforcement learning algorithms, the actor-critic model, and finally model-based reinforcement learning algorithms.

Estimating Scale-Invariant Future in Continuous Time

Neural Computation ◽

10.1162/neco_a_01171 ◽

2019 ◽

Vol 31 (4) ◽

pp. 681-709 ◽

Cited By ~ 6

Author(s):

Zoran Tiganj ◽

Samuel J. Gershman ◽

Per B. Sederberg ◽

Marc W. Howard

Keyword(s):

Reinforcement Learning ◽

Continuous Time ◽

Learning Algorithms ◽

Future Time ◽

Scale Invariant ◽

Model Based ◽

Model Free ◽

Transition Functions ◽

Future Reward ◽

Future Outcomes

Natural learners must compute an estimate of future outcomes that follow from a stimulus in continuous time. Widely used reinforcement learning algorithms discretize continuous time and estimate either transition functions from one step to the next (model-based algorithms) or a scalar value of exponentially discounted future reward using the Bellman equation (model-free algorithms). An important drawback of model-based algorithms is that computational cost grows linearly with the amount of time to be simulated. An important drawback of model-free algorithms is the need to select a timescale required for exponential discounting. We present a computational mechanism, developed based on work in psychology and neuroscience, for computing a scale-invariant timeline of future outcomes. This mechanism efficiently computes an estimate of inputs as a function of future time on a logarithmically compressed scale and can be used to generate a scale-invariant power-law-discounted estimate of expected future reward. The representation of future time retains information about what will happen when. The entire timeline can be constructed in a single parallel operation that generates concrete behavioral and neural predictions. This computational mechanism could be incorporated into future reinforcement learning algorithms.
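
The contrast between exponential and power-law discounting drawn in this abstract can be made concrete with a small sketch. The functional forms `exp(-t / tau)` and `t ** (-alpha)`, the parameter values, and the helper `rescaling_ratio` are assumptions chosen for illustration and are not taken from the paper. The sketch checks what happens to the discount when every delay is stretched by the same factor: under a power law the change is the same at all delays, whereas under exponential discounting it depends on how the delay compares with the chosen timescale `tau`.

```python
import math


def exponential_discount(t, tau):
    """Exponential discount; tau is a timescale that has to be selected in advance."""
    return math.exp(-t / tau)


def power_law_discount(t, alpha=1.0):
    """Power-law discount t**(-alpha); it has no characteristic timescale."""
    return t ** (-alpha)


def rescaling_ratio(discount, t, k):
    """Factor by which the discount changes when the delay t is stretched to k * t."""
    return discount(k * t) / discount(t)


delays = [1.0, 10.0, 100.0]
exp_ratios = [
    rescaling_ratio(lambda t: exponential_discount(t, tau=10.0), t, 2.0)
    for t in delays
]
pow_ratios = [rescaling_ratio(power_law_discount, t, 2.0) for t in delays]

print("exponential:", exp_ratios)  # differs from delay to delay
print("power law:  ", pow_ratios)  # the same at every delay
```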

Multi-View Deep Attention Network for Reinforcement Learning (Student Abstract)

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i10.7177 ◽

2020 ◽

Vol 34 (10) ◽

pp. 13811-13812

Author(s):

Yueyue Hu ◽

Shiliang Sun ◽

Xin Xu ◽

Jing Zhao

Keyword(s):

Reinforcement Learning ◽

Single Agent ◽

Representation Learning ◽

Learning Task ◽

Comprehensive Strategy ◽

Attention Network ◽

Single View ◽

Learning Agents ◽

Proposed Model ◽

First Time

The representation approximated by a single deep network is usually limited for reinforcement learning agents. We propose a novel multi-view deep attention network (MvDAN), which introduces multi-view representation learning into the reinforcement learning task for the first time. The proposed model approximates a set of strategies from multiple representations and combines these strategies based on attention mechanisms to provide a comprehensive strategy for a single agent. Experimental results on eight Atari video games show that MvDAN achieves more effective and competitive performance than single-view reinforcement learning methods.
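
The idea of combining per-view strategies with attention can be sketched in a few lines. This is not the MvDAN architecture: the abstract does not specify how the attention weights are computed, so the sketch simply assumes that each view supplies a vector of action values plus one scalar attention score, and that the scores are turned into weights with a softmax. The names `softmax` and `combine_views` and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_views(view_action_values, attention_scores):
    """Attention-weighted average of the action values proposed by each view."""
    weights = softmax(attention_scores)
    n_actions = len(view_action_values[0])
    combined = [
        sum(w * values[a] for w, values in zip(weights, view_action_values))
        for a in range(n_actions)
    ]
    return combined, weights


# Two hypothetical views scoring three actions each.
views = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
combined, weights = combine_views(views, attention_scores=[0.0, 0.0])
best_action = max(range(len(combined)), key=lambda a: combined[a])
print(combined, weights, best_action)
```

With equal attention scores the combination reduces to a plain average of the views; unequal scores shift the combined strategy toward the views that receive more attention.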

Deep reinforcement learning methods for structure-guided processing path optimization

Journal of Intelligent Manufacturing ◽

10.1007/s10845-021-01805-z ◽

2021 ◽

Author(s):

Johannes Dornheim ◽

Lukas Morand ◽

Samuel Zeitvogel ◽

Tarek Iraki ◽

Norbert Link ◽

...

Keyword(s):

Reinforcement Learning ◽

A Priori ◽

Learning Algorithms ◽

Optimal Path ◽

Path Optimization ◽

Material Structure ◽

Target Structure ◽

Forming Process ◽

Structure Space ◽

Model Free

A major goal of materials design is to find material structures with desired properties and, in a second step, to find a processing path to reach one of these structures. In this paper, we propose and investigate a deep reinforcement learning approach for the optimization of processing paths. The goal is to find optimal processing paths in the material structure space that lead to target structures, which have been identified beforehand to result in desired material properties. There exists a target set containing one or multiple different structures, bearing the desired properties. Our proposed methods can find an optimal path from a start structure to a single target structure, or optimize the processing paths to one of the equivalent target structures in the set. In the latter case, the algorithm learns during processing to simultaneously identify the best reachable target structure and the optimal path to it. The proposed methods belong to the family of model-free deep reinforcement learning algorithms. They are guided by structure representations as features of the process state and by a reward signal, which is formulated based on a distance function in the structure space. Model-free reinforcement learning algorithms learn through trial and error while interacting with the process. Thereby, they are not restricted to information from a priori sampled processing data and are able to adapt to the specific process. The optimization itself is model-free and does not require any prior knowledge about the process itself. We instantiate and evaluate the proposed methods by optimizing paths of a generic metal forming process. We show the ability of both methods to find processing paths leading close to target structures and the ability of the extended method to identify target structures that can be reached effectively and efficiently and to focus on these targets for sample-efficient processing path optimization.
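
A reward built from a distance in structure space, with a set of equivalent targets, might look like the following minimal sketch. The abstract does not give the distance function or the exact reward formulation, so the Euclidean distance, the choice of the negative distance to the nearest target as the reward, and the names `distance` and `nearest_target_reward` are assumptions made purely for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two structure representations (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_target_reward(structure, targets):
    """Return (reward, index of nearest target); reward is the negative distance."""
    nearest = min(range(len(targets)), key=lambda i: distance(structure, targets[i]))
    return -distance(structure, targets[nearest]), nearest


# Hypothetical two-dimensional structure features and a target set of two structures.
targets = [(3.0, 4.0), (1.0, 0.0)]
reward, chosen = nearest_target_reward((0.0, 0.0), targets)
print(reward, chosen)
```

Taking the minimum over the target set is one simple way to let the reward point toward whichever equivalent target is currently closest.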

Dopamine transients do not act as model-free prediction errors during associative learning

Nature Communications ◽

10.1038/s41467-019-13953-1 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 6

Author(s):

Melissa J. Sharpe ◽

Hannah M. Batchelor ◽

Lauren E. Mueller ◽

Chun Yun Chang ◽

Etienne J. P. Maes ◽

...

Keyword(s):

Reinforcement Learning ◽

Associative Learning ◽

Prediction Error ◽

Intrinsic Value ◽

Learning Algorithms ◽

Dopamine Neurons ◽

Prediction Errors ◽

Model Free ◽

Reward Prediction ◽

Excess Value

Dopamine neurons are proposed to signal the reward prediction error in model-free reinforcement learning algorithms. This term represents the unpredicted or ‘excess’ value of the rewarding event, value that is then added to the intrinsic value of any antecedent cues, contexts or events. To support this proposal, proponents cite evidence that artificially-induced dopamine transients cause lasting changes in behavior. Yet these studies do not generally assess learning under conditions where an endogenous prediction error would occur. Here, to address this, we conducted three experiments where we optogenetically activated dopamine neurons while rats were learning associative relationships, both with and without reward. In each experiment, the antecedent cues failed to acquire value and instead entered into associations with the later events, whether valueless cues or valued rewards. These results show that in learning situations appropriate for the appearance of a prediction error, dopamine transients support associative, rather than model-free, learning.
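
The model-free account that this study tests can be illustrated with a standard temporal-difference update, in which the prediction error is the "excess" value of an outcome and a fraction of it is added to the value of the preceding cue. The abstract does not name a specific algorithm, so the TD(0)-style rule, the learning rate, and the function `td_update` are assumptions used only to show the hypothesis, not the paper's analysis.

```python
def td_update(values, cue, next_state, reward, alpha=0.1, gamma=1.0):
    """One temporal-difference step: compute the prediction error, then update the cue."""
    cue_value = values.get(cue, 0.0)
    delta = reward + gamma * values.get(next_state, 0.0) - cue_value  # prediction error
    values[cue] = cue_value + alpha * delta  # excess value is added to the cue
    return delta


values = {}
errors = [td_update(values, "cue", "end", reward=1.0) for _ in range(200)]
print(errors[0], errors[-1], values["cue"])
```

In this account the cue gradually acquires the value of the reward, and the prediction error shrinks as the reward becomes expected. The experiments summarised above report that cues paired with dopamine stimulation did not acquire value in this way.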

Learning and generalising object extraction skill for contact-rich disassembly tasks: an introductory study

The International Journal of Advanced Manufacturing Technology ◽

10.1007/s00170-021-08086-z ◽

2021 ◽

Author(s):

Antonio Serrano-Muñoz ◽

Nestor Arana-Arexolaleiba ◽

Dimitrios Chrysostomou ◽

Simon Bøgh

Keyword(s):

Machine Learning ◽

Reinforcement Learning ◽

State Of The Art ◽

Learning Algorithms ◽

Robotic Manipulation ◽

Object Extraction ◽

Learning Agents ◽

Machine Learning Methods ◽

Key Concepts ◽

Planning And Operation

Remanufacturing automation must be designed to be flexible and robust enough to overcome the uncertainties, conditions of the products, and complexities in the planning and operation of the processes. Machine learning methods, in particular reinforcement learning, are presented as techniques to learn, improve, and generalise the automation of many robotic manipulation tasks (most of them related to grasping, picking, or assembly). However, these methods have been little exploited in remanufacturing, in particular in disassembly tasks. This work presents the state of the art of contact-rich disassembly using reinforcement learning algorithms and a study of the generalisation of object extraction skills when applied to contact-rich disassembly tasks. The generalisation capabilities of two state-of-the-art reinforcement learning agents (trained in simulation) are tested and evaluated in simulation and in the real world while performing a disassembly task. Results show that at least one of the agents can generalise the contact-rich extraction skill. In addition, this work identifies key concepts and gaps for the research and application of reinforcement learning algorithms in disassembly tasks.
