REINFORCEMENT LEARNING WITH GOAL-DIRECTED ELIGIBILITY TRACES

2004 ◽  
Vol 15 (09) ◽  
pp. 1235-1247 ◽  
Author(s):  
M. ANDRECUT ◽  
M. K. ALI

The eligibility trace is the most important mechanism used so far in reinforcement learning to handle delayed reward. Here, we introduce a new kind of eligibility trace, the goal-directed trace, and show that it results in more reliable learning than the conventional trace. In addition, we also propose a new efficient algorithm for solving the goal-directed reinforcement learning problem.
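
The abstract does not spell out how the goal-directed trace is computed, so as context the sketch below shows only the conventional accumulating eligibility trace that the paper takes as its baseline, inside a tabular Sarsa(lambda) loop. The `env` interface (`reset()` returning a state index, `step(a)` returning `(state, reward, done)`) and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.9, eps=0.1):
    """Run one episode of tabular Sarsa(lambda) with a conventional
    accumulating eligibility trace (the baseline the paper compares
    its goal-directed trace against)."""
    n_actions = Q.shape[1]
    e = np.zeros_like(Q)              # eligibility trace per (state, action)

    def pick(s):
        # epsilon-greedy action selection
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(Q[s].argmax())

    s = env.reset()
    a = pick(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = pick(s_next)
        target = r if done else r + gamma * Q[s_next, a_next]
        delta = target - Q[s, a]      # temporal-difference error
        e[s, a] += 1.0                # mark the visited pair as eligible
        Q += alpha * delta * e        # credit all recently visited pairs
        e *= gamma * lam              # let older credit decay
        s, a = s_next, a_next
    return Q
```

The trace `e` spreads each temporal-difference error back over recently visited state-action pairs, which is how the mechanism handles delayed reward.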

Author(s):  
Dómhnall J. Jennings ◽  
Eduardo Alonso ◽  
Esther Mondragón ◽  
Charlotte Bonardi

Standard associative learning theories typically fail to conceptualise the temporal properties of a stimulus, and hence cannot easily make predictions about the effects such properties might have on the magnitude of conditioning phenomena. Despite this, in intuitive terms we might expect the temporal properties of a stimulus that is paired with some outcome to be important. In particular, there is no previous research addressing the way that fixed- or variable-duration stimuli can affect overshadowing. In this chapter we report results which show that the degree of overshadowing depends on the distribution form - fixed or variable - of the overshadowing stimulus, and argue that conditioning is weaker under conditions of temporal uncertainty. These results are discussed in terms of models of conditioning and timing. We conclude that the temporal difference model, which has been extensively applied to the reinforcement learning problem in machine learning, accounts for the key findings of our study.
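
For readers unfamiliar with the temporal difference model invoked here, the following is a minimal real-time TD sketch of conditioning: a stimulus representation predicts the upcoming US and associative weights are adjusted by the prediction error. It is a textbook illustration rather than the authors' simulations, and the feature coding, trial structure, and parameters are assumptions.

```python
import numpy as np

def td_conditioning_trial(w, stimulus_features, us_times, alpha=0.05, gamma=0.98):
    """One trial of a generic real-time temporal-difference model of
    conditioning. stimulus_features[t] is the feature vector coding the
    CS at time step t, us_times marks the steps on which the US occurs,
    and w holds the associative weights (one per feature)."""
    T = len(stimulus_features)
    for t in range(T - 1):
        v_now = float(w @ stimulus_features[t])
        v_next = float(w @ stimulus_features[t + 1])
        us = 1.0 if (t + 1) in us_times else 0.0
        delta = us + gamma * v_next - v_now        # prediction error
        w += alpha * delta * stimulus_features[t]  # adjust active features
    return w

# Hypothetical fixed-duration CS: one feature stays active for 10 steps,
# with the US arriving at step 10.
features = [np.array([1.0, 0.0])] * 10 + [np.array([0.0, 0.0])]
w = td_conditioning_trial(np.zeros(2), features, us_times={10})
```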


2007 ◽  
Vol 19 (6) ◽  
pp. 1468-1502 ◽  
Author(s):  
Răzvan V. Florian

The persistent modification of synaptic efficacy as a function of the relative timing of pre- and postsynaptic spikes is a phenomenon known as spike-timing-dependent plasticity (STDP). Here we show that the modulation of STDP by a global reward signal leads to reinforcement learning. We first derive analytically learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity, by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons. These rules have several features common to plasticity mechanisms experimentally found in the brain. We then demonstrate in simulations of networks of integrate-and-fire neurons the efficacy of two simple learning rules involving modulated STDP. One rule is a direct extension of the standard STDP model (modulated STDP), and the other one involves an eligibility trace stored at each synapse that keeps a decaying memory of recent pairs of pre- and postsynaptic spikes (modulated STDP with eligibility trace). This latter rule permits learning even if the reward signal is delayed. The proposed rules are able to solve the XOR problem with both rate-coded and temporally coded input and to learn a target output firing-rate pattern. These learning rules are biologically plausible, may be used for training generic artificial spiking neural networks, regardless of the neural model used, and suggest the experimental investigation in animals of the existence of reward-modulated STDP.
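
As a rough illustration of the second rule (modulated STDP with an eligibility trace), the sketch below accumulates an STDP-like quantity into a decaying per-synapse trace and lets a global reward signal gate the actual weight change. The exact rules in the paper are derived from the stochastic spike response model, so the trace dynamics, time constants, and array layout here should be read as assumptions.

```python
import numpy as np

def r_stdp_step(state, pre_spiked, post_spiked, reward,
                a_plus=0.01, a_minus=0.012,
                tau_plus=0.02, tau_minus=0.02, tau_e=0.5,
                lr=0.1, dt=0.001):
    """One time step of a simplified reward-modulated STDP rule with a
    per-synapse eligibility trace. 'state' holds numpy arrays:
      w      : (n_post, n_pre) synaptic weights
      e      : (n_post, n_pre) eligibility traces
      x_pre  : (n_pre,)  low-pass filtered presynaptic spike trains
      x_post : (n_post,) low-pass filtered postsynaptic spike trains
    pre_spiked / post_spiked are boolean arrays for this step and
    'reward' is the global modulatory signal."""
    w, e = state["w"], state["e"]
    x_pre, x_post = state["x_pre"], state["x_post"]

    # Decay the pre/post spike traces, then register this step's spikes.
    x_pre *= np.exp(-dt / tau_plus)
    x_post *= np.exp(-dt / tau_minus)
    x_pre[pre_spiked] += 1.0
    x_post[post_spiked] += 1.0

    # STDP-like term: potentiation at post spikes (pre-before-post pairings),
    # depression at pre spikes (post-before-pre pairings).
    stdp = (a_plus * np.outer(post_spiked.astype(float), x_pre)
            - a_minus * np.outer(x_post, pre_spiked.astype(float)))

    # The STDP term feeds a decaying eligibility trace instead of changing
    # the weights directly; the global reward gates the actual update,
    # which is what allows learning from a delayed reward.
    e *= np.exp(-dt / tau_e)
    e += stdp
    w += lr * reward * e * dt
    return state
```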


AI Magazine ◽  
2011 ◽  
Vol 32 (1) ◽  
pp. 15 ◽  
Author(s):  
Matthew E. Taylor ◽  
Peter Stone

Transfer learning has recently gained popularity due to the development of algorithms that can successfully generalize information across multiple tasks. This article focuses on transfer in the context of reinforcement learning domains, a general learning framework where an agent acts in an environment to maximize a reward signal. The goals of this article are to (1) familiarize readers with the transfer learning problem in reinforcement learning domains, (2) explain why the problem is both interesting and difficult, (3) present a selection of existing techniques that demonstrate different solutions, and (4) provide representative open problems in the hope of encouraging additional research in this exciting area.
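
As one concrete instance of the kind of technique surveyed, the sketch below warm-starts a target task's Q-table from a source task through inter-task mappings. The mapping dictionaries and shapes are hypothetical, and this is one common style of value-function transfer rather than a specific algorithm from the article.

```python
import numpy as np

def transfer_q_table(q_source, state_map, action_map,
                     n_target_states, n_target_actions):
    """Warm-start a target task's Q-table from a learned source-task
    Q-table through inter-task mappings (target index -> source index).
    The mappings are assumed to be supplied by hand or by a separate
    mapping-learning step."""
    q_target = np.zeros((n_target_states, n_target_actions))
    for s in range(n_target_states):
        for a in range(n_target_actions):
            q_target[s, a] = q_source[state_map[s], action_map[a]]
    return q_target  # then refined by ordinary RL in the target task
```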


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Marco P Lehmann ◽  
He A Xu ◽  
Vasiliki Liakoni ◽  
Michael H Herzog ◽  
Wulfram Gerstner ◽  
...  

In many daily tasks, we make multiple decisions before reaching a goal. In order to learn such sequences of decisions, a mechanism to link earlier actions to later reward is necessary. Reinforcement learning (RL) theory suggests two classes of algorithms solving this credit assignment problem: In classic temporal-difference learning, earlier actions receive reward information only after multiple repetitions of the task, whereas models with eligibility traces reinforce entire sequences of actions from a single experience (one-shot). Here, we show one-shot learning of sequences. We developed a novel paradigm to directly observe which actions and states along a multi-step sequence are reinforced after a single reward. By focusing our analysis on those states for which RL with and without eligibility trace make qualitatively distinct predictions, we find direct behavioral (choice probability) and physiological (pupil dilation) signatures of reinforcement learning with eligibility trace across multiple sensory modalities.
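
The qualitative distinction the authors exploit can be made concrete with a small sketch: after a single rewarded episode, TD learning without a trace updates only the state visited just before the reward, whereas an accumulating eligibility trace propagates credit along the whole visited sequence. The episode, step size, and decay values below are illustrative.

```python
import numpy as np

def update_no_trace(V, episode, alpha=0.5, gamma=0.9):
    """TD(0) update after a single rewarded episode: only the state
    immediately preceding the reward is updated, so earlier states need
    many repetitions before reward information reaches them."""
    for (s, r, s_next) in episode:
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])
    return V

def update_with_trace(V, episode, alpha=0.5, gamma=0.9, lam=0.9):
    """TD(lambda) with an accumulating eligibility trace: a single
    rewarded episode propagates credit back along the whole visited
    sequence (the 'one-shot' signature discussed in the paper)."""
    e = np.zeros_like(V)
    for (s, r, s_next) in episode:
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        delta = target - V[s]
        e[s] += 1.0
        V += alpha * delta * e
        e *= gamma * lam
    return V

# Hypothetical 4-step sequence 0 -> 1 -> 2 -> 3 ending in a reward of 1.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 1.0, None)]
print(update_no_trace(np.zeros(4), list(episode)))    # only V[3] moves
print(update_with_trace(np.zeros(4), list(episode)))  # all visited states move
```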


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Xiaogang Ruan ◽  
Peng Li ◽  
Xiaoqing Zhu ◽  
Hejie Yu ◽  
Naigong Yu

Efficient exploration in visually rich and complex environments is a challenge for developing artificial intelligence (AI) agents. In this study, we formulate exploration as a reinforcement learning problem and rely on intrinsic motivation to guide exploration behavior. This intrinsic motivation is driven by curiosity and is computed from episode memory. To generate the intrinsic motivation, we combine a count-based method with a temporal-distance measure, computed synchronously. We tested our approach in 3D maze-like environments and validated its performance in exploration tasks through extensive experiments. The experimental results show that our agent can learn exploration ability from raw sensory input and accomplish autonomous exploration across different mazes. In addition, the learned policy is not biased by stochastic objects. We also analyze the effects of different training methods and driving forces on the exploration policy.
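
A hedged sketch of how such an intrinsic reward could be assembled from a count-based bonus plus an episodic-memory novelty term (a stand-in for the temporal-distance component). The authors' exact formulation is not given in the abstract, so every name and constant below is an assumption.

```python
import numpy as np

def intrinsic_reward(obs_embedding, episode_memory, visit_counts, obs_key,
                     beta_count=0.1, beta_episodic=0.5, novelty_threshold=1.0):
    """Combine a count-based bonus with an episodic-memory bonus.
    obs_embedding : embedding of the current observation (numpy array)
    episode_memory: list of embeddings stored earlier in this episode
    visit_counts  : dict mapping a discretized observation key to counts
    obs_key       : discretized key of the current observation"""
    # Count-based term: rarely visited observations earn a larger reward.
    visit_counts[obs_key] = visit_counts.get(obs_key, 0) + 1
    r_count = beta_count / np.sqrt(visit_counts[obs_key])

    # Episodic term: reward observations far (in embedding distance, a
    # proxy for temporal distance) from everything seen this episode.
    if episode_memory:
        dists = [np.linalg.norm(obs_embedding - m) for m in episode_memory]
        r_episodic = beta_episodic if min(dists) > novelty_threshold else 0.0
    else:
        r_episodic = beta_episodic
    episode_memory.append(obs_embedding)

    return r_count + r_episodic  # added to the extrinsic reward, if any
```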


Author(s):  
Alberto Maria Metelli

Reinforcement Learning (RL) has emerged as an effective approach to address a variety of complex control tasks. In a typical RL problem, an agent interacts with the environment by perceiving observations and performing actions, with the ultimate goal of maximizing the cumulative reward. In the traditional formulation, the environment is assumed to be a fixed entity that cannot be externally controlled. However, there exist several real-world scenarios in which the environment offers the opportunity to configure some of its parameters, with diverse effects on the agent’s learning process. In this contribution, we provide an overview of the main aspects of environment configurability. We start by introducing the formalism of Configurable Markov Decision Processes (Conf-MDPs) and we illustrate the solution concepts. Then, we review the algorithms for solving the learning problem in Conf-MDPs. Finally, we present two applications of Conf-MDPs: policy space identification and control frequency adaptation.
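
A minimal data-structure sketch of the Conf-MDP idea: the transition model depends on an externally configurable parameter vector in addition to the state and action. Field names and the `configure` interface are illustrative rather than the paper's notation, and none of the solution algorithms mentioned in the contribution are reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable
import numpy as np

@dataclass
class ConfMDP:
    """Illustration of a Configurable MDP: besides states and actions,
    the transition model depends on a configurable parameter vector
    'omega' that can be changed from outside the usual agent loop."""
    n_states: int
    n_actions: int
    transition_fn: Callable[[int, int, np.ndarray], np.ndarray]  # P(.|s, a; omega)
    reward_fn: Callable[[int, int], float]
    omega: np.ndarray = field(default_factory=lambda: np.zeros(1))

    def step(self, s: int, a: int, rng: np.random.Generator):
        p = self.transition_fn(s, a, self.omega)   # dynamics depend on omega
        s_next = int(rng.choice(self.n_states, p=p))
        return s_next, self.reward_fn(s, a)

    def configure(self, new_omega):
        """The extra degree of freedom: the environment parameters can be
        changed (by the agent or a configurator) to aid learning."""
        self.omega = np.asarray(new_omega, dtype=float)
```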


2020 ◽  
Vol 34 (05) ◽  
pp. 8878-8885
Author(s):  
Haoyu Song ◽  
Wei-Nan Zhang ◽  
Jingwen Hu ◽  
Ting Liu

Consistency is one of the major challenges faced by dialogue agents. A human-like dialogue agent should not only respond naturally, but also maintain a consistent persona. In this paper, we exploit the advantages of natural language inference (NLI) technique to address the issue of generating persona-consistent dialogues. Different from existing work that re-ranks the retrieved responses through an NLI model, we cast the task as a reinforcement learning problem and propose to exploit the NLI signals from response-persona pairs as rewards for the process of dialogue generation. Specifically, our generator employs an attention-based encoder-decoder to generate persona-based responses. Our evaluator consists of two components: an adversarially trained naturalness module and an NLI based consistency module. Moreover, we use another well-performing NLI model in the evaluation of persona-consistency. Experimental results on both human and automatic metrics, including the model-based consistency evaluation, demonstrate that the proposed approach outperforms strong generative baselines, especially in the persona-consistency of generated responses.
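
A small sketch of the reward idea described here: an NLI model scores each (persona sentence, generated response) pair as entailment, neutral, or contradiction, and the scalar consistency reward favors entailment and penalizes contradiction. The reward values, the averaging over persona sentences, and the probability format are assumptions; in training, such a scalar would scale the generator's log-likelihood term in a REINFORCE-style update alongside the naturalness reward.

```python
def nli_consistency_reward(nli_probs, r_entail=1.0, r_neutral=0.0, r_contradict=-1.0):
    """Turn NLI predictions for (persona sentence, response) pairs into a
    scalar reward. nli_probs is a non-empty list of
    (p_entail, p_neutral, p_contradict) tuples, one per persona sentence;
    the reward is the expected label value, averaged over sentences."""
    rewards = [r_entail * pe + r_neutral * pn + r_contradict * pc
               for (pe, pn, pc) in nli_probs]
    return sum(rewards) / len(rewards)

# Hypothetical example: the response entails the first persona sentence
# and is neutral with respect to the second.
print(nli_consistency_reward([(0.9, 0.08, 0.02), (0.1, 0.85, 0.05)]))
```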

