Choice-selective sequences dominate in cortical relative to thalamic inputs to nucleus accumbens, providing a potential substrate for credit assignment

2019 ◽  
Author(s):  
Nathan F. Parker ◽  
Avinash Baidya ◽  
Julia Cox ◽  
Laura Haetzel ◽  
Anna Zhukovskaya ◽  
...  

How are actions linked with subsequent outcomes to guide choices? The nucleus accumbens, which is implicated in this process, receives glutamatergic inputs from the prelimbic cortex and midline regions of the thalamus. However, little is known about what is represented in these input pathways. By comparing these inputs during a reinforcement learning task in mice, we discovered that prelimbic cortical inputs preferentially represent actions and choices, whereas midline thalamic inputs preferentially represent cues. Choice-selective activity in the prelimbic cortical inputs is organized in sequences that persist beyond the outcome. Through computational modeling, we demonstrate that these sequences can support the neural implementation of temporal difference learning, a powerful algorithm to connect actions and outcomes across time. Finally, we test and confirm predictions of our circuit model by direct manipulation of nucleus accumbens input neurons. Thus, we integrate experiment and modeling to suggest a neural solution for credit assignment.
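To make concrete how a choice-selective sequence can bridge the delay between an action and its outcome, here is a minimal tabular TD(0) sketch. It illustrates the general temporal difference algorithm, not the authors' circuit-level model; the number of states, parameters, and trial structure are arbitrary placeholders.

```python
import numpy as np

# Minimal TD(0) sketch (not the authors' circuit model): values learned over a
# choice-selective sequence of states propagate reward information backward in
# time across trials, linking the earlier choice to the delayed outcome.
n_states, alpha, gamma, n_trials = 10, 0.1, 0.95, 200
V = np.zeros(n_states + 1)                  # one value per sequence element (+ terminal)

for _ in range(n_trials):
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0   # reward arrives only at the outcome
        delta = r + gamma * V[s + 1] - V[s]     # TD error acts as the teaching signal
        V[s] += alpha * delta

print(V[:n_states])  # values ramp from the start of the sequence toward the outcome
```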

2009 ◽  
Vol 21 (4) ◽  
pp. 1173-1202 ◽  
Author(s):  
Christoph Kolodziejski ◽  
Bernd Porr ◽  
Florentin Wörgötter

In this theoretical contribution, we provide mathematical proof that two of the most important classes of network learning, correlation-based differential Hebbian learning and reward-based temporal difference learning, are asymptotically equivalent when the timing of learning is controlled by a modulatory signal. This opens the opportunity to consistently reformulate most of the abstract reinforcement learning framework from a correlation-based perspective that is more closely related to the biophysics of neurons.
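As a rough, schematic illustration of how the two rule classes line up (a sketch in generic notation, not the paper's proof), the differential Hebbian update correlates a presynaptic trace with the temporal derivative of postsynaptic activity, gated by a modulatory signal, while the TD update puts the reward prediction error in the place of that derivative-like term. The function names and the gating variable `mu` are placeholders.

```python
def differential_hebbian_update(w, x, y_now, y_prev, mu, lr=0.01):
    # dw ~ mu * x * dy/dt: presynaptic trace x times the (discrete) temporal
    # derivative of postsynaptic activity y, gated by a modulatory signal mu.
    return w + lr * mu * x * (y_now - y_prev)

def td_update(w, x, r, v_now, v_next, gamma=0.9, lr=0.01):
    # dw ~ x * delta: the reward prediction (TD) error takes the role of the
    # derivative-like term in the rule above.
    delta = r + gamma * v_next - v_now
    return w + lr * delta * x
```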


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Marco P Lehmann ◽  
He A Xu ◽  
Vasiliki Liakoni ◽  
Michael H Herzog ◽  
Wulfram Gerstner ◽  
...  

In many daily tasks, we make multiple decisions before reaching a goal. To learn such sequences of decisions, a mechanism that links earlier actions to later reward is necessary. Reinforcement learning (RL) theory suggests two classes of algorithms for solving this credit assignment problem: in classic temporal-difference learning, earlier actions receive reward information only after multiple repetitions of the task, whereas models with eligibility traces reinforce entire sequences of actions from a single experience (one-shot). Here, we show one-shot learning of sequences. We developed a novel paradigm to directly observe which actions and states along a multi-step sequence are reinforced after a single reward. By focusing our analysis on those states for which RL models with and without an eligibility trace make qualitatively distinct predictions, we find direct behavioral (choice probability) and physiological (pupil dilation) signatures of reinforcement learning with an eligibility trace across multiple sensory modalities.
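A toy illustration of the qualitative difference the analysis exploits (a sketch on an arbitrary five-state chain, not the authors' paradigm): after a single rewarded episode, TD(0) credits only the state immediately preceding the reward, whereas an eligibility trace spreads credit back along the whole sequence in one shot.

```python
import numpy as np

def run_episode(lam, alpha=0.5, gamma=0.9, n=5):
    # One rewarded pass through an n-state chain with accumulating traces.
    V, e = np.zeros(n + 1), np.zeros(n + 1)
    for s in range(n):
        r = 1.0 if s == n - 1 else 0.0          # reward only at the final state
        delta = r + gamma * V[s + 1] - V[s]     # TD error
        e *= gamma * lam                        # decay all traces
        e[s] += 1.0                             # mark the visited state
        V += alpha * delta * e                  # credit every traced state
    return V[:n]

print(run_episode(lam=0.0))   # TD(0): credit reaches only the last state
print(run_episode(lam=0.9))   # TD(lambda): credit spreads back along the sequence
```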


2018 ◽  
Vol 63 ◽  
pp. 461-494
Author(s):  
Bo Liu ◽  
Ian Gemp ◽  
Mohammad Ghavamzadeh ◽  
Ji Liu ◽  
Sridhar Mahadevan ◽  
...  

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. Our theoretical analysis implies that the GTD family of algorithms is comparable to, and may indeed be preferred over, existing least-squares TD methods for off-policy learning, owing to its linear computational complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
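For orientation, here is a sketch of the standard GTD2 updates with linear function approximation, the algorithm family this paper re-derives from a primal-dual saddle-point objective. The accelerated GTD2-MP variant is not shown, and the step sizes and variable names are arbitrary.

```python
import numpy as np

def gtd2_step(w, theta, phi, phi_next, r, gamma=0.99, alpha=0.01, beta=0.05):
    # One GTD2 update with linear features phi: w holds the value weights
    # (primal variable), theta the auxiliary weights (dual variable).
    delta = r + gamma * phi_next @ w - phi @ w                 # TD error under w
    w = w + alpha * (phi - gamma * phi_next) * (phi @ theta)   # value-weight update
    theta = theta + beta * (delta - phi @ theta) * phi         # auxiliary update
    return w, theta
```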


2010 ◽  
Vol 22 (2) ◽  
pp. 342-376 ◽  
Author(s):  
Tetsuro Morimura ◽  
Eiji Uchibe ◽  
Junichiro Yoshimoto ◽  
Jan Peters ◽  
Kenji Doya

Most conventional policy gradient reinforcement learning (PGRL) algorithms neglect (or do not explicitly make use of) a term in the average reward gradient with respect to the policy parameter. That term involves the derivative of the stationary state distribution, which captures the sensitivity of that distribution to changes in the policy parameter. Although the bias introduced by this omission can be reduced by setting the forgetting rate γ for the value functions close to 1, these algorithms do not permit γ to be set exactly at γ = 1. In this article, we propose a method for estimating the log stationary state distribution derivative (LSD), a useful form of the derivative of the stationary state distribution, through a backward Markov chain formulation and a temporal difference learning framework. A new policy gradient (PG) framework based on the LSD is also proposed, in which the average reward gradient can be estimated by setting γ = 0, so that it becomes unnecessary to learn the value functions. We also test the performance of the proposed algorithms on simple benchmark tasks and show that they can improve the performance of existing PG methods.
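In generic notation (a sketch of the decomposition the abstract refers to, not the paper's full derivation), the average reward and its gradient can be written as below; the second term inside the expectation is the log stationary distribution derivative (LSD) that conventional PG methods drop.

```latex
\[
\rho(\theta) = \sum_{s,a} d_\theta(s)\, \pi_\theta(a \mid s)\, r(s,a),
\qquad
\nabla_\theta \rho(\theta)
 = \mathbb{E}_{d_\theta,\, \pi_\theta}\!\left[
     \bigl( \nabla_\theta \log \pi_\theta(a \mid s)
          + \nabla_\theta \log d_\theta(s) \bigr)\, r(s,a) \right].
\]
```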


1999 ◽  
Vol 11 ◽  
pp. 241-276 ◽  
Author(s):  
D. E. Moriarty ◽  
A. C. Schultz ◽  
J. J. Grefenstette

There are two distinct approaches to solving reinforcement learning problems, namely, searching in value function space and searching in policy space. Temporal difference methods and evolutionary algorithms are well-known examples of these approaches. Kaelbling, Littman and Moore recently provided an informative survey of temporal difference methods. This article focuses on the application of evolutionary algorithms to the reinforcement learning problem, emphasizing alternative policy representations, credit assignment methods, and problem-specific genetic operators. Strengths and weaknesses of the evolutionary approach to reinforcement learning are presented, along with a survey of representative applications.
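A minimal policy-search sketch of the evolutionary approach surveyed here: credit is assigned to whole policies through their episode return rather than to individual actions through value estimates. The `evaluate_return` argument is a hypothetical placeholder for running one episode of the task with the given policy parameters, and the selection and mutation scheme is deliberately simple.

```python
import numpy as np

def evolve(evaluate_return, dim, pop_size=20, n_elite=5, generations=50, sigma=0.1):
    # Evolutionary policy search: fitness = episode return of each parameter vector.
    population = [np.random.randn(dim) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [evaluate_return(p) for p in population]       # one rollout each
        ranked = sorted(zip(fitness, population), key=lambda fp: -fp[0])
        elites = [p for _, p in ranked[:n_elite]]                # keep the best policies
        population = [elites[i % n_elite] + sigma * np.random.randn(dim)
                      for i in range(pop_size)]                  # mutated offspring
    return max(population, key=evaluate_return)                  # best final policy
```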


2012 ◽  
Vol 24 (6) ◽  
pp. 1553-1568 ◽  
Author(s):  
Samuel J. Gershman ◽  
Christopher D. Moore ◽  
Michael T. Todd ◽  
Kenneth A. Norman ◽  
Per B. Sederberg

The successor representation was introduced into reinforcement learning by Dayan (1993) as a means of facilitating generalization between states with similar successors. Although reinforcement learning in general has been used extensively as a model of psychological and neural processes, the psychological validity of the successor representation has yet to be explored. An interesting possibility is that the successor representation can be used not only for reinforcement learning but for episodic learning as well. Our main contribution is to show that a variant of the temporal context model (TCM; Howard & Kahana, 2002), an influential model of episodic memory, can be understood as directly estimating the successor representation using the temporal difference learning algorithm (Sutton & Barto, 1998). This insight leads to a generalization of TCM and new experimental predictions. In addition to casting a new normative light on TCM, this equivalence suggests a previously unexplored point of contact between different learning systems.
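The update at the heart of this equivalence is the TD(0) rule for a tabular successor representation, sketched below in generic form (my notation, not the authors' generalized TCM): row M[s] estimates the discounted expected future occupancy of every state when starting from s.

```python
import numpy as np

def sr_td_update(M, s, s_next, alpha=0.1, gamma=0.9):
    # TD(0) update of the successor representation after a transition s -> s_next.
    onehot = np.eye(M.shape[0])[s]                 # immediate occupancy of state s
    delta = onehot + gamma * M[s_next] - M[s]      # vector-valued TD error
    M[s] = M[s] + alpha * delta
    return M
```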


2019 ◽  
Author(s):  
Leor M Hackel ◽  
Jeffrey Jordan Berg ◽  
Björn Lindström ◽  
David Amodio

Do habits play a role in our social impressions? To investigate the contribution of habits to the formation of social attitudes, we examined the roles of model-free and model-based reinforcement learning in social interactions—computations linked in past work to habit and planning, respectively. Participants in this study learned about novel individuals in a sequential reinforcement learning paradigm, choosing financial advisors who led them to high- or low-paying stocks. Results indicated that participants relied on both model-based and model-free learning, such that each independently predicted choice during the learning task and self-reported liking in a post-task assessment. Specifically, participants liked advisors who could provide large future rewards as well as advisors who had provided them with large rewards in the past. Moreover, participants varied in their use of model-based and model-free learning strategies, and this individual difference influenced the way in which learning related to self-reported attitudes: among participants who relied more on model-free learning, model-free social learning related more to post-task attitudes. We discuss implications for attitudes, trait impressions, and social behavior, as well as the role of habits in a memory systems model of social cognition.
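A common way to formalize the two systems in such paradigms (an illustrative modeling convention, not necessarily the authors' exact fitted model) is to mix model-based and model-free action values with a weight and pass the mixture through a softmax to obtain choice probabilities:

```python
import numpy as np

def choice_probs(q_mb, q_mf, w=0.5, beta=3.0):
    # Weighted mixture of model-based and model-free values, softmax choice rule.
    q = w * np.asarray(q_mb) + (1.0 - w) * np.asarray(q_mf)
    logits = beta * q
    p = np.exp(logits - logits.max())              # numerically stable softmax
    return p / p.sum()

print(choice_probs(q_mb=[0.8, 0.2], q_mf=[0.3, 0.6], w=0.7))
```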

