Differential reinforcement encoding along the hippocampal long axis helps resolve the explore/exploit dilemma

2020 ◽  
Author(s):  
Alexandre Y. Dombrovski ◽  
Beatriz Luna ◽  
Michael N. Hallquist

Abstract: When making decisions, should one exploit known good options or explore potentially better alternatives? Exploration of spatially unstructured options depends on the neocortex, striatum, and amygdala. In natural environments, however, better options often cluster together, forming structured value distributions. The hippocampus binds reward information into allocentric cognitive maps to support navigation and foraging in such spaces. Using a reinforcement learning task with a spatially structured reward function, we show that human posterior hippocampus (PH) invigorates exploration while anterior hippocampus (AH) supports the transition to exploitation. These dynamics depend on differential reinforcement representations in the PH and AH. Whereas local reward prediction error signals are early and phasic in the PH tail, global value maximum signals are delayed and sustained in the AH body. AH compresses reinforcement information across episodes, updating the location and prominence of the value maximum and displaying goal cell-like ramping activity when navigating toward it.
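
For readers who want a concrete feel for the kind of model such a task invites, here is a minimal sketch in Python: a value map built from Gaussian basis functions over a circular space, updated by local reward prediction errors, with a softmax choice rule whose temperature is annealed to move from exploration toward exploitation. The basis-function representation, the reward landscape, the annealing schedule, and all parameter values are illustrative assumptions, not the authors' actual model or task.

import numpy as np

rng = np.random.default_rng(0)

n_locations = 100                        # discretised circular space
n_basis = 12                             # Gaussian basis functions tiling the space
centers = np.linspace(0, n_locations, n_basis, endpoint=False)
width = n_locations / n_basis

def basis(loc):
    # Gaussian basis activations for a location on a circular track (assumed representation)
    d = np.abs(loc - centers)
    d = np.minimum(d, n_locations - d)   # circular distance
    return np.exp(-0.5 * (d / width) ** 2)

def true_reward(loc):
    # Toy spatially structured reward: better options cluster around location 70 (assumption)
    d = min(abs(loc - 70), n_locations - abs(loc - 70))
    return float(rng.random() < 0.2 + 0.6 * np.exp(-d ** 2 / 200))

weights = np.zeros(n_basis)              # parameters of the value map
alpha = 0.1                              # learning rate (assumed)

for trial in range(300):
    values = np.array([weights @ basis(l) for l in range(n_locations)])
    # softmax temperature anneals: high early (explore), low late (exploit)
    temperature = max(0.05, np.exp(-trial / 100))
    p = np.exp((values - values.max()) / temperature)
    p /= p.sum()
    loc = rng.choice(n_locations, p=p)
    reward = true_reward(loc)
    rpe = reward - weights @ basis(loc)  # local reward prediction error
    weights += alpha * rpe * basis(loc)  # credit spreads to nearby locations via the bases
    value_maximum = int(np.argmax(values))

print("learned value peak near location", value_maximum)

In this toy version, the "global value maximum" the abstract refers to corresponds to the argmax of the learned value map, which the agent samples increasingly often as the temperature falls.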


2019 ◽  
Author(s):  
Emma L. Roscow ◽  
Matthew W. Jones ◽  
Nathan F. Lepora

Abstract: Neural activity encoding recent experiences is replayed during sleep and rest to promote consolidation of the corresponding memories. However, precisely which features of experience influence replay prioritisation to optimise adaptive behaviour remains unclear. Here, we trained adult male rats on a novel maze-based reinforcement learning task designed to dissociate reward outcomes from reward-prediction errors. Four variations of a reinforcement learning model were fitted to the rats’ behaviour over multiple days. Behaviour was best predicted by a model incorporating replay biased by reward-prediction error, compared to the same model with no replay; random replay or reward-biased replay produced poorer predictions of behaviour. This insight disentangles the influences of salience on replay, suggesting that reinforcement learning is tuned by post-learning replay biased by reward-prediction error, not by reward per se. This work therefore provides a behavioural and theoretical toolkit with which to measure and interpret replay in striatal, hippocampal and neocortical circuits.
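
As a minimal sketch of the central idea, the snippet below grafts replay prioritised by the magnitude of the reward-prediction error onto a tabular Q-learning agent: transitions are stored with their unsigned prediction errors and replayed offline with probability proportional to that error. The toy environment, the prioritisation rule, and all parameter values are assumptions for illustration; the authors' models were fitted to rat behaviour on their maze task rather than implemented this way.

import numpy as np

rng = np.random.default_rng(1)

n_states, n_actions = 6, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.2, 0.9
buffer = []                                   # (state, action, reward, next_state, |RPE|)

def step(s, a):
    # Toy transition/reward function (assumed): action 1 in the last state pays off probabilistically
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = float(rng.random() < 0.8) if (s == n_states - 1 and a == 1) else 0.0
    return r, s_next

for episode in range(200):
    s = 0
    for t in range(10):
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        r, s_next = step(s, a)
        rpe = r + gamma * Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * rpe
        buffer.append((s, a, r, s_next, abs(rpe)))
        s = s_next

    # Offline replay: sample stored transitions with probability proportional to |RPE|
    priorities = np.array([b[4] for b in buffer]) + 1e-6
    for idx in rng.choice(len(buffer), size=20, p=priorities / priorities.sum()):
        s_r, a_r, r_r, sn_r, _ = buffer[idx]
        Q[s_r, a_r] += alpha * (r_r + gamma * Q[sn_r].max() - Q[s_r, a_r])

Swapping the priorities for a uniform vector (random replay) or for the stored rewards (reward-biased replay) gives the comparison models the abstract describes.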


2014 ◽  
Vol 26 (3) ◽  
pp. 447-458 ◽  
Author(s):  
Ernest Mas-Herrero ◽  
Josep Marco-Pallarés

In decision-making processes, the relevance of the information yielded by outcomes varies across time and situations. It increases when previous predictions are not accurate and in contexts with high environmental uncertainty. Previous fMRI studies have shown an important role of medial pFC in coding both reward prediction errors and the impact of this information on future decisions. However, it is unclear whether these two processes are dissociated in time or occur simultaneously, suggesting that a common mechanism is engaged. In the present work, we studied the modulation of two electrophysiological responses associated with outcome processing (the feedback-related negativity ERP and frontocentral theta oscillatory activity) by the reward prediction error and the learning rate. Twenty-six participants performed two learning tasks differing in the degree of predictability of the outcomes: a reversal learning task and a probabilistic learning task with multiple blocks of novel cue–outcome associations. We implemented a reinforcement learning model to obtain the single-trial reward prediction error and the learning rate for each participant and task. Our results indicated that midfrontal theta activity and feedback-related negativity increased linearly with the unsigned prediction error. In addition, variations of frontal theta oscillatory activity predicted the learning rate across tasks and participants. These results support the existence of a common brain mechanism for the computation of unsigned prediction error and learning rate.
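
One common way to obtain such single-trial quantities is a delta-rule learner with a Pearce-Hall-style associability term, in which the learning rate itself tracks recent unsigned prediction errors. The sketch below is a generic, assumed implementation of that idea, not the specific model the authors fitted.

import numpy as np

def delta_rule_with_adaptive_lr(outcomes, eta=0.3, alpha0=0.5):
    # outcomes: sequence of 0/1 feedback values for a single cue
    # eta: how quickly the learning rate tracks recent surprise (assumed value)
    v, alpha = 0.5, alpha0
    unsigned_pe, learning_rate = [], []
    for r in outcomes:
        pe = r - v                                  # signed reward prediction error
        v += alpha * pe                             # value update
        alpha = eta * abs(pe) + (1 - eta) * alpha   # learning rate rises after surprising outcomes
        unsigned_pe.append(abs(pe))
        learning_rate.append(alpha)
    return np.array(unsigned_pe), np.array(learning_rate)

# Example: a reversal at trial 30 produces a burst of large unsigned PEs and a higher learning rate
outcomes = np.r_[np.ones(30), np.zeros(30)]
upe, lr = delta_rule_with_adaptive_lr(outcomes)
print(upe[28:33], lr[28:33])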


2015 ◽  
Vol 114 (5) ◽  
pp. 2600-2615 ◽  
Author(s):  
Kei Oyama ◽  
Yukina Tateyama ◽  
István Hernádi ◽  
Philippe N. Tobler ◽  
Toshio Iijima ◽  
...  

To investigate how the striatum integrates sensory information with reward information for behavioral guidance, we recorded single-unit activity in the dorsal striatum of head-fixed rats participating in a probabilistic Pavlovian conditioning task with auditory conditioned stimuli (CSs) in which reward probability was fixed for each CS but parametrically varied across CSs. We found that the activity of many neurons was linearly correlated with the reward probability indicated by the CSs. The recorded neurons could be classified according to their firing patterns into functional subtypes coding reward probability in different forms such as stimulus value, reward expectation, and reward prediction error. These results suggest that several functional subgroups of dorsal striatal neurons represent different kinds of information formed through extensive prior exposure to CS-reward contingencies.
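
For concreteness, the quantities distinguished here (stimulus value, reward expectation, and reward prediction error) can all be read off a simple Rescorla-Wagner learner trained on probabilistic CS-reward pairings, as in the assumed sketch below; the reward probabilities, learning rate, and trial structure are illustrative rather than the task parameters used in the study.

import numpy as np

rng = np.random.default_rng(2)

reward_prob = {"CS_A": 0.25, "CS_B": 0.5, "CS_C": 0.75}   # fixed per CS, varied across CSs
V = {cs: 0.0 for cs in reward_prob}                        # learned stimulus values
alpha = 0.1

for trial in range(2000):
    cs = rng.choice(list(reward_prob))
    r = float(rng.random() < reward_prob[cs])
    rpe = r - V[cs]            # outcome-time reward prediction error
    V[cs] += alpha * rpe       # stimulus value converges toward the reward probability

print({cs: round(v, 2) for cs, v in V.items()})            # roughly 0.25, 0.5, 0.75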


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Bastien Blain ◽  
Robb B Rutledge

Subjective well-being or happiness is often associated with wealth. Recent studies suggest that momentary happiness is associated with reward prediction error, the difference between experienced and predicted reward, a key component of adaptive behaviour. We tested subjects in a reinforcement learning task in which reward size and probability were uncorrelated, allowing us to dissociate the contributions of reward and learning to happiness. Using computational modelling, we found convergent evidence across stable and volatile learning tasks that happiness, like behaviour, is sensitive to learning-relevant variables (i.e. probability prediction error). Unlike behaviour, happiness is not sensitive to learning-irrelevant variables (i.e. reward prediction error). Increasing volatility reduces how many past trials influence behaviour, but not happiness. Finally, depressive symptoms reduce happiness more in volatile than in stable environments. Our results suggest that how we learn about our world may be more important for how we feel than the rewards we actually receive.
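
Momentary happiness models in this line of work are typically linear in exponentially decaying histories of task variables. A schematic form consistent with the abstract, with the exact set of regressors treated as an assumption, is

\text{Happiness}(t) = w_0 + w_1 \sum_{j=1}^{t} \gamma^{\,t-j}\,\mathrm{EV}_j + w_2 \sum_{j=1}^{t} \gamma^{\,t-j}\,\mathrm{RPE}_j + w_3 \sum_{j=1}^{t} \gamma^{\,t-j}\,\mathrm{PPE}_j,

where EV_j is the expected reward on trial j, RPE_j the reward prediction error, PPE_j the probability prediction error, and 0 \le \gamma \le 1 a forgetting factor. The abstract's central finding corresponds to w_3 being reliably nonzero while w_2 is not.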


2021 ◽  
Author(s):  
Anthony M.V. Jakob ◽  
John G Mikhael ◽  
Allison E Hamilos ◽  
John A Assad ◽  
Samuel J Gershman

The role of dopamine as a reward prediction error signal in reinforcement learning tasks has been well established over the past decades. Recent work has shown that the reward prediction error interpretation can also account for the effects of dopamine on interval timing by controlling the speed of subjective time. According to this theory, the timing of the dopamine signal relative to reward delivery dictates whether subjective time speeds up or slows down: early dopamine signals speed up subjective time and late signals slow it down. To test this bidirectional prediction, we reanalyzed measurements of dopaminergic neurons in the substantia nigra pars compacta of mice performing a self-timed movement task. Using the slope of ramping dopamine activity as a read-out of subjective time speed, we found that trial-by-trial changes in the slope could be predicted from the timing of dopamine activity on the previous trial. This result provides a key piece of evidence supporting a unified computational theory of reinforcement learning and interval timing.
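
The bidirectional prediction can be written as a very small update rule: a dopamine signal arriving before the expected reward time speeds the internal clock, a signal arriving after it slows the clock, and the clock speed in turn sets the slope of the ramp on the next trial. The functional form and constants below are assumptions chosen only to make the logic concrete, not the authors' analysis.

def update_clock_speed(clock_speed, da_signal_time, expected_reward_time, eta=0.05):
    # Toy bidirectional rule: early dopamine signals speed subjective time, late ones slow it
    timing_error = expected_reward_time - da_signal_time   # positive if the signal is early
    return clock_speed * (1.0 + eta * timing_error)

# The ramp slope on the following trial is read out as proportional to clock speed (assumption)
speed = 1.0
for da_time in [0.8, 0.8, 1.3, 1.3]:          # seconds; expected reward at 1.0 s
    speed = update_clock_speed(speed, da_time, expected_reward_time=1.0)
    print(f"predicted ramp slope (arbitrary units): {speed:.3f}")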


Author(s):  
Riley Simmons-Edler ◽  
Ben Eisner ◽  
Daniel Yang ◽  
Anthony Bisulco ◽  
Eric Mitchell ◽  
...  

A major challenge in reinforcement learning is exploration, when local dithering methods such as epsilon-greedy sampling are insufficient to solve a given task. Many recent methods have proposed to intrinsically motivate an agent to seek novel states, driving the agent to discover improved reward. However, while state-novelty exploration methods are suitable for tasks where novel observations correlate well with improved reward, they may not explore more efficiently than epsilon-greedy approaches in environments where the two are not well-correlated. In this paper, we distinguish between exploration tasks in which seeking novel states aids in finding new reward, and those where it does not, such as goal-conditioned tasks and escaping local reward maxima. We propose a new exploration objective, maximizing the reward prediction error (RPE) of a value function trained to predict extrinsic reward. We then propose a deep reinforcement learning method, QXplore, which exploits the temporal difference error of a Q-function to solve hard exploration tasks in high-dimensional MDPs. We demonstrate the exploration behavior of QXplore on several OpenAI Gym MuJoCo tasks and Atari games and observe that QXplore is comparable to or better than a baseline state-novelty method in all cases, outperforming the baseline on tasks where state novelty is not well-correlated with improved reward.
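
The exploration objective described here amounts to giving an exploration policy, as its own reward, the unsigned temporal-difference error of a value function trained on extrinsic reward. A minimal tabular sketch of that signal follows; the actual QXplore method is a deep reinforcement learning algorithm with separate exploration and exploitation policies operating on high-dimensional observations, so the tabular Q-table and parameter values here are simplifying assumptions.

import numpy as np

def rpe_exploration_bonus(Q, s, a, r, s_next, gamma=0.99):
    # Unsigned TD error of the extrinsic-reward Q-function, used as an intrinsic exploration reward
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    return abs(td_error)

# Usage sketch: the exploration policy is trained to maximise this bonus,
# while the exploitation policy is trained on extrinsic reward as usual.
Q = np.zeros((4, 2))
bonus = rpe_exploration_bonus(Q, s=0, a=1, r=1.0, s_next=2)
print(bonus)   # 1.0: a surprising reward yields a large exploration bonus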


2016 ◽  
Vol 22 (3) ◽  
pp. 303-313 ◽  
Author(s):  
Katherine Osborne-Crowley ◽  
Skye McDonald ◽  
Jacqueline A. Rushby

Abstract: Objectives: The current study aimed to determine whether reversal learning impairments and feedback-related negativity (FRN), reflecting reward prediction error signals generated by negative feedback during the reversal learning tasks, were associated with social disinhibition in a group of participants with traumatic brain injury (TBI). Methods: The number of reversal errors on a social and a non-social reversal learning task and the FRN were examined for 21 participants with TBI and 21 control participants matched for age. Participants with TBI were also divided into low and high disinhibition groups based on rated videotaped interviews. Results: Participants with TBI made more reversal errors and produced smaller-amplitude FRNs than controls. Furthermore, participants with TBI high on social disinhibition made more reversal errors on the social reversal learning task than did those low on social disinhibition. FRN amplitude was not related to disinhibition. Conclusions: These results suggest that impairment in the ability to update behavior when social reinforcement contingencies change plays a role in social disinhibition after TBI. Furthermore, the social reversal learning task used in this study may be a useful neuropsychological tool for detecting susceptibility to acquired social disinhibition following TBI. Finally, that FRN amplitude was not associated with social disinhibition suggests that reward prediction error signals are not critical for behavioral adaptation in the social domain. (JINS, 2016, 21, 303–313)

