Differential reinforcement encoding along the hippocampal long axis helps resolve the explore/exploit dilemma

2020 ◽  
Author(s):  
Alexandre Y. Dombrovski ◽  
Beatriz Luna ◽  
Michael N. Hallquist

Abstract: When making decisions, should one exploit known good options or explore potentially better alternatives? Exploration of spatially unstructured options depends on the neocortex, striatum, and amygdala. In natural environments, however, better options often cluster together, forming structured value distributions. The hippocampus binds reward information into allocentric cognitive maps to support navigation and foraging in such spaces. Using a reinforcement learning task with a spatially structured reward function, we show that human posterior hippocampus (PH) invigorates exploration while anterior hippocampus (AH) supports the transition to exploitation. These dynamics depend on differential reinforcement representations in the PH and AH. Whereas local reward prediction error signals are early and phasic in the PH tail, global value maximum signals are delayed and sustained in the AH body. AH compresses reinforcement information across episodes, updating the location and prominence of the value maximum and displaying goal cell-like ramping activity when navigating toward it.
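
For readers who want a concrete feel for the kind of model such a task invites, here is a minimal sketch in Python: a value map built from Gaussian basis functions over a circular space, updated by local reward prediction errors, with a softmax choice rule whose temperature is annealed to move from exploration toward exploitation. The basis-function representation, the reward landscape, the annealing schedule, and all parameter values are illustrative assumptions, not the authors' actual model or task.

import numpy as np

rng = np.random.default_rng(0)

n_locations = 100                        # discretised circular space
n_basis = 12                             # Gaussian basis functions tiling the space
centers = np.linspace(0, n_locations, n_basis, endpoint=False)
width = n_locations / n_basis

def basis(loc):
    # Gaussian basis activations for a location on a circular track (assumed representation)
    d = np.abs(loc - centers)
    d = np.minimum(d, n_locations - d)   # circular distance
    return np.exp(-0.5 * (d / width) ** 2)

def true_reward(loc):
    # Toy spatially structured reward: better options cluster around location 70 (assumption)
    d = min(abs(loc - 70), n_locations - abs(loc - 70))
    return float(rng.random() < 0.2 + 0.6 * np.exp(-d ** 2 / 200))

weights = np.zeros(n_basis)              # parameters of the value map
alpha = 0.1                              # learning rate (assumed)

for trial in range(300):
    values = np.array([weights @ basis(l) for l in range(n_locations)])
    # softmax temperature anneals: high early (explore), low late (exploit)
    temperature = max(0.05, np.exp(-trial / 100))
    p = np.exp((values - values.max()) / temperature)
    p /= p.sum()
    loc = rng.choice(n_locations, p=p)
    reward = true_reward(loc)
    rpe = reward - weights @ basis(loc)  # local reward prediction error
    weights += alpha * rpe * basis(loc)  # credit spreads to nearby locations via the bases
    value_maximum = int(np.argmax(values))

print("learned value peak near location", value_maximum)

In this toy version, the "global value maximum" the abstract refers to corresponds to the argmax of the learned value map, which the agent samples increasingly often as the temperature falls.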


2019 ◽  
Author(s):  
Emma L. Roscow ◽  
Matthew W. Jones ◽  
Nathan F. Lepora

Abstract: Neural activity encoding recent experiences is replayed during sleep and rest to promote consolidation of the corresponding memories. However, precisely which features of experience influence replay prioritisation to optimise adaptive behaviour remains unclear. Here, we trained adult male rats on a novel maze-based reinforcement learning task designed to dissociate reward outcomes from reward-prediction errors. Four variations of a reinforcement learning model were fitted to the rats’ behaviour over multiple days. Behaviour was best predicted by a model incorporating replay biased by reward-prediction error, compared to the same model with no replay; random replay or reward-biased replay produced poorer predictions of behaviour. This insight disentangles the influences of salience on replay, suggesting that reinforcement learning is tuned by post-learning replay biased by reward-prediction error, not by reward per se. This work therefore provides a behavioural and theoretical toolkit with which to measure and interpret replay in striatal, hippocampal and neocortical circuits.
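
As a minimal sketch of the central idea, the snippet below grafts replay prioritised by the magnitude of the reward-prediction error onto a tabular Q-learning agent: transitions are stored with their unsigned prediction errors and replayed offline with probability proportional to that error. The toy environment, the prioritisation rule, and all parameter values are assumptions for illustration; the authors' models were fitted to rat behaviour on their maze task rather than implemented this way.

import numpy as np

rng = np.random.default_rng(1)

n_states, n_actions = 6, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.2, 0.9
buffer = []                                   # (state, action, reward, next_state, |RPE|)

def step(s, a):
    # Toy transition/reward function (assumed): action 1 in the last state pays off probabilistically
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = float(rng.random() < 0.8) if (s == n_states - 1 and a == 1) else 0.0
    return r, s_next

for episode in range(200):
    s = 0
    for t in range(10):
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        r, s_next = step(s, a)
        rpe = r + gamma * Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * rpe
        buffer.append((s, a, r, s_next, abs(rpe)))
        s = s_next

    # Offline replay: sample stored transitions with probability proportional to |RPE|
    priorities = np.array([b[4] for b in buffer]) + 1e-6
    for idx in rng.choice(len(buffer), size=20, p=priorities / priorities.sum()):
        s_r, a_r, r_r, sn_r, _ = buffer[idx]
        Q[s_r, a_r] += alpha * (r_r + gamma * Q[sn_r].max() - Q[s_r, a_r])

Swapping the priorities for a uniform vector (random replay) or for the stored rewards (reward-biased replay) gives the comparison models the abstract describes.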


2014 ◽  
Vol 26 (3) ◽  
pp. 447-458 ◽  
Author(s):  
Ernest Mas-Herrero ◽  
Josep Marco-Pallarés

In decision-making processes, the relevance of the information yielded by outcomes varies across time and situations. It increases when previous predictions are not accurate and in contexts with high environmental uncertainty. Previous fMRI studies have shown an important role of medial pFC in coding both reward prediction errors and the impact of this information on future decisions. However, it is unclear whether these two processes are dissociated in time or occur simultaneously, suggesting that a common mechanism is engaged. In the present work, we studied the modulation of two electrophysiological responses associated with outcome processing (the feedback-related negativity ERP and frontocentral theta oscillatory activity) by the reward prediction error and the learning rate. Twenty-six participants performed two learning tasks differing in the degree of predictability of the outcomes: a reversal learning task and a probabilistic learning task with multiple blocks of novel cue–outcome associations. We implemented a reinforcement learning model to obtain the single-trial reward prediction error and the learning rate for each participant and task. Our results indicated that midfrontal theta activity and feedback-related negativity increased linearly with the unsigned prediction error. In addition, variations of frontal theta oscillatory activity predicted the learning rate across tasks and participants. These results support the existence of a common brain mechanism for the computation of unsigned prediction error and learning rate.
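
One common way to obtain such single-trial quantities is a delta-rule learner with a Pearce-Hall-style associability term, in which the learning rate itself tracks recent unsigned prediction errors. The sketch below is a generic, assumed implementation of that idea, not the specific model the authors fitted.

import numpy as np

def delta_rule_with_adaptive_lr(outcomes, eta=0.3, alpha0=0.5):
    # outcomes: sequence of 0/1 feedback values for a single cue
    # eta: how quickly the learning rate tracks recent surprise (assumed value)
    v, alpha = 0.5, alpha0
    unsigned_pe, learning_rate = [], []
    for r in outcomes:
        pe = r - v                                  # signed reward prediction error
        v += alpha * pe                             # value update
        alpha = eta * abs(pe) + (1 - eta) * alpha   # learning rate rises after surprising outcomes
        unsigned_pe.append(abs(pe))
        learning_rate.append(alpha)
    return np.array(unsigned_pe), np.array(learning_rate)

# Example: a reversal at trial 30 produces a burst of large unsigned PEs and a higher learning rate
outcomes = np.r_[np.ones(30), np.zeros(30)]
upe, lr = delta_rule_with_adaptive_lr(outcomes)
print(upe[28:33], lr[28:33])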


2015 ◽  
Vol 114 (5) ◽  
pp. 2600-2615 ◽  
Author(s):  
Kei Oyama ◽  
Yukina Tateyama ◽  
István Hernádi ◽  
Philippe N. Tobler ◽  
Toshio Iijima ◽  
...  

To investigate how the striatum integrates sensory information with reward information for behavioral guidance, we recorded single-unit activity in the dorsal striatum of head-fixed rats participating in a probabilistic Pavlovian conditioning task with auditory conditioned stimuli (CSs) in which reward probability was fixed for each CS but parametrically varied across CSs. We found that the activity of many neurons was linearly correlated with the reward probability indicated by the CSs. The recorded neurons could be classified according to their firing patterns into functional subtypes coding reward probability in different forms such as stimulus value, reward expectation, and reward prediction error. These results suggest that several functional subgroups of dorsal striatal neurons represent different kinds of information formed through extensive prior exposure to CS-reward contingencies.
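
For concreteness, the quantities distinguished here (stimulus value, reward expectation, and reward prediction error) can all be read off a simple Rescorla-Wagner learner trained on probabilistic CS-reward pairings, as in the assumed sketch below; the reward probabilities, learning rate, and trial structure are illustrative rather than the task parameters used in the study.

import numpy as np

rng = np.random.default_rng(2)

reward_prob = {"CS_A": 0.25, "CS_B": 0.5, "CS_C": 0.75}   # fixed per CS, varied across CSs
V = {cs: 0.0 for cs in reward_prob}                        # learned stimulus values
alpha = 0.1

for trial in range(2000):
    cs = rng.choice(list(reward_prob))
    r = float(rng.random() < reward_prob[cs])
    rpe = r - V[cs]            # outcome-time reward prediction error
    V[cs] += alpha * rpe       # stimulus value converges toward the reward probability

print({cs: round(v, 2) for cs, v in V.items()})            # roughly 0.25, 0.5, 0.75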


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Bastien Blain ◽  
Robb B Rutledge

Subjective well-being or happiness is often associated with wealth. Recent studies suggest that momentary happiness is associated with reward prediction error, the difference between experienced and predicted reward, a key component of adaptive behaviour. We tested subjects in a reinforcement learning task in which reward size and probability were uncorrelated, allowing us to dissociate the contributions of reward and learning to happiness. Using computational modelling, we found convergent evidence across stable and volatile learning tasks that happiness, like behaviour, is sensitive to learning-relevant variables (i.e. probability prediction error). Unlike behaviour, happiness is not sensitive to learning-irrelevant variables (i.e. reward prediction error). Increasing volatility reduces how many past trials influence behaviour, but not happiness. Finally, depressive symptoms reduce happiness more in volatile than in stable environments. Our results suggest that how we learn about our world may be more important for how we feel than the rewards we actually receive.
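
Momentary happiness models in this line of work are typically linear in exponentially decaying histories of task variables. A schematic form consistent with the abstract, with the exact set of regressors treated as an assumption, is

\text{Happiness}(t) = w_0 + w_1 \sum_{j=1}^{t} \gamma^{\,t-j}\,\mathrm{EV}_j + w_2 \sum_{j=1}^{t} \gamma^{\,t-j}\,\mathrm{RPE}_j + w_3 \sum_{j=1}^{t} \gamma^{\,t-j}\,\mathrm{PPE}_j,

where EV_j is the expected reward on trial j, RPE_j the reward prediction error, PPE_j the probability prediction error, and 0 \le \gamma \le 1 a forgetting factor. The abstract's central finding corresponds to w_3 being reliably nonzero while w_2 is not.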


2021 ◽  
Author(s):  
Anthony M.V. Jakob ◽  
John G Mikhael ◽  
Allison E Hamilos ◽  
John A Assad ◽  
Samuel J Gershman

The role of dopamine as a reward prediction error signal in reinforcement learning tasks has been well established over the past decades. Recent work has shown that the reward prediction error interpretation can also account for the effects of dopamine on interval timing by controlling the speed of subjective time. According to this theory, the timing of the dopamine signal relative to reward delivery dictates whether subjective time speeds up or slows down: early dopamine signals speed up subjective time and late signals slow it down. To test this bidirectional prediction, we reanalyzed measurements of dopaminergic neurons in the substantia nigra pars compacta of mice performing a self-timed movement task. Using the slope of ramping dopamine activity as a read-out of subjective time speed, we found that trial-by-trial changes in the slope could be predicted from the timing of dopamine activity on the previous trial. This result provides a key piece of evidence supporting a unified computational theory of reinforcement learning and interval timing.
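
The bidirectional prediction can be written as a very small update rule: a dopamine signal arriving before the expected reward time speeds the internal clock, a signal arriving after it slows the clock, and the clock speed in turn sets the slope of the ramp on the next trial. The functional form and constants below are assumptions chosen only to make the logic concrete, not the authors' analysis.

def update_clock_speed(clock_speed, da_signal_time, expected_reward_time, eta=0.05):
    # Toy bidirectional rule: early dopamine signals speed subjective time, late ones slow it
    timing_error = expected_reward_time - da_signal_time   # positive if the signal is early
    return clock_speed * (1.0 + eta * timing_error)

# The ramp slope on the following trial is read out as proportional to clock speed (assumption)
speed = 1.0
for da_time in [0.8, 0.8, 1.3, 1.3]:          # seconds; expected reward at 1.0 s
    speed = update_clock_speed(speed, da_time, expected_reward_time=1.0)
    print(f"predicted ramp slope (arbitrary units): {speed:.3f}")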


Author(s):  
Riley Simmons-Edler ◽  
Ben Eisner ◽  
Daniel Yang ◽  
Anthony Bisulco ◽  
Eric Mitchell ◽  
...  

A major challenge in reinforcement learning is exploration, when local dithering methods such as epsilon-greedy sampling are insufficient to solve a given task. Many recent methods have proposed to intrinsically motivate an agent to seek novel states, driving the agent to discover improved reward. However, while state-novelty exploration methods are suitable for tasks where novel observations correlate well with improved reward, they may not explore more efficiently than epsilon-greedy approaches in environments where the two are not well-correlated. In this paper, we distinguish between exploration tasks in which seeking novel states aids in finding new reward, and those where it does not, such as goal-conditioned tasks and escaping local reward maxima. We propose a new exploration objective, maximizing the reward prediction error (RPE) of a value function trained to predict extrinsic reward. We then propose a deep reinforcement learning method, QXplore, which exploits the temporal difference error of a Q-function to solve hard exploration tasks in high-dimensional MDPs. We demonstrate the exploration behavior of QXplore on several OpenAI Gym MuJoCo tasks and Atari games and observe that QXplore is comparable to or better than a baseline state-novelty method in all cases, outperforming the baseline on tasks where state novelty is not well-correlated with improved reward.
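
The exploration objective described here amounts to giving an exploration policy, as its own reward, the unsigned temporal-difference error of a value function trained on extrinsic reward. A minimal tabular sketch of that signal follows; the actual QXplore method is a deep reinforcement learning algorithm with separate exploration and exploitation policies operating on high-dimensional observations, so the tabular Q-table and parameter values here are simplifying assumptions.

import numpy as np

def rpe_exploration_bonus(Q, s, a, r, s_next, gamma=0.99):
    # Unsigned TD error of the extrinsic-reward Q-function, used as an intrinsic exploration reward
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    return abs(td_error)

# Usage sketch: the exploration policy is trained to maximise this bonus,
# while the exploitation policy is trained on extrinsic reward as usual.
Q = np.zeros((4, 2))
bonus = rpe_exploration_bonus(Q, s=0, a=1, r=1.0, s_next=2)
print(bonus)   # 1.0: a surprising reward yields a large exploration bonus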


2016 ◽  
Vol 22 (3) ◽  
pp. 303-313 ◽  
Author(s):  
Katherine Osborne-Crowley ◽  
Skye McDonald ◽  
Jacqueline A. Rushby

Abstract: Objectives: The current study aimed to determine whether reversal learning impairments and feedback-related negativity (FRN), reflecting reward prediction error signals generated by negative feedback during the reversal learning tasks, were associated with social disinhibition in a group of participants with traumatic brain injury (TBI). Methods: The number of reversal errors on a social and a non-social reversal learning task and the FRN were examined for 21 participants with TBI and 21 control participants matched for age. Participants with TBI were also divided into low and high disinhibition groups based on rated videotaped interviews. Results: Participants with TBI made more reversal errors and produced smaller-amplitude FRNs than controls. Furthermore, participants with TBI high on social disinhibition made more reversal errors on the social reversal learning task than did those low on social disinhibition. FRN amplitude was not related to disinhibition. Conclusions: These results suggest that impairment in the ability to update behavior when social reinforcement contingencies change plays a role in social disinhibition after TBI. Furthermore, the social reversal learning task used in this study may be a useful neuropsychological tool for detecting susceptibility to acquired social disinhibition following TBI. Finally, that FRN amplitude was not associated with social disinhibition suggests that reward prediction error signals are not critical for behavioral adaptation in the social domain. (JINS, 2016, 21, 303–313)

