A confidence-based reinforcement learning model for perceptual learning
Abstract

It is well established that learning can occur without external feedback, yet normative reinforcement learning theories have difficulty explaining such instances of learning. Recently, we reported on a confidence-based reinforcement learning model for the model case of perceptual learning (Guggenmos, Wilbertz, Hebart, & Sterzer, 2016), according to which the brain capitalizes on internal monitoring processes when no external feedback is available. In the model, internal confidence prediction errors – the difference between current confidence and expected confidence – serve as teaching signals to guide learning. In the present paper, we explore an extension of this idea: the neural information processing pathways activated for a given sensory stimulus are subject to fluctuations, where some pathway configurations lead to higher confidence than others. Confidence prediction errors strengthen pathway configurations for which fluctuations lead to above-average confidence and weaken those that are associated with below-average confidence. We show through simulation that the model is capable of self-reinforced perceptual learning and can benefit from exploratory network fluctuations. In addition, by simulating different model parameters, we show that the ideal confidence-based learner should (i) exhibit high degrees of network fluctuation in the initial phase of learning, but reduced fluctuations as learning progresses, (ii) have a high learning rate for network updates for immediate performance gains, but a low learning rate for long-term maximum performance, and (iii) be neither too under- nor too overconfident. In sum, we present a model in which confidence prediction errors strengthen favorable network fluctuations and enable learning in the absence of external feedback. The model can be readily applied to data from real-world perceptual tasks in which observers provided both choice and confidence reports.
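The core learning rule described above can be illustrated with a minimal toy simulation. The sketch below is not the authors' implementation; pathway names, confidence levels, and all parameter values are hypothetical. It shows two competing "pathway configurations" that yield different mean internal confidence, with a confidence prediction error (current confidence minus expected confidence) acting as the teaching signal that strengthens the pathway whose activation leads to above-average confidence.

```python
import math
import random

random.seed(1)

# Illustrative sketch only: two hypothetical pathway configurations, A and B,
# produce different mean internal confidence on each trial. The confidence
# prediction error (confidence minus expected confidence) reinforces the
# pathway whose fluctuations yield above-average confidence and weakens the
# other, without any external feedback.
MEAN_CONF = {"A": 0.8, "B": 0.5}   # hypothetical mean confidence per pathway
strength = {"A": 0.0, "B": 0.0}    # learned pathway strengths
expected_conf = 0.5                # running estimate of expected confidence
ALPHA = 0.1                        # learning rate for pathway strengths
LR_CONF = 0.05                     # learning rate for expected confidence
NOISE_SD = 0.1                     # trial-to-trial confidence fluctuation
TEMP = 0.1                         # softmax temperature for pathway selection


def choose_pathway():
    """Softmax selection between the two pathway configurations."""
    za = math.exp(strength["A"] / TEMP)
    zb = math.exp(strength["B"] / TEMP)
    return "A" if random.random() < za / (za + zb) else "B"


for _ in range(1000):
    pathway = choose_pathway()
    confidence = MEAN_CONF[pathway] + random.gauss(0.0, NOISE_SD)
    delta = confidence - expected_conf   # confidence prediction error
    strength[pathway] += ALPHA * delta   # strengthen or weaken this pathway
    expected_conf += LR_CONF * delta     # update confidence expectation

print(strength["A"] > strength["B"])   # higher-confidence pathway dominates
```

In this toy setting, the expected-confidence estimate converges toward the confidence actually obtained, so the prediction error vanishes once the high-confidence pathway dominates; this mirrors the self-reinforcing dynamic described in the abstract, in which favorable fluctuations are progressively stabilized.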