From internal models toward metacognitive AI

Author(s):  
Mitsuo Kawato ◽  
Aurelio Cortese

Abstract In several papers published in Biological Cybernetics in the 1980s and 1990s, Kawato and colleagues proposed computational models explaining how internal models are acquired in the cerebellum. These models were later supported by neurophysiological experiments using monkeys and neuroimaging experiments involving humans. These early studies influenced neuroscience from basic, sensory-motor control to higher cognitive functions. One of the most perplexing enigmas related to internal models is understanding the neural mechanisms that enable animals to learn high-dimensional problems from so few trials. Consciousness and metacognition, the ability to monitor one's own thoughts, may be part of the solution to this enigma. Based on literature reviews of the past 20 years, here we propose a computational neuroscience model of metacognition. The model comprises a modular hierarchical reinforcement-learning architecture of parallel and layered, generative-inverse model pairs. In the prefrontal cortex, a distributed executive network called the "cognitive reality monitoring network" (CRMN) orchestrates conscious involvement of generative-inverse model pairs in perception and action. Based on mismatches between computations by generative and inverse models, as well as reward prediction errors, the CRMN computes a "responsibility signal" that gates selection and learning of pairs in perception, action, and reinforcement learning. A high responsibility signal is given to the pairs that best capture the external world, that are competent in movements (small mismatch), and that are capable of reinforcement learning (small reward prediction error). The CRMN selects pairs with higher responsibility signals as objects of metacognition, and consciousness is determined by the entropy of responsibility signals across all pairs. This model could lead to new-generation AI that exhibits metacognition, consciousness, dimension reduction, selection of modules and corresponding representations, and learning from small samples. It may also lead to the development of a new scientific paradigm that enables the causal study of consciousness by combining the CRMN and decoded neurofeedback.
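
The responsibility computation described above can be illustrated with a small numerical sketch. The Python snippet below is a minimal illustration of the idea, not the authors' implementation: the function names, the precision parameter beta, and the way model mismatch and reward prediction error are combined are assumptions made for the example.

```python
import numpy as np

def responsibility_signals(model_mismatch, reward_pred_error, beta=1.0):
    """Softmax over (negative) errors: pairs with small generative-inverse mismatch
    and small reward prediction error receive high responsibility."""
    cost = np.asarray(model_mismatch) ** 2 + np.asarray(reward_pred_error) ** 2
    logits = -beta * cost
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

def responsibility_entropy(lam):
    """Entropy of the responsibility distribution across all pairs; in the CRMN
    proposal this quantity is tied to the degree of conscious involvement."""
    lam = np.asarray(lam)
    return float(-np.sum(lam * np.log(lam + 1e-12)))

# Example: three generative-inverse pairs competing to account for the current context.
lam = responsibility_signals(model_mismatch=[0.1, 0.8, 1.2],
                             reward_pred_error=[0.05, 0.4, 0.9])
print(np.round(lam, 3), round(responsibility_entropy(lam), 3))
```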

2019 ◽  
Author(s):  
David J. Ottenheimer ◽  
Bilal A. Bari ◽  
Elissa Sutlief ◽  
Kurt M. Fraser ◽  
Tabitha H. Kim ◽  
...  

Abstract Learning from past interactions with the environment is critical for adaptive behavior. Within the framework of reinforcement learning, the nervous system builds expectations about future reward by computing reward prediction errors (RPEs), the difference between actual and predicted rewards. Correlates of RPEs have been observed in the midbrain dopamine system, which is thought to compute this important variable locally in the service of learning. However, the extent to which RPE signals may be computed upstream of the dopamine system is largely unknown. Here, we quantify history-based RPE signals in the ventral pallidum (VP), an input region to the midbrain dopamine system implicated in reward-seeking behavior. We trained rats to associate cues with future delivery of reward and fit computational models to predict individual neuron firing rates at the time of reward delivery. We found that a subset of VP neurons encoded RPEs, and did so more robustly than the nucleus accumbens, an input to VP. VP RPEs predicted trial-by-trial task engagement, and optogenetic inhibition of VP reduced subsequent task-related reward seeking. Consistent with reinforcement learning, activity of VP RPE cells adapted when rewards were delivered in blocks. We further found that history- and cue-based RPEs were largely separate across the VP neural population. The presence of behaviorally instructive RPE signals in the VP suggests a pivotal role for this region in value-based computations.
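
As a rough illustration of what a history-based RPE regressor of this kind looks like, the sketch below (not the authors' fitting code) forms a reward expectation as a recency-weighted average of past rewards and defines the RPE as the difference between delivered and expected reward; the learning rate alpha and the linear fit to a simulated firing rate are assumptions.

```python
import numpy as np

def history_rpe(rewards, alpha=0.2):
    """Per-trial RPEs under a simple delta-rule expectation built from reward history."""
    v, rpes = 0.0, []
    for r in rewards:
        rpe = r - v              # prediction error at reward delivery
        rpes.append(rpe)
        v += alpha * rpe         # update the expectation from reward history
    return np.array(rpes)

rng = np.random.default_rng(4)
rewards = rng.binomial(1, 0.6, size=200).astype(float)
rpes = history_rpe(rewards)

# Hypothetical test of RPE coding: regress a simulated neuron's firing rate on the RPE.
firing = 5.0 + 2.0 * rpes + rng.standard_normal(200)
slope, intercept = np.polyfit(rpes, firing, 1)
print(f"fitted RPE coefficient: {slope:.2f}")
```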


2020 ◽  
Author(s):  
Clay B. Holroyd ◽  
Tom Verguts

Despite continual debate over the past thirty years about the function of anterior cingulate cortex (ACC), its key contribution to neurocognition remains unknown. Here we review computational models that illustrate three core principles of ACC function (related to hierarchy, world models, and cost), as well as four constraints on the neural implementation of these principles (related to modularity, binding, encoding, and learning and regulation). These observations suggest a role for ACC in model-based hierarchical reinforcement learning, which instantiates a mechanism for motivating the execution of high-level plans.
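
A toy sketch of the hierarchical, model-based idea summarized above is given below; it is not a model from the review. A high-level controller chooses between extended plans (options) according to their predicted payoff minus an effort cost and updates plan-level values from the outcome of executing the whole plan. The option names, costs, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
payoff = {"forage_patch_A": 1.0, "forage_patch_B": 0.4}   # true mean payoffs (hypothetical)
cost = {"forage_patch_A": 0.3, "forage_patch_B": 0.1}     # effort cost of executing each plan
value = {o: 0.0 for o in payoff}                          # learned plan-level values
alpha = 0.1
names = list(payoff)

for episode in range(500):
    # High-level choice: softmax over net value (predicted payoff minus cost).
    net = np.array([value[o] - cost[o] for o in names])
    p = np.exp(net) / np.exp(net).sum()
    chosen = rng.choice(names, p=p)
    # Executing the entire plan yields a noisy outcome; update the plan's value.
    outcome = payoff[chosen] + 0.2 * rng.standard_normal()
    value[chosen] += alpha * (outcome - value[chosen])

print({o: round(v, 2) for o, v in value.items()})
```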


2019 ◽  
Author(s):  
Melissa J. Sharpe ◽  
Hannah M. Batchelor ◽  
Lauren E. Mueller ◽  
Chun Yun Chang ◽  
Etienne J.P. Maes ◽  
...  

Abstract Dopamine neurons fire transiently in response to unexpected rewards. These neural correlates are proposed to signal the reward prediction error described in model-free reinforcement learning algorithms. This error term represents the unpredicted or 'excess' value of the rewarding event. In model-free reinforcement learning, this value is then stored as part of the learned value of any antecedent cues, contexts, or events, making them intrinsically valuable, independent of the specific rewarding event that caused the prediction error. In support of equivalence between dopamine transients and this model-free error term, proponents cite causal optogenetic studies showing that artificially induced dopamine transients cause lasting changes in behavior. Yet none of these studies directly demonstrates the presence of cached value under conditions appropriate for associative learning. To address this gap in our knowledge, we conducted three studies in which we optogenetically activated dopamine neurons while rats were learning associative relationships, both with and without reward. In each experiment, the antecedent cues failed to acquire value and instead entered into value-independent associative relationships with the other cues or rewards. These results show that dopamine transients, constrained within appropriate learning situations, support valueless associative learning.
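
For readers unfamiliar with the model-free account under test, the sketch below illustrates what 'cached value' means: a temporal-difference update writes the prediction error into the value of the antecedent cue itself, independently of the specific reward. It is an assumption-level illustration of the hypothesis being tested, not the paper's result or code; all parameter values are arbitrary.

```python
import numpy as np

alpha, gamma = 0.2, 0.95
V = np.zeros(2)                      # cached scalar values of two serial cues

def td_update(cue, next_cue, reward):
    """Model-free caching: the cue inherits value via the prediction error,
    independently of which specific reward caused that error."""
    target = reward + (gamma * V[next_cue] if next_cue is not None else 0.0)
    delta = target - V[cue]          # reward prediction error
    V[cue] += alpha * delta
    return delta

# Cue 0 precedes cue 1, which precedes reward; with training, value propagates
# back so that both cues end up with cached value under this account.
for _ in range(100):
    td_update(1, None, reward=1.0)   # cue 1 -> reward
    td_update(0, 1, reward=0.0)      # cue 0 -> cue 1
print(np.round(V, 2))
```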


2021 ◽  
Author(s):  
Virginie Patt ◽  
Daniela Palombo ◽  
Michael Esterman ◽  
Mieke Verfaellie

Simple probabilistic reinforcement learning is recognized as a striatum-based learning system but, in recent years, has also been associated with hippocampal involvement. The present study examined whether such involvement may be attributed to observation-based learning processes running in parallel with striatum-based reinforcement learning. A computational model of observation-based learning (OL), mirroring classic models of reinforcement-based learning (RL), was constructed and applied to the neuroimaging dataset of Palombo, Hayes, Reid, and Verfaellie (2019; Hippocampal contributions to value-based learning: Converging evidence from fMRI and amnesia. Cognitive, Affective & Behavioral Neuroscience, 19(3), 523–536). Results suggested that observation-based learning processes may indeed take place concomitantly with reinforcement learning and involve activation of the hippocampus and central orbitofrontal cortex (cOFC). However, rather than independent mechanisms running in parallel, the brain correlates of the OL and RL prediction errors indicated collaboration between systems, with direct implication of the hippocampus in computations of the discrepancy between the expected and actual reinforcing values of actions. These findings are consistent with previous accounts of a role for the hippocampus in encoding the strength of observed stimulus-outcome associations, with updating of such associations through striatal reinforcement-based computations. Additionally, enhanced negative prediction-error signaling was found in the anterior insula with greater use of OL over RL processes. This result may suggest an additional mode of collaboration between the OL and RL systems, implicating the error-monitoring network.
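
The contrast between the two learning processes can be sketched schematically as below. This is not the study's fitted model: the assumption that both stimulus-outcome pairings are observable on every trial, the random choice rule, and the learning rates are illustrative simplifications. The RL learner updates only the chosen option from its reward prediction error, whereas the OL learner updates stimulus-outcome association strengths directly from observation.

```python
import numpy as np

rng = np.random.default_rng(1)
p_reward = np.array([0.8, 0.2])      # two stimuli with fixed reward probabilities
V_rl = np.zeros(2)                   # action values updated by RL
W_ol = np.zeros(2)                   # stimulus-outcome strengths updated by OL
alpha_rl, alpha_ol = 0.1, 0.1

for t in range(300):
    choice = rng.integers(2)                         # random exploration for simplicity
    outcomes = (rng.random(2) < p_reward).astype(float)
    # RL: only the chosen option is updated, from its reward prediction error.
    rpe = outcomes[choice] - V_rl[choice]
    V_rl[choice] += alpha_rl * rpe
    # OL: observed stimulus-outcome pairings are updated regardless of choice
    # (illustrative assumption that both outcomes are observable).
    W_ol += alpha_ol * (outcomes - W_ol)

print("RL values:", np.round(V_rl, 2), "OL associations:", np.round(W_ol, 2))
```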


2014 ◽  
Vol 26 (3) ◽  
pp. 635-644 ◽  
Author(s):  
Olav E. Krigolson ◽  
Cameron D. Hassall ◽  
Todd C. Handy

Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors: discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833–1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129–141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769–776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679–709, 2002]. Here, we used the event-related brain potential (ERP) technique to demonstrate not only that rewards elicit a neural response akin to a prediction error, but also that with learning this signal rapidly diminishes and propagates back to the time of choice presentation. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component with a timing and topography similar to those of the feedback error-related negativity, which increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward prediction errors and the changes in amplitude of these prediction errors at the time of choice presentation and reward delivery. Our results provide further support that the computations underlying human learning and decision-making follow reinforcement learning principles.
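
The qualitative pattern described above, in which the prediction error at reward delivery shrinks while a value-related signal emerges at choice presentation, falls out of standard temporal-difference learning. The sketch below is an assumption-level illustration of that principle, not the authors' model; the learning rate and trial count are arbitrary.

```python
import numpy as np

alpha, n_trials = 0.15, 60
V_cue = 0.0                          # learned value at choice/cue presentation
delta_cue, delta_reward = [], []

for _ in range(n_trials):
    # Cue onset: the prediction-error-like response reflects acquired cue value
    # (baseline expectation before the cue is assumed to be zero).
    delta_cue.append(V_cue - 0.0)
    # Reward delivery: error is the reward minus what the cue already predicted.
    d_rew = 1.0 - V_cue
    delta_reward.append(d_rew)
    V_cue += alpha * d_rew           # learning transfers the signal to the cue

print("reward PE, first vs last trial:", delta_reward[0], round(delta_reward[-1], 3))
print("cue PE, first vs last trial:   ", delta_cue[0], round(delta_cue[-1], 3))
```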


2021 ◽  
pp. 1-31
Author(s):  
Germain Lefebvre ◽  
Christopher Summerfield ◽  
Rafal Bogacz

Abstract Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-armed bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.
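
A minimal simulation of the confirmatory-bias rule described above is sketched below. The softmax choice rule, the assumption that counterfactual feedback for the unchosen option is shown, and the particular learning rates are illustrative choices rather than the authors' exact settings.

```python
import numpy as np

def run_bandit(alpha_conf, alpha_disc, n_trials=1000, beta=5.0, seed=0):
    """Average reward harvested on a two-armed bandit with asymmetric updating."""
    rng = np.random.default_rng(seed)
    p = np.array([0.6, 0.4])                 # true reward probabilities
    Q = np.array([0.5, 0.5])
    total = 0.0
    for _ in range(n_trials):
        probs = np.exp(beta * Q) / np.exp(beta * Q).sum()   # softmax choice
        c = int(rng.choice(2, p=probs))
        u = 1 - c
        r = (rng.random(2) < p).astype(float)               # outcomes of both arms
        total += r[c]
        # Chosen option: larger rate for confirming (positive) prediction errors.
        pe_c = r[c] - Q[c]
        Q[c] += (alpha_conf if pe_c > 0 else alpha_disc) * pe_c
        # Unchosen option (counterfactual feedback assumed): the converse asymmetry.
        pe_u = r[u] - Q[u]
        Q[u] += (alpha_disc if pe_u > 0 else alpha_conf) * pe_u
    return total / n_trials

print("confirmatory bias:", round(run_bandit(alpha_conf=0.3, alpha_disc=0.1), 3))
print("unbiased:         ", round(run_bandit(alpha_conf=0.2, alpha_disc=0.2), 3))
```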


eLife ◽  
2015 ◽  
Vol 4 ◽  
Author(s):  
Nicholas T Franklin ◽  
Michael J Frank

Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning Marr's three levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments.
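
The control principle, an effective learning rate that scales with current uncertainty, can be sketched as below. This is not the published neural model: the running uncertainty estimate and the gain function mapping it onto the learning rate are illustrative assumptions standing in for the TAN-mediated mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = np.concatenate([np.full(200, 0.8), np.full(200, 0.2)])   # change-point at trial 200
v, unc = 0.5, 0.5
estimates = []

for p in p_true:
    r = float(rng.random() < p)
    pe = r - v
    unc = 0.95 * unc + 0.05 * abs(pe)      # running estimate of recent unexpectedness
    alpha_eff = 0.05 + 0.5 * unc           # TAN-like gain on the effective learning rate
    v += alpha_eff * pe
    estimates.append(v)

print("estimate just before / well after the change-point:",
      round(estimates[195], 2), round(estimates[260], 2))
```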


2018 ◽  
Author(s):  
Joanne C. Van Slooten ◽  
Sara Jahfari ◽  
Tomas Knapen ◽  
Jan Theeuwes

Abstract Pupil responses have been used to track cognitive processes during decision-making. Studies have shown that in these cases the pupil reflects the joint activation of many cortical and subcortical brain regions, including those traditionally implicated in value-based learning. However, how the pupil tracks value-based decisions and reinforcement learning is unknown. We combined a reinforcement learning task with a computational model to study pupil responses during value-based decisions and decision evaluations. We found that the pupil closely tracks reinforcement learning both across trials and across participants. Prior to choice, the pupil dilated as a function of trial-by-trial fluctuations in value beliefs. After feedback, early dilation scaled with value uncertainty, whereas later constriction scaled with reward prediction errors. Our computational approach systematically implicates the pupil in value-based decisions and in the subsequent processing of violated value beliefs. These dissociable influences provide an exciting possibility to non-invasively study ongoing reinforcement learning in the pupil.
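
To make the modeling approach concrete, the sketch below shows how trial-by-trial regressors of the kind described above (pre-choice value beliefs, post-feedback uncertainty, and reward prediction errors) could be generated by a simple learner and then regressed against pupil size. It is an assumption-level illustration, not the study's analysis pipeline; the random choice rule and the Bernoulli uncertainty term are simplifications.

```python
import numpy as np

rng = np.random.default_rng(3)
p_reward = np.array([0.7, 0.3])       # hypothetical reward probabilities of two options
Q = np.array([0.5, 0.5])
alpha = 0.1
value_belief, uncertainty, rpe = [], [], []

for _ in range(200):
    c = int(rng.random() < 0.5)                # simplified random choices
    value_belief.append(Q[c])                  # pre-choice regressor (value belief)
    uncertainty.append(Q[c] * (1 - Q[c]))      # post-feedback regressor (Bernoulli uncertainty)
    r = float(rng.random() < p_reward[c])
    delta = r - Q[c]
    rpe.append(delta)                          # post-feedback regressor (reward prediction error)
    Q[c] += alpha * delta

# These trial-by-trial arrays would then be regressed against pupil dilation/constriction.
print(np.round([value_belief[-1], uncertainty[-1], rpe[-1]], 3))
```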

