From internal models toward metacognitive AI

Author(s):  
Mitsuo Kawato ◽  
Aurelio Cortese

Abstract In several papers published in Biological Cybernetics in the 1980s and 1990s, Kawato and colleagues proposed computational models explaining how internal models are acquired in the cerebellum. These models were later supported by neurophysiological experiments using monkeys and neuroimaging experiments involving humans. These early studies influenced neuroscience from basic, sensory-motor control to higher cognitive functions. One of the most perplexing enigmas related to internal models is understanding the neural mechanisms that enable animals to learn high-dimensional problems from so few trials. Consciousness and metacognition, the ability to monitor one's own thoughts, may be part of the solution to this enigma. Based on literature reviews of the past 20 years, here we propose a computational neuroscience model of metacognition. The model comprises a modular hierarchical reinforcement-learning architecture of parallel and layered, generative-inverse model pairs. In the prefrontal cortex, a distributed executive network called the "cognitive reality monitoring network" (CRMN) orchestrates conscious involvement of generative-inverse model pairs in perception and action. Based on mismatches between computations by generative and inverse models, as well as reward prediction errors, the CRMN computes a "responsibility signal" that gates selection and learning of pairs in perception, action, and reinforcement learning. A high responsibility signal is given to the pairs that best capture the external world, that are competent in movements (small mismatch), and that are capable of reinforcement learning (small reward prediction error). The CRMN selects pairs with higher responsibility signals as objects of metacognition, and consciousness is determined by the entropy of responsibility signals across all pairs. This model could lead to new-generation AI that exhibits metacognition, consciousness, dimension reduction, selection of modules and corresponding representations, and learning from small samples. It may also lead to the development of a new scientific paradigm that enables the causal study of consciousness by combining the CRMN and decoded neurofeedback.
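
The responsibility computation described above can be illustrated with a small numerical sketch. The Python snippet below is a minimal illustration of the idea, not the authors' implementation: the function names, the precision parameter beta, and the way model mismatch and reward prediction error are combined are assumptions made for the example.

```python
import numpy as np

def responsibility_signals(model_mismatch, reward_pred_error, beta=1.0):
    """Softmax over (negative) errors: pairs with small generative-inverse mismatch
    and small reward prediction error receive high responsibility."""
    cost = np.asarray(model_mismatch) ** 2 + np.asarray(reward_pred_error) ** 2
    logits = -beta * cost
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

def responsibility_entropy(lam):
    """Entropy of the responsibility distribution across all pairs; in the CRMN
    proposal this quantity is tied to the degree of conscious involvement."""
    lam = np.asarray(lam)
    return float(-np.sum(lam * np.log(lam + 1e-12)))

# Example: three generative-inverse pairs competing to account for the current context.
lam = responsibility_signals(model_mismatch=[0.1, 0.8, 1.2],
                             reward_pred_error=[0.05, 0.4, 0.9])
print(np.round(lam, 3), round(responsibility_entropy(lam), 3))
```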

2019 ◽  
Author(s):  
David J. Ottenheimer ◽  
Bilal A. Bari ◽  
Elissa Sutlief ◽  
Kurt M. Fraser ◽  
Tabitha H. Kim ◽  
...  

Abstract Learning from past interactions with the environment is critical for adaptive behavior. Within the framework of reinforcement learning, the nervous system builds expectations about future reward by computing reward prediction errors (RPEs), the difference between actual and predicted rewards. Correlates of RPEs have been observed in the midbrain dopamine system, which is thought to compute this important variable locally in the service of learning. However, the extent to which RPE signals may be computed upstream of the dopamine system is largely unknown. Here, we quantify history-based RPE signals in the ventral pallidum (VP), an input region to the midbrain dopamine system implicated in reward-seeking behavior. We trained rats to associate cues with future delivery of reward and fit computational models to predict individual neuron firing rates at the time of reward delivery. We found that a subset of VP neurons encoded RPEs, and did so more robustly than the nucleus accumbens, an input to VP. VP RPEs predicted trial-by-trial task engagement, and optogenetic inhibition of VP reduced subsequent task-related reward seeking. Consistent with reinforcement learning, activity of VP RPE cells adapted when rewards were delivered in blocks. We further found that history- and cue-based RPEs were largely separate across the VP neural population. The presence of behaviorally instructive RPE signals in the VP suggests a pivotal role for this region in value-based computations.
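
As a rough illustration of what a history-based RPE regressor of this kind looks like, the sketch below (not the authors' fitting code) forms a reward expectation as a recency-weighted average of past rewards and defines the RPE as the difference between delivered and expected reward; the learning rate alpha and the linear fit to a simulated firing rate are assumptions.

```python
import numpy as np

def history_rpe(rewards, alpha=0.2):
    """Per-trial RPEs under a simple delta-rule expectation built from reward history."""
    v, rpes = 0.0, []
    for r in rewards:
        rpe = r - v              # prediction error at reward delivery
        rpes.append(rpe)
        v += alpha * rpe         # update the expectation from reward history
    return np.array(rpes)

rng = np.random.default_rng(4)
rewards = rng.binomial(1, 0.6, size=200).astype(float)
rpes = history_rpe(rewards)

# Hypothetical test of RPE coding: regress a simulated neuron's firing rate on the RPE.
firing = 5.0 + 2.0 * rpes + rng.standard_normal(200)
slope, intercept = np.polyfit(rpes, firing, 1)
print(f"fitted RPE coefficient: {slope:.2f}")
```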


2020 ◽  
Author(s):  
Clay B. Holroyd ◽  
Tom Verguts

Despite continual debate over the past thirty years about the function of anterior cingulate cortex (ACC), its key contribution to neurocognition remains unknown. Here we review computational models that illustrate three core principles of ACC function (related to hierarchy, world models, and cost), as well as four constraints on the neural implementation of these principles (related to modularity, binding, encoding, and learning and regulation). These observations suggest a role for ACC in model-based hierarchical reinforcement learning, which instantiates a mechanism for motivating the execution of high-level plans.
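
A toy sketch of the hierarchical, model-based idea summarized above is given below; it is not a model from the review. A high-level controller chooses between extended plans (options) according to their predicted payoff minus an effort cost and updates plan-level values from the outcome of executing the whole plan. The option names, costs, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
payoff = {"forage_patch_A": 1.0, "forage_patch_B": 0.4}   # true mean payoffs (hypothetical)
cost = {"forage_patch_A": 0.3, "forage_patch_B": 0.1}     # effort cost of executing each plan
value = {o: 0.0 for o in payoff}                          # learned plan-level values
alpha = 0.1
names = list(payoff)

for episode in range(500):
    # High-level choice: softmax over net value (predicted payoff minus cost).
    net = np.array([value[o] - cost[o] for o in names])
    p = np.exp(net) / np.exp(net).sum()
    chosen = rng.choice(names, p=p)
    # Executing the entire plan yields a noisy outcome; update the plan's value.
    outcome = payoff[chosen] + 0.2 * rng.standard_normal()
    value[chosen] += alpha * (outcome - value[chosen])

print({o: round(v, 2) for o, v in value.items()})
```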


2019 ◽  
Author(s):  
Melissa J. Sharpe ◽  
Hannah M. Batchelor ◽  
Lauren E. Mueller ◽  
Chun Yun Chang ◽  
Etienne J.P. Maes ◽  
...  

Abstract Dopamine neurons fire transiently in response to unexpected rewards. These neural correlates are proposed to signal the reward prediction error described in model-free reinforcement learning algorithms. This error term represents the unpredicted or 'excess' value of the rewarding event. In model-free reinforcement learning, this value is then stored as part of the learned value of any antecedent cues, contexts, or events, making them intrinsically valuable, independent of the specific rewarding event that caused the prediction error. In support of equivalence between dopamine transients and this model-free error term, proponents cite causal optogenetic studies showing that artificially induced dopamine transients cause lasting changes in behavior. Yet none of these studies directly demonstrates the presence of cached value under conditions appropriate for associative learning. To address this gap in our knowledge, we conducted three studies in which we optogenetically activated dopamine neurons while rats were learning associative relationships, both with and without reward. In each experiment, the antecedent cues failed to acquire value and instead entered into value-independent associative relationships with the other cues or rewards. These results show that dopamine transients, constrained within appropriate learning situations, support valueless associative learning.
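
For readers unfamiliar with the model-free account under test, the sketch below illustrates what 'cached value' means: a temporal-difference update writes the prediction error into the value of the antecedent cue itself, independently of the specific reward. It is an assumption-level illustration of the hypothesis being tested, not the paper's result or code; all parameter values are arbitrary.

```python
import numpy as np

alpha, gamma = 0.2, 0.95
V = np.zeros(2)                      # cached scalar values of two serial cues

def td_update(cue, next_cue, reward):
    """Model-free caching: the cue inherits value via the prediction error,
    independently of which specific reward caused that error."""
    target = reward + (gamma * V[next_cue] if next_cue is not None else 0.0)
    delta = target - V[cue]          # reward prediction error
    V[cue] += alpha * delta
    return delta

# Cue 0 precedes cue 1, which precedes reward; with training, value propagates
# back so that both cues end up with cached value under this account.
for _ in range(100):
    td_update(1, None, reward=1.0)   # cue 1 -> reward
    td_update(0, 1, reward=0.0)      # cue 0 -> cue 1
print(np.round(V, 2))
```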


2021 ◽  
Author(s):  
Virginie Patt ◽  
Daniela Palombo ◽  
Michael Esterman ◽  
Mieke Verfaellie

Simple probabilistic reinforcement learning is recognized as a striatum-based learning system but, in recent years, has also been associated with hippocampal involvement. The present study examined whether such involvement may be attributed to observation-based learning processes running in parallel with striatum-based reinforcement learning. A computational model of observation-based learning (OL), mirroring classic models of reinforcement-based learning (RL), was constructed and applied to the neuroimaging dataset of Palombo, Hayes, Reid, and Verfaellie (2019; Hippocampal contributions to value-based learning: Converging evidence from fMRI and amnesia. Cognitive, Affective & Behavioral Neuroscience, 19(3), 523–536). Results suggested that observation-based learning processes may indeed take place concomitantly with reinforcement learning and involve activation of the hippocampus and central orbitofrontal cortex (cOFC). However, rather than independent mechanisms running in parallel, the brain correlates of the OL and RL prediction errors indicated collaboration between systems, with direct implication of the hippocampus in computations of the discrepancy between the expected and actual reinforcing values of actions. These findings are consistent with previous accounts of a role for the hippocampus in encoding the strength of observed stimulus-outcome associations, with updating of such associations through striatal reinforcement-based computations. Additionally, enhanced negative prediction-error signaling was found in the anterior insula with greater use of OL over RL processes. This result may suggest an additional mode of collaboration between the OL and RL systems, implicating the error-monitoring network.
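
The contrast between the two learning processes can be sketched schematically as below. This is not the study's fitted model: the assumption that both stimulus-outcome pairings are observable on every trial, the random choice rule, and the learning rates are illustrative simplifications. The RL learner updates only the chosen option from its reward prediction error, whereas the OL learner updates stimulus-outcome association strengths directly from observation.

```python
import numpy as np

rng = np.random.default_rng(1)
p_reward = np.array([0.8, 0.2])      # two stimuli with fixed reward probabilities
V_rl = np.zeros(2)                   # action values updated by RL
W_ol = np.zeros(2)                   # stimulus-outcome strengths updated by OL
alpha_rl, alpha_ol = 0.1, 0.1

for t in range(300):
    choice = rng.integers(2)                         # random exploration for simplicity
    outcomes = (rng.random(2) < p_reward).astype(float)
    # RL: only the chosen option is updated, from its reward prediction error.
    rpe = outcomes[choice] - V_rl[choice]
    V_rl[choice] += alpha_rl * rpe
    # OL: observed stimulus-outcome pairings are updated regardless of choice
    # (illustrative assumption that both outcomes are observable).
    W_ol += alpha_ol * (outcomes - W_ol)

print("RL values:", np.round(V_rl, 2), "OL associations:", np.round(W_ol, 2))
```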


2014 ◽  
Vol 26 (3) ◽  
pp. 635-644 ◽  
Author(s):  
Olav E. Krigolson ◽  
Cameron D. Hassall ◽  
Todd C. Handy

Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors: discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833–1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129–141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769–776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679–709, 2002]. Here, we used the event-related brain potential (ERP) technique to demonstrate not only that rewards elicit a neural response akin to a prediction error, but also that with learning this signal rapidly diminishes and propagates back to the time of choice presentation. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component with a timing and topography similar to those of the feedback error-related negativity, which increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward prediction errors and the changes in amplitude of these prediction errors at the time of choice presentation and reward delivery. Our results provide further support that the computations underlying human learning and decision-making follow reinforcement learning principles.
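
The qualitative pattern described above, in which the prediction error at reward delivery shrinks while a value-related signal emerges at choice presentation, falls out of standard temporal-difference learning. The sketch below is an assumption-level illustration of that principle, not the authors' model; the learning rate and trial count are arbitrary.

```python
import numpy as np

alpha, n_trials = 0.15, 60
V_cue = 0.0                          # learned value at choice/cue presentation
delta_cue, delta_reward = [], []

for _ in range(n_trials):
    # Cue onset: the prediction-error-like response reflects acquired cue value
    # (baseline expectation before the cue is assumed to be zero).
    delta_cue.append(V_cue - 0.0)
    # Reward delivery: error is the reward minus what the cue already predicted.
    d_rew = 1.0 - V_cue
    delta_reward.append(d_rew)
    V_cue += alpha * d_rew           # learning transfers the signal to the cue

print("reward PE, first vs last trial:", delta_reward[0], round(delta_reward[-1], 3))
print("cue PE, first vs last trial:   ", delta_cue[0], round(delta_cue[-1], 3))
```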


2021 ◽  
pp. 1-31
Author(s):  
Germain Lefebvre ◽  
Christopher Summerfield ◽  
Rafal Bogacz

Abstract Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-armed bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.
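
A minimal simulation of the confirmatory-bias rule described above is sketched below. The softmax choice rule, the assumption that counterfactual feedback for the unchosen option is shown, and the particular learning rates are illustrative choices rather than the authors' exact settings.

```python
import numpy as np

def run_bandit(alpha_conf, alpha_disc, n_trials=1000, beta=5.0, seed=0):
    """Average reward harvested on a two-armed bandit with asymmetric updating."""
    rng = np.random.default_rng(seed)
    p = np.array([0.6, 0.4])                 # true reward probabilities
    Q = np.array([0.5, 0.5])
    total = 0.0
    for _ in range(n_trials):
        probs = np.exp(beta * Q) / np.exp(beta * Q).sum()   # softmax choice
        c = int(rng.choice(2, p=probs))
        u = 1 - c
        r = (rng.random(2) < p).astype(float)               # outcomes of both arms
        total += r[c]
        # Chosen option: larger rate for confirming (positive) prediction errors.
        pe_c = r[c] - Q[c]
        Q[c] += (alpha_conf if pe_c > 0 else alpha_disc) * pe_c
        # Unchosen option (counterfactual feedback assumed): the converse asymmetry.
        pe_u = r[u] - Q[u]
        Q[u] += (alpha_disc if pe_u > 0 else alpha_conf) * pe_u
    return total / n_trials

print("confirmatory bias:", round(run_bandit(alpha_conf=0.3, alpha_disc=0.1), 3))
print("unbiased:         ", round(run_bandit(alpha_conf=0.2, alpha_disc=0.2), 3))
```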


eLife ◽  
2015 ◽  
Vol 4 ◽  
Author(s):  
Nicholas T Franklin ◽  
Michael J Frank

Convergent evidence suggests that the basal ganglia support reinforcement learning by adjusting action values according to reward prediction errors. However, adaptive behavior in stochastic environments requires the consideration of uncertainty to dynamically adjust the learning rate. We consider how cholinergic tonically active interneurons (TANs) may endow the striatum with such a mechanism in computational models spanning Marr's three levels of analysis. In the neural model, TANs modulate the excitability of spiny neurons, their population response to reinforcement, and hence the effective learning rate. Long TAN pauses facilitated robustness to spurious outcomes by increasing divergence in synaptic weights between neurons coding for alternative action values, whereas short TAN pauses facilitated stochastic behavior but increased responsiveness to change-points in outcome contingencies. A feedback control system allowed TAN pauses to be dynamically modulated by uncertainty across the spiny neuron population, allowing the system to self-tune and optimize performance across stochastic environments.
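
The control principle, an effective learning rate that scales with current uncertainty, can be sketched as below. This is not the published neural model: the running uncertainty estimate and the gain function mapping it onto the learning rate are illustrative assumptions standing in for the TAN-mediated mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = np.concatenate([np.full(200, 0.8), np.full(200, 0.2)])   # change-point at trial 200
v, unc = 0.5, 0.5
estimates = []

for p in p_true:
    r = float(rng.random() < p)
    pe = r - v
    unc = 0.95 * unc + 0.05 * abs(pe)      # running estimate of recent unexpectedness
    alpha_eff = 0.05 + 0.5 * unc           # TAN-like gain on the effective learning rate
    v += alpha_eff * pe
    estimates.append(v)

print("estimate just before / well after the change-point:",
      round(estimates[195], 2), round(estimates[260], 2))
```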


2018 ◽  
Author(s):  
Joanne C. Van Slooten ◽  
Sara Jahfari ◽  
Tomas Knapen ◽  
Jan Theeuwes

Abstract Pupil responses have been used to track cognitive processes during decision-making. Studies have shown that in these cases the pupil reflects the joint activation of many cortical and subcortical brain regions, including those traditionally implicated in value-based learning. However, how the pupil tracks value-based decisions and reinforcement learning is unknown. We combined a reinforcement learning task with a computational model to study pupil responses during value-based decisions and decision evaluations. We found that the pupil closely tracks reinforcement learning both across trials and across participants. Prior to choice, the pupil dilated as a function of trial-by-trial fluctuations in value beliefs. After feedback, early dilation scaled with value uncertainty, whereas later constriction scaled with reward prediction errors. Our computational approach systematically implicates the pupil in value-based decisions and in the subsequent processing of violated value beliefs. These dissociable influences provide an exciting possibility to non-invasively study ongoing reinforcement learning in the pupil.
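
To make the modeling approach concrete, the sketch below shows how trial-by-trial regressors of the kind described above (pre-choice value beliefs, post-feedback uncertainty, and reward prediction errors) could be generated by a simple learner and then regressed against pupil size. It is an assumption-level illustration, not the study's analysis pipeline; the random choice rule and the Bernoulli uncertainty term are simplifications.

```python
import numpy as np

rng = np.random.default_rng(3)
p_reward = np.array([0.7, 0.3])       # hypothetical reward probabilities of two options
Q = np.array([0.5, 0.5])
alpha = 0.1
value_belief, uncertainty, rpe = [], [], []

for _ in range(200):
    c = int(rng.random() < 0.5)                # simplified random choices
    value_belief.append(Q[c])                  # pre-choice regressor (value belief)
    uncertainty.append(Q[c] * (1 - Q[c]))      # post-feedback regressor (Bernoulli uncertainty)
    r = float(rng.random() < p_reward[c])
    delta = r - Q[c]
    rpe.append(delta)                          # post-feedback regressor (reward prediction error)
    Q[c] += alpha * delta

# These trial-by-trial arrays would then be regressed against pupil dilation/constriction.
print(np.round([value_belief[-1], uncertainty[-1], rpe[-1]], 3))
```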

