The Tortoise and the Hare: Interactions between Reinforcement Learning and Working Memory

2018
Vol 30 (10)
pp. 1422-1432
Author(s):  
Anne G. E. Collins

Learning to make rewarding choices in response to stimuli depends on a slow but steady process, reinforcement learning, and a fast and flexible, but capacity-limited process, working memory. Using both systems in parallel, with their contributions weighted based on performance, should allow us to leverage the best of each system: rapid early learning, supplemented by long-term robust acquisition. However, this assumes that using one process does not interfere with the other. We use computational modeling to investigate the interactions between the two processes in a behavioral experiment and show that working memory interferes with reinforcement learning. Previous research showed that neural representations of reward prediction errors, a key marker of reinforcement learning, were blunted when working memory was used for learning. We thus predicted that arbitrating in favor of working memory to learn faster in simple problems would weaken the reinforcement learning process. We tested this by measuring performance in a delayed testing phase where the use of working memory was impossible, and thus participant choices depended on reinforcement learning. Counterintuitively, but confirming our predictions, we observed that associations learned most easily were retained worse than associations learned slower: Using working memory to learn quickly came at the cost of long-term retention. Computational modeling confirmed that this could only be accounted for by working memory interference in reinforcement learning computations. These results further our understanding of how multiple systems contribute in parallel to human learning and may have important applications for education and computational psychiatry.
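
The weighted-mixture account described in this abstract is often formalized as a hybrid of an incremental Q-learner and a fast, capacity-limited working-memory store. The sketch below is a minimal, illustrative Python version of such a hybrid; the parameter names (`alpha`, `rho`, `K`, `eta_interference`), the decay-toward-uniform WM rule, and the way WM reliance attenuates the RL learning rate are assumptions chosen to mirror the abstract's claims, not the paper's fitted model.

```python
import numpy as np

def simulate_rlwm(stimuli, n_actions=3, alpha=0.1, rho=0.9, K=3,
                  decay=0.05, beta=8.0, eta_interference=0.5, rng=None):
    """Minimal RL + working-memory hybrid learner (illustrative sketch).

    RL: incremental Q-learning with learning rate `alpha`.
    WM: one-shot storage of the most recent outcome per stimulus,
        weighted by capacity `K` relative to the number of stimuli.
    Interference: when WM reliance is high, the effective RL learning
    rate is reduced by `eta_interference` (one way to capture the
    blunted prediction-error updating described in the abstract).
    """
    rng = rng or np.random.default_rng(0)
    n_stim = len(set(stimuli))                          # stimuli coded 0..n_stim-1
    Q = np.ones((n_stim, n_actions)) / n_actions        # RL values
    W = np.ones((n_stim, n_actions)) / n_actions        # WM "values"
    w_wm = rho * min(1.0, K / n_stim)                   # reliance on WM
    correct_action = {s: rng.integers(n_actions) for s in range(n_stim)}
    rewards = []
    for s in stimuli:
        # Softmax policy for each module, then a weighted mixture
        p_rl = np.exp(beta * Q[s]) / np.exp(beta * Q[s]).sum()
        p_wm = np.exp(beta * W[s]) / np.exp(beta * W[s]).sum()
        policy = w_wm * p_wm + (1 - w_wm) * p_rl
        a = rng.choice(n_actions, p=policy)
        r = 1.0 if a == correct_action[s] else 0.0
        rewards.append(r)
        # RL update, attenuated in proportion to WM reliance
        alpha_eff = alpha * (1 - eta_interference * w_wm)
        Q[s, a] += alpha_eff * (r - Q[s, a])
        # WM update: store the outcome at once, then decay toward uniform
        W[s, a] = r
        W += decay * (1.0 / n_actions - W)
    return np.array(rewards)
```

For example, `simulate_rlwm([0, 1, 2] * 20)` would run sixty trials over three stimuli; with `eta_interference > 0`, the Q-values that would support a delayed test grow more slowly whenever the WM weight is high, which is the interference pattern the abstract reports.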

2019
Author(s):  
Erdem Pulcu

We are living in a dynamic world in which stochastic relationships between cues and outcome events create different sources of uncertainty [1] (e.g., the fact that not all grey clouds bring rain). Living in an uncertain world continuously probes learning systems in the brain, guiding agents to make better decisions. This is a type of value-based decision-making which is very important for survival in the wild and long-term evolutionary fitness. Consequently, reinforcement learning (RL) models describing cognitive/computational processes underlying learning-based adaptations have been pivotal in the behavioural [2,3] and neural sciences [4–6], as well as in machine learning [7,8]. This paper demonstrates the suitability of novel update rules for RL, based on a nonlinear relationship between prediction errors (i.e., the difference between the agent’s expectation and the actual outcome) and learning rates (i.e., a coefficient with which agents update their beliefs about the environment), that can account for learning-based adaptations in the face of environmental uncertainty. These models illustrate how learners can flexibly adapt to dynamically changing environments.
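
As a rough, self-contained illustration of an update rule in which the learning rate depends nonlinearly on the prediction error, the sketch below maps the absolute prediction error through a sigmoid; the functional form and the parameters `base_lr`, `kappa`, and `s` are assumptions for illustration, not the specific update rules proposed in the paper.

```python
import numpy as np

def nonlinear_lr_update(value, outcome, kappa=1.0, s=4.0, base_lr=0.1):
    """One value update where the learning rate grows nonlinearly
    with the magnitude of the prediction error (illustrative sketch).

    value   : current expectation about the outcome
    outcome : observed outcome on this trial
    kappa   : gain on the error-dependent part of the learning rate
    s       : slope of the sigmoid mapping |PE| -> learning rate
    """
    pe = outcome - value                        # prediction error
    # Sigmoid of |PE|: small errors barely move the estimate,
    # large (surprising) errors produce fast updating.
    lr = base_lr + kappa * (2.0 / (1.0 + np.exp(-s * abs(pe))) - 1.0)
    lr = min(lr, 1.0)                           # keep the rate a valid weight
    return value + lr * pe, lr

# Example: a surprising outcome is absorbed faster than a small deviation
v = 0.5
v, lr_small = nonlinear_lr_update(v, 0.55)      # |PE| = 0.05 -> small rate (~0.2)
v, lr_large = nonlinear_lr_update(v, 1.0)       # |PE| ~ 0.45 -> large rate (~0.8)
```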


2017
Vol 29 (10)
pp. 1646-1655
Author(s):  
Anne G. E. Collins

Human learning is highly efficient and flexible. A key contributor to this learning flexibility is our ability to generalize new information across contexts that we know require the same behavior and to transfer rules to new contexts we encounter. To do this, we structure the information we learn and represent it hierarchically as abstract, context-dependent rules that constrain lower-level stimulus–action–outcome contingencies. Previous research showed that humans create such structure even when it is not needed, presumably because it usually affords long-term generalization benefits. However, computational models predict that creating structure is costly, with slower learning and slower RTs. We tested this prediction in a new behavioral experiment. Participants learned to select correct actions for four visual patterns, in a setting that either afforded (but did not promote) structure learning or enforced nonhierarchical learning, while controlling for the difficulty of the learning problem. Results replicated our previous finding that healthy young adults create structure even when unneeded and that this structure affords later generalization. Furthermore, they supported our prediction that structure learning incurred a major learning cost and that this cost was specifically tied to the effort in selecting abstract rules, leading to more errors when applying those rules. These findings confirm our theory that humans pay a high short-term cost in learning structure to enable longer-term benefits in learning flexibility.
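
For readers unfamiliar with what "creating structure" means computationally, the fragment below contrasts a flat learner, which values each context-stimulus-action conjunction directly, with a hierarchical learner that first selects an abstract rule (task set) for the context and only then maps the stimulus to an action. It is a schematic sketch under assumed names (`Q_flat`, `Q_rule_given_context`, `Q_action_given_rule`), not the model used in the paper; its only purpose is to show where the extra rule-selection step, and hence the extra cost, enters.

```python
import numpy as np

rng = np.random.default_rng(1)
n_contexts, n_stimuli, n_actions, n_rules = 2, 4, 4, 3
beta, alpha = 6.0, 0.2

# Flat learner: one value per (context, stimulus, action) conjunction.
Q_flat = np.zeros((n_contexts, n_stimuli, n_actions))

# Hierarchical learner: contexts select among abstract rules (task sets),
# and each rule is its own stimulus -> action mapping, reusable across contexts.
Q_rule_given_context = np.zeros((n_contexts, n_rules))
Q_action_given_rule = np.zeros((n_rules, n_stimuli, n_actions))

def softmax_choice(values):
    p = np.exp(beta * values)
    p /= p.sum()
    return rng.choice(len(values), p=p), p

def hierarchical_choice(context, stimulus):
    # Extra selection step: pick an abstract rule first, then the action.
    # Errors can now arise at the rule level as well as the action level.
    rule, _ = softmax_choice(Q_rule_given_context[context])
    action, _ = softmax_choice(Q_action_given_rule[rule, stimulus])
    return rule, action

def hierarchical_update(context, stimulus, rule, action, reward):
    # Credit both levels of the hierarchy with the same outcome.
    Q_rule_given_context[context, rule] += alpha * (reward - Q_rule_given_context[context, rule])
    Q_action_given_rule[rule, stimulus, action] += alpha * (reward - Q_action_given_rule[rule, stimulus, action])
```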


2013
Vol 109 (4)
pp. 1140-1151
Author(s):
Hiroshi Yamada
Hitoshi Inokawa
Naoyuki Matsumoto
Yasumasa Ueda
Kazuki Enomoto
...  

Decisions maximizing benefits involve a tradeoff between the quantity of a reward and the cost of elapsed time until an animal receives it. The estimation of long-term reward values is critical to attain the most desirable outcomes over a certain period of time. Reinforcement learning theories have established algorithms to estimate the long-term reward values of multiple future rewards in which the values of future rewards are discounted as a function of how many steps of choices are necessary to achieve them. Here, we report that presumed striatal projection neurons represent the long-term values of multiple future rewards estimated by a standard reinforcement learning model while monkeys are engaged in a series of trial-and-error choices and adaptive decisions for multiple rewards. We found that the magnitude of activity of a subset of neurons was positively correlated with the long-term reward values, and that of another subset of neurons was negatively correlated throughout the entire decision-making process in individual trials: from the start of the task trial, estimation of the values and their comparison among alternatives, choice execution, and evaluation of the received rewards. An idiosyncratic finding was that neurons showing negative correlations represented reward values in the near future (high discounting), while neurons showing positive correlations represented reward values not only in the near future, but also in the far future (low discounting). These findings provide a new insight that long-term value signals are embedded in two subsets of striatal neurons as high and low discounting of multiple future rewards.
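
The "long-term value" estimated by a standard reinforcement learning model here is the familiar discounted sum of future rewards, V = Σ_k γ^k r_k. The snippet below, with made-up reward numbers, simply contrasts a steep discount (heavily weighting the near future, as for the negatively correlated neurons) with a shallow discount (retaining weight on the far future, as for the positively correlated neurons).

```python
def discounted_value(future_rewards, gamma):
    """Standard RL long-term value: each future reward is discounted by
    gamma once for every additional step needed to reach it."""
    return sum(r * gamma ** k for k, r in enumerate(future_rewards))

rewards_ahead = [0.0, 1.0, 0.0, 2.0]                 # hypothetical rewards 0-3 steps ahead
print(discounted_value(rewards_ahead, gamma=0.3))    # high discounting: ~0.35, near future dominates
print(discounted_value(rewards_ahead, gamma=0.9))    # low discounting: ~2.36, far rewards still count
```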


2020
Author(s):
Thilo Womelsdorf
Marcus R. Watson
Paul Tiesinga

Flexible learning of changing reward contingencies can be realized with different strategies. A fast learning strategy involves using working memory of recently rewarded objects to guide choices. A slower learning strategy uses prediction errors to gradually update value expectations to improve choices. How the fast and slow strategies work together in scenarios with real-world stimulus complexity is not well known. Here, we disentangle their relative contributions in rhesus monkeys while they learned the relevance of object features at variable attentional load. We found that learning behavior across six subjects is consistently best predicted with a model combining (i) fast working memory, (ii) slower reinforcement learning from differently weighted positive and negative prediction errors, (iii) selective suppression of non-chosen feature values, and (iv) a meta-learning mechanism adjusting exploration rates based on a memory trace of recent errors. These mechanisms cooperate differently at low and high attentional loads. While working memory was essential for efficient learning at lower attentional loads, enhanced weighting of negative prediction errors and meta-learning were essential for efficient learning at higher attentional loads. Together, these findings pinpoint a canonical set of learning mechanisms and demonstrate how they cooperate when subjects flexibly adjust to environments with variable real-world attentional demands.

Significance statement: Learning which visual features are relevant for achieving our goals is challenging in real-world scenarios with multiple distracting features and feature dimensions. It is known that in such scenarios learning benefits significantly from attentional prioritization. Here we show that beyond attention, flexible learning uses a working memory system, a separate learning gain for avoiding negative outcomes, and a meta-learning process that adaptively increases exploration rates whenever errors accumulate. These subcomponent processes of cognitive flexibility depend on distinct learning signals that operate at varying timescales, including the most recent reward outcome (for working memory), memories of recent outcomes (for adjusting exploration), and reward prediction errors (for attention-augmented reinforcement learning). These results illustrate the specific mechanisms that cooperate during cognitive flexibility.


2021
pp. 1-29
Author(s):
Thilo Womelsdorf
Marcus R. Watson
Paul Tiesinga

Flexible learning of changing reward contingencies can be realized with different strategies. A fast learning strategy involves using working memory of recently rewarded objects to guide choices. A slower learning strategy uses prediction errors to gradually update value expectations to improve choices. How the fast and slow strategies work together in scenarios with real-world stimulus complexity is not well known. Here, we aim to disentangle their relative contributions in rhesus monkeys while they learned the relevance of object features at variable attentional load. We found that learning behavior across six monkeys is consistently best predicted with a model combining (i) fast working memory and (ii) slower reinforcement learning from differently weighted positive and negative prediction errors as well as (iii) selective suppression of nonchosen feature values and (iv) a meta-learning mechanism that enhances exploration rates based on a memory trace of recent errors. The optimal model parameter settings suggest that these mechanisms cooperate differently at low and high attentional loads. Whereas working memory was essential for efficient learning at lower attentional loads, enhanced weighting of negative prediction errors and meta-learning were essential for efficient learning at higher attentional loads. Together, these findings pinpoint a canonical set of learning mechanisms and suggest how they may cooperate when subjects flexibly adjust to environments with variable real-world attentional demands.
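
To make the model components listed above concrete, the sketch below implements one trial of a feature-value learner with (ii) asymmetric learning rates for positive and negative prediction errors, (iii) decay of non-chosen feature values, and (iv) an error-trace-driven adjustment of the softmax precision; the fast working-memory component (i) is omitted for brevity. All parameter names and functional forms are assumptions for illustration, not the fitted model from the paper.

```python
def update_feature_learner(values, chosen, nonchosen, reward, error_trace,
                           alpha_pos=0.2, alpha_neg=0.4, omega=0.3,
                           beta0=5.0, kappa=4.0, tau=0.2):
    """One trial of a feature-value learner combining three of the abstract's
    mechanisms (illustrative sketch, not the fitted model).

    values      : dict mapping feature -> current value expectation
    chosen      : features of the chosen object
    nonchosen   : features of the objects that were not chosen
    reward      : 1.0 (rewarded) or 0.0 (not rewarded)
    error_trace : running memory of recent errors (meta-learning signal)
    """
    # (ii) RL with differently weighted positive vs. negative prediction errors
    for f in chosen:
        pe = reward - values[f]
        values[f] += (alpha_pos if pe >= 0 else alpha_neg) * pe

    # (iii) selective suppression (decay) of non-chosen feature values
    for f in nonchosen:
        values[f] -= omega * values[f]

    # (iv) meta-learning: accumulate recent errors and lower the softmax
    # precision (i.e., explore more) when errors pile up
    error_trace = (1 - tau) * error_trace + tau * (1.0 - reward)
    beta_eff = beta0 / (1.0 + kappa * error_trace)

    return values, error_trace, beta_eff

# Hypothetical usage with made-up feature values for one unrewarded trial:
values = {"red": 0.4, "square": 0.4, "blue": 0.1, "circle": 0.1}
values, trace, beta = update_feature_learner(values, chosen=["red", "square"],
                                             nonchosen=["blue", "circle"],
                                             reward=0.0, error_trace=0.0)
```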


2005
Vol 17 (7)
pp. 994-1010
Author(s):
Charan Ranganath
Michael X. Cohen
Craig J. Brozinsky

Theories of human memory have led to conflicting views regarding the relationship between working memory (WM) maintenance and episodic long-term memory (LTM) formation. Here, we tested the prediction that WM maintenance operates in two stages, and that processing during the initial stage of WM maintenance promotes successful LTM formation. Results from a functional magnetic resonance imaging study showed that activity in the dorsolateral prefrontal cortex and hippocampus during the initial stage of WM maintenance was predictive of subsequent LTM performance. In a behavioral experiment, we demonstrated that interfering with processing during the initial stage of WM maintenance impaired LTM formation. These results demonstrate that processing during the initial stage of WM maintenance directly contributes to successful LTM formation, and that this effect is mediated by a network that includes the dorsolateral prefrontal cortex and the hippocampus.


2019
Author(s):
Sarah L. Master
Maria K. Eckstein
Neta Gotlieb
Ronald Dahl
Linda Wilbrecht
...  

Multiple neurocognitive systems contribute simultaneously to learning. For example, dopamine and basal ganglia (BG) systems are thought to support reinforcement learning (RL) by incrementally updating the value of choices, while the prefrontal cortex (PFC) contributes different computations, such as actively maintaining precise information in working memory (WM). It is commonly thought that WM and PFC show more protracted development than RL and BG systems, yet their contributions are rarely assessed in tandem. Here, we used a simple learning task to test how RL and WM contribute to changes in learning across adolescence. We tested 187 subjects ages 8 to 17 and 53 adults (ages 25-30). Participants learned stimulus-action associations from feedback; the learning load was varied to be within or to exceed WM capacity. Participants aged 8-12 learned more slowly than participants aged 13-17 and were more sensitive to load. We used computational modeling to estimate subjects’ use of WM and RL processes. Surprisingly, we found more robust changes in RL than in WM during development. RL learning rate increased significantly with age across adolescence, whereas WM parameters showed more subtle changes, many of them early in adolescence. These results underscore the importance of changes in RL processes for the developmental science of learning.

Highlights
- Subjects combine reinforcement learning (RL) and working memory (WM) to learn
- Computational modeling shows RL learning rates grew with age during adolescence
- When load was beyond WM capacity, weaker RL compensated less in younger adolescents
- WM parameters showed subtler and more puberty-related changes
- WM reliance, maintenance, and capacity had separable developmental trajectories
- Underscores importance of RL processes in developmental changes in learning
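
A common way to express the "within or exceeding WM capacity" manipulation in this class of models is to let the weight placed on the WM policy shrink once the number of stimulus-action associations exceeds capacity. The form below, a Collins-style `rho * min(1, K / set_size)` weight, is an illustrative assumption, not necessarily the exact parameterization fitted here.

```python
def wm_mixture_weight(set_size, capacity, rho):
    """Weight on the working-memory policy in an RL + WM mixture:
    full reliance `rho` while the associations fit in WM, scaled
    down once the load exceeds capacity (illustrative form)."""
    return rho * min(1.0, capacity / set_size)

# With capacity 3 and WM reliance 0.85, doubling the load beyond
# capacity halves the WM weight, so choices lean more on RL values.
print(wm_mixture_weight(set_size=3, capacity=3, rho=0.85))   # within capacity -> 0.85
print(wm_mixture_weight(set_size=6, capacity=3, rho=0.85))   # load exceeds capacity -> 0.425
```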


2019
Author(s):  
Franziska R. Richter

Memory schemas are higher-level knowledge structures that store an abstraction of multiple previous experiences. They allow us to retain a multitude of information without the cost of storing every detail. Schemas are believed to be relatively stable, but occasionally have to be updated to remain useful in the face of changing environmental conditions. Once a schema is consolidated, schema updating has been proposed to be the result of a prediction-error (PE) based learning mechanism, similar to the updating of less complex knowledge. However, for schema memory this hypothesis has been difficult to test because sufficiently sensitive tools to track modifications to complex memory schemas have been lacking. Research on the updating of less complex beliefs, at much shorter time scales, has identified the P3 as an electrophysiological correlate of PE-induced belief updating. In this study, I recorded electroencephalography and continuous memory measures during the encoding of schema-consistent vs. schema-inconsistent material to test the behavioural and neural correlates of schema updating. I observed that PEs predicted the updating of a schema after a 24-hour delay, especially when participants were faced with inconsistent compared to consistent material. Moreover, the P3 amplitude tracked both the PE at the time of learning and the updating of the memory schema in the inconsistent condition. These results demonstrate that schema updating in the face of inconsistent information is driven by PE-based learning, and that similar neural mechanisms underlie the updating of consolidated long-term memory schemas and short-term belief structures.
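
Prediction-error-based updating of a continuous memory measure reduces, in its simplest form, to a delta rule; the toy example below (with an assumed learning rate and made-up values) shows how a larger PE from schema-inconsistent material produces a larger shift in the stored estimate.

```python
def update_schema(schema_estimate, observed_value, learning_rate=0.3):
    """Delta-rule sketch of PE-based schema updating: the stored estimate
    moves toward the observation in proportion to the prediction error
    (illustrative only, not the paper's fitted model)."""
    prediction_error = observed_value - schema_estimate
    return schema_estimate + learning_rate * prediction_error, prediction_error

schema = 0.2                                   # consolidated expectation
schema, pe = update_schema(schema, 0.8)        # inconsistent observation, PE = 0.6
# schema is now 0.38: the larger the PE at encoding, the larger the update
```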


2016
Vol 39
Author(s):  
Mary C. Potter

Rapid serial visual presentation (RSVP) of words or pictured scenes provides evidence for a large-capacity conceptual short-term memory (CSTM) that momentarily provides rich associated material from long-term memory, permitting rapid chunking (Potter 1993; 2009; 2012). In perception of scenes as well as language comprehension, we make use of knowledge that briefly exceeds the supposed limits of working memory.

