Learning and Forgetting Using Reinforced Bayesian Change Detection

2018 ◽  
Author(s):  
Vincent Moens ◽  
Alexandre Zénon

Abstract
Agents living in volatile environments must be able to detect changes in contingencies while refraining from adapting to unexpected events caused by noise. In Reinforcement Learning (RL) frameworks, this requires learning rates that adapt to the past reliability of the model. The observation that behavioural flexibility in animals tends to decrease following prolonged training in a stable environment provides experimental evidence for such adaptive learning rates. However, in classical RL models, the learning rate is either fixed or scheduled and thus cannot adapt dynamically to environmental changes. Here, we propose a new Bayesian learning model, using variational inference, that achieves adaptive change detection through Stabilized Forgetting, updating its current belief based on a mixture of fixed, initial priors and previous posterior beliefs. The weight given to these two sources is optimized alongside the other parameters, allowing the model to adapt dynamically to changes in environmental volatility and to unexpected observations. This approach is used to implement the “critic” of an actor-critic RL model, while the actor samples the resulting value distributions to choose which action to undertake. We show that our model can emulate different adaptation strategies to contingency changes, depending on its prior assumptions about environmental stability, and that model parameters can be fit to real data with high accuracy. The model also exhibits trade-offs between flexibility and computational costs that mirror those observed in real data. Overall, the proposed method provides a general framework to study learning flexibility and decision making in RL contexts.

Author summary
In stable contexts, animals and humans exhibit automatic behaviour that allows them to make fast decisions. However, these automatic processes exhibit a lack of flexibility when environmental contingencies change. In the present paper, we propose a model of behavioural automatization that is based on adaptive forgetting and that emulates these properties. The model builds an estimate of the stability of the environment and uses this estimate to adjust its learning rate and the balance between exploration and exploitation policies. The model performs Bayesian inference on latent variables that represent relevant environmental properties, such as reward functions, optimal policies, or environmental stability. From there, the model makes decisions in order to maximize long-term rewards, with noise proportional to environmental uncertainty. This rich model encompasses many aspects of Reinforcement Learning (RL), such as Temporal Difference RL and counterfactual learning, and accounts for the reduced computational cost of automatic behaviour. Using simulations, we show that this model leads to interesting predictions about the efficiency with which subjects adapt to sudden changes of contingencies after prolonged training.
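
The core idea of Stabilized Forgetting can be illustrated with a toy conjugate update. Below is a minimal sketch, assuming a Beta-Bernoulli belief over a single reward probability, a fixed forgetting weight w, and parameter-level mixing toward the initial prior; the actual model optimizes this weight and performs variational inference over full distributions, so the function name and parameter values here are illustrative only.

```python
import numpy as np

def stabilized_forgetting_beta(outcomes, w=0.9, a0=1.0, b0=1.0):
    """Toy sketch of adaptive forgetting for a Bernoulli reward belief.

    On each trial the running Beta posterior (a, b) is pulled back toward the
    fixed initial prior (a0, b0) with weight (1 - w) before the new outcome is
    observed.  In the paper w itself is inferred and the forgetting acts on
    whole distributions; mixing the Beta parameters with a fixed w is an
    illustrative simplification, not the authors' exact update.
    """
    a, b = a0, b0
    estimates = []
    for r in outcomes:                      # r in {0, 1}: reward or no reward
        a = w * a + (1.0 - w) * a0          # partial reset toward the initial prior
        b = w * b + (1.0 - w) * b0
        a, b = a + r, b + (1 - r)           # standard conjugate Bernoulli update
        estimates.append(a / (a + b))       # posterior mean reward probability
    return np.array(estimates)

# A contingency reversal at trial 100: the belief re-adapts because old
# evidence is gradually forgotten instead of accumulating without bound.
rng = np.random.default_rng(0)
outcomes = np.concatenate([rng.binomial(1, 0.8, 100), rng.binomial(1, 0.2, 100)])
print(stabilized_forgetting_beta(outcomes)[[50, 99, 120, 199]])
```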

2020 ◽  
pp. 1-22
Author(s):  
Luis E. Nieto-Barajas ◽  
Rodrigo S. Targino

ABSTRACT We propose a stochastic model for claims reserving that captures dependence along development years within a single triangle. This dependence is based on a gamma process with a moving average form of order $p \ge 0$, which is achieved through the use of Poisson latent variables. We carry out Bayesian inference on the model parameters and borrow strength across several triangles, coming from different lines of business or companies, through the use of hierarchical priors. We carry out a simulation study as well as a real data analysis. Results show that reserve estimates, for the real data set studied, are more accurate with our gamma dependence model than with the benchmark over-dispersed Poisson model that assumes independence.
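
As a rough illustration of how Poisson latent variables can induce dependence among gamma-distributed quantities, the sketch below simulates a first-order dependent gamma chain; the moving-average structure of order p, the reserving-triangle layout, and the hierarchical priors of the paper are not reproduced, and the construction shown is an assumed simplification.

```python
import numpy as np

def dependent_gamma_chain(T, a=2.0, b=1.0, c=1.0, seed=1):
    """Illustrative gamma chain made dependent through Poisson latent variables.

    theta_1 ~ Ga(a, b); u_t | theta_t ~ Po(c * theta_t);
    theta_{t+1} | u_t ~ Ga(a + u_t, b + c).  Each theta keeps the Ga(a, b)
    marginal while being positively correlated with its predecessor.
    """
    rng = np.random.default_rng(seed)
    theta = np.empty(T)
    theta[0] = rng.gamma(a, 1.0 / b)
    for t in range(T - 1):
        u = rng.poisson(c * theta[t])                  # Poisson latent links t and t+1
        theta[t + 1] = rng.gamma(a + u, 1.0 / (b + c))
    return theta

chain = dependent_gamma_chain(5000)
print(chain.mean(), np.corrcoef(chain[:-1], chain[1:])[0, 1])  # ~a/b and a positive lag-1 correlation
```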


2020 ◽  
Author(s):  
Jonathan W. Kanen ◽  
Qiang Luo ◽  
Mojtaba R. Kandroodi ◽  
Rudolf N. Cardinal ◽  
Trevor W. Robbins ◽  
...  

Abstract
The non-selective serotonin 2A (5-HT2A) receptor agonist lysergic acid diethylamide (LSD) holds promise as a treatment for some psychiatric disorders. Psychedelic drugs such as LSD have been suggested to exert therapeutic actions through their effects on learning. The behavioural effects of LSD in humans, however, remain largely unexplored. Here we examined how LSD affects probabilistic reversal learning in healthy humans. Conventional measures assessing sensitivity to immediate feedback (“win-stay” and “lose-shift” probabilities) were unaffected, whereas LSD increased the impact of the strength of initial learning on perseveration. Computational modelling revealed that the most pronounced effect of LSD was an enhancement of the reward learning rate. The punishment learning rate was also elevated. Increased reinforcement learning rates suggest that LSD induced a state of heightened plasticity. These results indicate a potential mechanism through which revision of maladaptive associations could occur.
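
For reference, the conventional feedback-sensitivity measures mentioned above can be computed directly from trial-by-trial data. The sketch below assumes a simple one-element-per-trial layout of binary choices and rewards; it is not the authors' analysis code.

```python
import numpy as np

def win_stay_lose_shift(choices, rewards):
    """Compute win-stay and lose-shift probabilities from binary choice and
    reward sequences (hypothetical data layout for illustration)."""
    choices = np.asarray(choices)
    rewards = np.asarray(rewards, dtype=bool)
    stay = choices[1:] == choices[:-1]          # did the participant repeat the choice?
    win_stay = stay[rewards[:-1]].mean()        # P(stay | previous trial rewarded)
    lose_shift = (~stay)[~rewards[:-1]].mean()  # P(shift | previous trial not rewarded)
    return win_stay, lose_shift

choices = [0, 0, 1, 1, 0, 0, 1]
rewards = [1, 1, 0, 1, 0, 1, 1]
print(win_stay_lose_shift(choices, rewards))
```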


2022 ◽  
Author(s):  
Chenxu Hao ◽  
Lilian E. Cabrera-Haro ◽  
Ziyong Lin ◽  
Patricia Reuter-Lorenz ◽  
Richard L. Lewis

To understand how acquired value impacts how we perceive and process stimuli, psychologists have developed the Value Learning Task (VLT; e.g., Raymond & O’Brien, 2009). The task consists of a series of trials in which participants attempt to maximize accumulated winnings as they make choices from a pair of presented images associated with probabilistic win, loss, or no-change outcomes. Despite the task having a symmetric outcome structure for win and loss pairs, people learn win associations better than loss associations (Lin, Cabrera-Haro, & Reuter-Lorenz, 2020). This asymmetry could lead to differences when the stimuli are probed in subsequent tasks, compromising inferences about how acquired value affects downstream processing. We investigate the nature of the asymmetry using a standard error-driven reinforcement learning model with a softmax choice rule. Despite having no special role for valence, the model yields the asymmetry observed in human behavior, whether the model parameters are set to maximize empirical fit or task payoff. The asymmetry arises from an interaction between a neutral initial value estimate and a choice policy that exploits while exploring, leading to more poorly discriminated value estimates for loss stimuli. We also show how differences in estimated individual learning rates help to explain individual differences in the observed win-loss asymmetries, and how the final value estimates produced by the model provide a simple account of a post-learning explicit value categorization task.
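
To make the mechanism concrete, here is a minimal sketch of an error-driven model with a softmax choice rule simulating one win pair and one loss pair; the outcome probabilities, learning rate, and inverse temperature are illustrative assumptions rather than fitted values from the paper.

```python
import numpy as np

def simulate_vlt_pair(high, low, p=0.8, n_trials=200, alpha=0.3, beta=5.0, seed=0):
    """Simulate one stimulus pair: option 0 yields `high` with probability p
    and `low` otherwise; option 1 has the reversed probabilities."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)                                   # neutral initial value estimates
    for _ in range(n_trials):
        p0 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))   # softmax over two options
        c = 0 if rng.random() < p0 else 1
        p_high = p if c == 0 else 1.0 - p
        outcome = high if rng.random() < p_high else low
        q[c] += alpha * (outcome - q[c])              # delta-rule update on the chosen option
    return q

# Win pair (+1 vs 0 outcomes) and loss pair (0 vs -1 outcomes): the loss-pair
# value estimates come out less well separated, reproducing the asymmetry.
print("win pair :", simulate_vlt_pair(high=1.0, low=0.0))
print("loss pair:", simulate_vlt_pair(high=0.0, low=-1.0))
```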


Algorithms ◽  
2020 ◽  
Vol 13 (9) ◽  
pp. 239 ◽  
Author(s):  
Menglin Li ◽  
Xueqiang Gu ◽  
Chengyi Zeng ◽  
Yuan Feng

Reinforcement learning, as a branch of machine learning, has gradually been applied in the control field. However, in practical applications, the network hyperparameters of deep reinforcement learning are still set by the empirical trial-and-error practices of traditional machine learning (supervised and unsupervised learning). This practice ignores part of the information generated as the agent explores the environment, which is contained in the updates of the reinforcement learning value function, and this affects the convergence and cumulative return of reinforcement learning. The reinforcement learning algorithm based on dynamic parameter adjustment is a new method for setting the learning rate of deep reinforcement learning. Building on the traditional parameter-setting approach, the method analyzes the advantages of different learning rates at different stages of reinforcement learning and dynamically adjusts the learning rate according to the temporal-difference (TD) error, so that each stage benefits from an appropriate learning rate and the algorithm behaves more sensibly in practical applications. At the same time, by combining the Robbins–Monro stochastic approximation algorithm with the deep reinforcement learning algorithm, it is proved that dynamically regulating the learning rate can theoretically satisfy the convergence requirements of the intelligent control algorithm. In the experiments, the effect of the method is analyzed in the continuous control scenario of the standard "Car-on-the-Hill" reinforcement learning benchmark, and it is verified that the new method achieves better results than traditional reinforcement learning in practical application. Based on the model characteristics of deep reinforcement learning, a more suitable method for setting the learning rate of the deep reinforcement learning network is proposed, and its feasibility is demonstrated both in theory and in application. Therefore, this method of setting the learning rate parameter is worthy of further development and research.
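
One simple way to couple the learning rate to the TD error, sketched below for a tabular Q-learning step, is to scale the step size with the magnitude of the error; the specific mapping, clipping bounds, and parameter values are assumptions made for illustration, not the adjustment rule derived in the paper.

```python
import numpy as np

def td_error_scaled_alpha(delta, base_alpha, lo=0.05, hi=0.5):
    """Illustrative schedule in the spirit of the abstract: a larger |TD error|
    suggests the estimate is still far off and gets a larger step size, a small
    TD error gets a smaller one.  The mapping below is an assumed stand-in."""
    scale = np.tanh(abs(delta))          # squash |TD error| into [0, 1)
    return np.clip(base_alpha * (0.5 + scale), lo, hi)

# One tabular Q-learning step with the dynamic rate (toy state/action spaces).
q = np.zeros((5, 2))                     # Q[state, action]
gamma = 0.95
s, a, r, s_next = 0, 1, 1.0, 2
delta = r + gamma * q[s_next].max() - q[s, a]   # temporal-difference error
alpha = td_error_scaled_alpha(delta, base_alpha=0.2)
q[s, a] += alpha * delta
print(alpha, q[s, a])
```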


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Michiyo Sugawara ◽  
Kentaro Katahira

Abstract
The learning rate is a key parameter in reinforcement learning that determines the extent to which novel information (the outcome) is incorporated in guiding subsequent actions. Numerous studies have reported that the magnitude of the learning rate in human reinforcement learning is biased depending on the sign of the reward prediction error. However, this asymmetry can be observed as a statistical bias if the fitted model ignores the choice autocorrelation (perseverance), which is independent of the outcomes. Therefore, to investigate the genuine process underlying human choice behavior using empirical data, one should dissociate asymmetry in learning and perseverance from choice behavior. The present study addresses this issue by using a Hybrid model incorporating asymmetric learning rates and perseverance. First, by conducting simulations, we demonstrate that the Hybrid model can identify the true underlying process. Second, using the Hybrid model, we show that empirical data collected from a web-based experiment are governed by perseverance rather than asymmetric learning. Finally, we apply the Hybrid model to two open datasets in which asymmetric learning was reported. As a result, the asymmetric learning rate was validated in one dataset but not in the other.
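
A hybrid of this kind can be written compactly as a value update with sign-dependent learning rates plus a decaying choice trace entering the softmax. The sketch below uses assumed parameter names (alpha_pos, alpha_neg, tau, phi) and an exponential trace; it illustrates the general idea rather than the exact model specification used in the study.

```python
import numpy as np

def hybrid_update(q, c, choice, reward, alpha_pos, alpha_neg, tau):
    """One trial of a hybrid model: asymmetric learning rates plus a choice
    trace (perseverance).  Parameter names and the exponential trace are
    assumptions, not necessarily the authors' exact formulation."""
    delta = reward - q[choice]
    alpha = alpha_pos if delta >= 0 else alpha_neg    # sign-dependent learning rate
    q[choice] += alpha * delta
    c *= (1.0 - tau)                                  # decay all choice traces
    c[choice] += tau                                  # reinforce the chosen option's trace
    return q, c

def choice_probs(q, c, beta, phi):
    """Softmax over value plus perseverance: P(a) proportional to exp(beta*Q[a] + phi*C[a])."""
    z = beta * q + phi * c
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

q, c = np.zeros(2), np.zeros(2)
q, c = hybrid_update(q, c, choice=0, reward=1.0, alpha_pos=0.4, alpha_neg=0.2, tau=0.3)
print(choice_probs(q, c, beta=3.0, phi=1.0))
```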


2017 ◽  
Author(s):  
Matthias Guggenmos ◽  
Philipp Sterzer

Abstract
It is well established that learning can occur without external feedback, yet normative reinforcement learning theories have difficulties explaining such instances of learning. Recently, we reported on a confidence-based reinforcement learning model for the model case of perceptual learning (Guggenmos, Wilbertz, Hebart, & Sterzer, 2016), according to which the brain capitalizes on internal monitoring processes when no external feedback is available. In the model, internal confidence prediction errors – the difference between current confidence and expected confidence – serve as teaching signals to guide learning. In the present paper, we explore an extension to this idea. The main idea is that the neural information processing pathways activated for a given sensory stimulus are subject to fluctuations, where some pathway configurations lead to higher confidence than others. Confidence prediction errors strengthen pathway configurations for which fluctuations lead to above-average confidence and weaken those that are associated with below-average confidence. We show through simulation that the model is capable of self-reinforced perceptual learning and can benefit from exploratory network fluctuations. In addition, by simulating different model parameters, we show that the ideal confidence-based learner should (i) exhibit high degrees of network fluctuation in the initial phase of learning, but reduced fluctuations as learning progresses, (ii) have a high learning rate for network updates for immediate performance gains, but a low learning rate for long-term maximum performance, and (iii) be neither too under- nor too overconfident. In sum, we present a model in which confidence prediction errors strengthen favorable network fluctuations and enable learning in the absence of external feedback. The model can be readily applied to data of real-world perceptual tasks in which observers provided both choice and confidence reports.
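
The teaching-signal logic can be captured in a few lines: a confidence prediction error reinforces whichever pathway configuration was active. The sketch below uses an assumed flat weight vector and a simple running estimate of expected confidence, which is a deliberate simplification of the network model simulated in the paper.

```python
import numpy as np

def confidence_pe_update(weights, active_pathway, confidence, expected_confidence,
                         alpha=0.1):
    """Toy version of the confidence-based update: pathway configurations that
    produced above-average confidence are strengthened, below-average ones are
    weakened.  The 'pathway weight' vector and this linear rule are
    illustrative simplifications."""
    cpe = confidence - expected_confidence            # confidence prediction error
    weights[active_pathway] += alpha * cpe            # reinforce or weaken that configuration
    expected_confidence += alpha * cpe                # track expected confidence
    return weights, expected_confidence

weights = np.zeros(3)                                 # three candidate pathway configurations
expected_confidence = 0.5
weights, expected_confidence = confidence_pe_update(weights, active_pathway=1,
                                                    confidence=0.8,
                                                    expected_confidence=expected_confidence)
print(weights, expected_confidence)
```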


2021 ◽  
Author(s):  
Stefano Palminteri

Do we preferentially learn from outcomes that confirm our choices? This is one of the most basic, and yet consequence-bearing, questions concerning reinforcement learning. In recent years, we investigated this question in a series of studies implementing increasingly complex behavioral protocols. The learning rates fitted in experiments featuring partial or complete feedback, as well as free and forced choices, were systematically found to be consistent with a choice-confirmation bias. This result is robust across a broad range of outcome contingencies and response modalities. One of the prominent behavioral consequences of the confirmatory learning rate pattern is choice hysteresis: that is, the tendency to repeat previous choices despite contradictory evidence. As robust and replicable as they have proven to be, these findings were (legitimately) challenged by a couple of studies pointing out that a choice-confirmatory pattern of learning rates may spuriously arise from not taking into consideration an explicit choice autocorrelation term in the model. In the present study, we re-analyze data from four previously published papers (nine experiments in total; N=363), originally included in the studies demonstrating (or criticizing) the choice-confirmation bias in human participants. We fitted two models: one featuring valence-specific updates (i.e., different learning rates for confirmatory and disconfirmatory outcomes) and one additionally including an explicit choice autocorrelation process (gradual perseveration). Our analysis confirms that the inclusion of the gradual perseveration process in the model significantly reduces the estimated choice-confirmation bias. However, across all considered experiments the choice-confirmation bias remains present at the meta-analytical level, and it is significantly different from zero in most individual experiments. Our results demonstrate that the choice-confirmation bias resists the inclusion of an explicit choice autocorrelation term, thus proving to be a robust feature of human reinforcement learning. We conclude by discussing the psychological plausibility of the gradual perseveration process in the context of these behavioral paradigms and by pointing to additional computational processes that may play an important role in estimating and interpreting the computational biases under scrutiny.
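
A valence-specific (confirmatory) update with full feedback can be sketched as follows; the two-option structure, parameter names, and the treatment of the forgone outcome are assumptions chosen for illustration and do not reproduce the exact model comparison in the paper.

```python
import numpy as np

def confirmatory_update(q, chosen, r_chosen, r_unchosen, alpha_conf, alpha_disc):
    """One full-feedback trial: outcomes that confirm the choice (good outcome
    for the chosen option, bad outcome for the forgone option) are learned with
    alpha_conf, disconfirming outcomes with alpha_disc."""
    unchosen = 1 - chosen
    d_c = r_chosen - q[chosen]                    # factual prediction error
    d_u = r_unchosen - q[unchosen]                # counterfactual prediction error
    q[chosen] += (alpha_conf if d_c >= 0 else alpha_disc) * d_c
    q[unchosen] += (alpha_conf if d_u < 0 else alpha_disc) * d_u
    return q

q = np.zeros(2)
q = confirmatory_update(q, chosen=0, r_chosen=1.0, r_unchosen=1.0,
                        alpha_conf=0.4, alpha_disc=0.1)
print(q)   # the chosen option's value moves more than the forgone option's
```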


2021 ◽  
Author(s):  
Mojtaba Rostami Kandroodi ◽  
Abdol-Hossein Vahabie ◽  
Sara Ahmadi ◽  
Babak Nadjar Araabi ◽  
Majid Nili Ahmadabadi

Abstract
The ability to predict the future is essential for decision-making and for interacting with the environment to avoid punishment and gain reward. Reinforcement learning algorithms provide a normative way for interactive learning, especially in volatile environments. The optimal strategy for the classic reinforcement learning model is to increase the learning rate as volatility increases. Inspired by the optimistic bias in humans, an alternative reinforcement learning model has been developed by adding a punishment learning rate to the classic reinforcement learning model. In this study, we aim to 1) compare the performance of these two models in interaction with different environments, and 2) find the optimal parameters for the models. Our simulations indicate that having two different learning rates for rewards and punishments increases performance in a volatile environment. Investigation of the optimal parameters shows that in almost all environments, having a higher reward learning rate than punishment learning rate is beneficial for achieving higher performance, which in this case means accumulating more rewards. Our results suggest that to achieve high performance, we need a shorter memory window for recent rewards and a longer memory window for punishments. This is consistent with the optimistic bias in human behavior.
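
The comparison can be reproduced qualitatively with a toy two-armed bandit that reverses its contingencies periodically, as sketched below; the environment layout, reversal schedule, and parameter values are illustrative choices, not the simulation protocol of the study.

```python
import numpy as np

def run_agent(alpha_reward, alpha_punish, p_good=0.8, n_trials=400,
              reversal_every=100, beta=5.0, seed=0):
    """Two-armed bandit with periodic reversals; the agent uses separate
    learning rates for positive and negative prediction errors.  Returns the
    total accumulated reward."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    good_arm, total = 0, 0.0
    for t in range(n_trials):
        if t > 0 and t % reversal_every == 0:
            good_arm = 1 - good_arm                       # volatility: contingency reversal
        p0 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        c = 0 if rng.random() < p0 else 1
        p_win = p_good if c == good_arm else 1.0 - p_good
        r = 1.0 if rng.random() < p_win else -1.0
        delta = r - q[c]
        alpha = alpha_reward if delta >= 0 else alpha_punish
        q[c] += alpha * delta
        total += r
    return total

print("symmetric    :", run_agent(alpha_reward=0.3, alpha_punish=0.3))
print("reward-biased:", run_agent(alpha_reward=0.5, alpha_punish=0.1))
```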


2020 ◽  
Author(s):  
Liyu Xia ◽  
Sarah L Master ◽  
Maria K Eckstein ◽  
Beth Baribault ◽  
Ronald E Dahl ◽  
...  

Abstract
In the real world, many relationships between events are uncertain and probabilistic. Uncertainty is also likely to be a more common feature of daily experience for youth because they have less experience to draw from than adults. Some studies suggest that probabilistic learning may be inefficient in youth compared to adults [1], while others suggest it may be more efficient in youth in mid-adolescence [2, 3]. Here we used a probabilistic reinforcement learning task to test how youth aged 8-17 (N = 187) and adults aged 18-30 (N = 110) learn about stable probabilistic contingencies. Performance increased with age through the early twenties, then stabilized. Using hierarchical Bayesian methods to fit computational reinforcement learning models, we show that all participants’ performance was better explained by models in which negative outcomes had minimal to no impact on learning. The performance increase over age was driven by 1) an increase in learning rate (i.e., a decrease in integration time horizon), and 2) a decrease in noisy/exploratory choices. In mid-adolescence (age 13-15), salivary testosterone and learning rate were positively related. We discuss our findings in the context of other studies and hypotheses about adolescent brain development.

Author summary
Adolescence is a time of great uncertainty. It is also a critical time for brain development, learning, and decision making in social and educational domains. There are currently contradictory findings about learning in adolescence. We sought to better isolate how learning from stable probabilistic contingencies changes during adolescence with a task that previously showed interesting results in adolescents. We collected a relatively large sample (297 participants) across a wide age range (8-30) to trace the adolescent developmental trajectory of learning under stable but uncertain conditions. We found that age in our sample was positively associated with higher learning rates and lower choice exploration. Within narrow age bins, we found that higher salivary testosterone levels were associated with higher learning rates in participants aged 13-15 years. These findings can help us better isolate the trajectory of maturation of core learning and decision-making processes during adolescence.
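
The equivalence noted above between a higher learning rate and a shorter integration time horizon follows from the fact that a constant-learning-rate delta rule is an exponentially weighted average of past outcomes. A short worked illustration with arbitrary example learning rates is shown below.

```python
import numpy as np

def effective_weights(alpha, n_back=10):
    """Weights that a constant learning rate alpha places on the last n_back
    outcomes of a delta-rule estimate: w_k = alpha * (1 - alpha)**k.  A higher
    alpha concentrates weight on recent trials, i.e. a shorter integration
    time horizon of roughly 1/alpha trials."""
    k = np.arange(n_back)
    return alpha * (1.0 - alpha) ** k

for alpha in (0.1, 0.5):
    w = effective_weights(alpha)
    print(f"alpha={alpha}: weight on the last 3 trials = {w[:3].sum():.2f}")
```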


2019 ◽  
Vol XVI (2) ◽  
pp. 1-11
Author(s):  
Farrukh Jamal ◽  
Hesham Mohammed Reyad ◽  
Soha Othman Ahmed ◽  
Muhammad Akbar Ali Shah ◽  
Emrah Altun

A new three-parameter continuous model called the exponentiated half-logistic Lomax distribution is introduced in this paper. Basic mathematical properties of the proposed model are investigated, including raw and incomplete moments, skewness, kurtosis, generating functions, Rényi entropy, the Lorenz, Bonferroni and Zenga curves, probability weighted moments, the stress-strength model, order statistics, and record statistics. The model parameters are estimated by maximum likelihood, and the behaviour of these estimates is examined through a simulation study. The applicability of the new model is illustrated by applying it to a real data set.
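
Maximum-likelihood estimation for a model of this kind can be done by direct numerical optimization of the log-likelihood. The sketch below assumes the common exponentiated half-logistic-G construction F(x) = [(1 - Ḡ(x)) / (1 + Ḡ(x))]^λ with a Lomax baseline Ḡ(x) = (1 + x/β)^(-α); this parameterization, the parameter names, and the synthetic data are assumptions for illustration and may differ from the definition used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def ehl_lomax_logpdf(x, lam, alpha, beta):
    """Log-density under the assumed EHL-G construction with a Lomax baseline:
    f(x) = 2*lam*g(x)*(1 - Gbar)^(lam-1) / (1 + Gbar)^(lam+1), where
    Gbar(x) = (1 + x/beta)^(-alpha) and g is the Lomax density."""
    gbar = (1.0 + x / beta) ** (-alpha)                                # Lomax survival function
    log_g = np.log(alpha / beta) - (alpha + 1.0) * np.log1p(x / beta)  # Lomax log-density
    return (np.log(2.0 * lam) + log_g
            + (lam - 1.0) * np.log1p(-gbar)
            - (lam + 1.0) * np.log1p(gbar))

def fit_mle(x):
    """Maximum-likelihood fit by numerical optimization; log-scale parameters
    keep lam, alpha, beta positive."""
    def nll(theta):
        lam, alpha, beta = np.exp(theta)
        return -np.sum(ehl_lomax_logpdf(x, lam, alpha, beta))
    res = minimize(nll, x0=np.zeros(3), method="Nelder-Mead")
    return np.exp(res.x)

rng = np.random.default_rng(0)
data = rng.pareto(3.0, size=500)          # stand-in positive data, not the paper's data set
print(fit_mle(data))
```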

