Compositional generalization in multi-armed bandits

2021 ◽  
Author(s):  
Tankred Saanum ◽  
Eric Schulz ◽  
Maarten Speekenbrink

To what extent do human reward learning and decision-making rely on the ability to represent and generate richly structured relationships between options? We provide evidence that structure learning and the principle of compositionality play crucial roles in human reinforcement learning. In a new multi-armed bandit paradigm, we found evidence that participants are able to learn representations of different reward structures and combine them to make correct generalizations about options in novel contexts. Moreover, we found substantial evidence that participants transferred knowledge of simpler reward structures to make compositional generalizations about rewards in complex contexts. This allowed participants to accumulate more rewards earlier, and to explore less whenever such knowledge transfer was possible. We also provide a computational model which is able to generalize and compose knowledge for complex reward structures. This model describes participant behaviour in the compositional generalization task better than various other models of decision-making and transfer learning.
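
As a rough illustration of the compositional idea (not the authors' actual model), the sketch below assumes that rewards in two simple contexts follow a linear and a periodic structure respectively, fits each structure from noisy samples, and predicts the composed context by summing the two fitted components before choosing an option. The arm layout, function forms, and sample sizes are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: arms lie on a line; rewards in "simple" contexts follow
# either a linear or a periodic structure, and the "composed" context is their sum.
# An agent that has fit both simple structures can generalize compositionally
# by adding its two learned predictors before ever sampling the composed arms.

rng = np.random.default_rng(0)
arms = np.linspace(0, 1, 8)                       # arm positions
linear = lambda x: 10 * x                         # simple structure A
periodic = lambda x: 5 * np.sin(2 * np.pi * x)    # simple structure B

lin_basis = lambda x: np.column_stack([np.ones_like(x), x])
per_basis = lambda x: np.column_stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])

def fit_structure(reward_fn, basis, n_samples=50, noise=1.0):
    """Least-squares fit of one reward structure from noisy samples of random arms."""
    idx = rng.integers(len(arms), size=n_samples)
    y = reward_fn(arms[idx]) + rng.normal(0, noise, n_samples)
    X = basis(arms[idx])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

w_lin = fit_structure(linear, lin_basis)
w_per = fit_structure(periodic, per_basis)

# Compositional generalization: predict the composed context as the sum of the
# two learned components, then pick the arm with the highest predicted reward.
composed_prediction = lin_basis(arms) @ w_lin + per_basis(arms) @ w_per
best_arm = int(np.argmax(composed_prediction))
true_best = int(np.argmax(linear(arms) + periodic(arms)))
print("predicted best arm:", best_arm, "true best arm:", true_best)
```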

Symmetry ◽  
2020 ◽  
Vol 12 (4) ◽  
pp. 631
Author(s):  
Chunyang Hu

In this paper, deep reinforcement learning (DRL) and knowledge transfer are used to achieve effective control of the learning agent for confrontation in multi-agent systems. Firstly, a multi-agent Deep Deterministic Policy Gradient (DDPG) algorithm with parameter sharing is proposed for multi-agent confrontation decision-making. During training, information from the other agents is introduced into the critic network to improve the confrontation strategy, and the parameter-sharing mechanism reduces the overhead of experience storage. In the DDPG algorithm, we use four neural networks to generate the real-time actions and the Q-value function, and a momentum mechanism to optimize the training process and accelerate the convergence of the neural networks. Secondly, this paper introduces an auxiliary controller using a policy-based reinforcement learning (RL) method to provide assistant decision-making for the game agent. In addition, an effective reward function is used to help agents balance the losses of the enemy side and their own. Furthermore, this paper also uses a knowledge transfer method to extend the learning model to more complex scenes and improve the generalization of the proposed confrontation model. Two confrontation decision-making experiments are designed to verify the effectiveness of the proposed method: in a small-scale task scenario, the trained agent successfully learns to fight its competitors and achieves a good winning rate, while in large-scale confrontation scenarios, the knowledge transfer method gradually improves the decision-making level of the learning agent.
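
A minimal sketch of the architectural idea described above is given below: a centralized critic that receives the other agents' observations and actions, a single actor whose parameters are shared by every agent, target networks (four networks in total), and SGD with momentum standing in for the momentum mechanism. Network sizes, dimensions, and the training details are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_AGENTS = 10, 2, 3

class SharedActor(nn.Module):
    """One actor network whose parameters are shared by all agents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM), nn.Tanh())
    def forward(self, obs):                       # obs: (..., OBS_DIM)
        return self.net(obs)

class CentralCritic(nn.Module):
    """Q(s_1..s_N, a_1..a_N): other agents' information enters the critic."""
    def __init__(self):
        super().__init__()
        joint = N_AGENTS * (OBS_DIM + ACT_DIM)
        self.net = nn.Sequential(nn.Linear(joint, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, all_obs, all_act):          # (batch, N_AGENTS, dim)
        x = torch.cat([all_obs.flatten(1), all_act.flatten(1)], dim=1)
        return self.net(x)

actor, critic = SharedActor(), CentralCritic()
actor_tgt, critic_tgt = SharedActor(), CentralCritic()   # target networks
actor_tgt.load_state_dict(actor.state_dict())
critic_tgt.load_state_dict(critic.state_dict())

# SGD with momentum stands in for the "momentum mechanism" mentioned above.
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-3, momentum=0.9)

# One illustrative critic update on a fake batch (replay contents are made up).
obs = torch.randn(32, N_AGENTS, OBS_DIM)
act = torch.randn(32, N_AGENTS, ACT_DIM)
rew = torch.randn(32, 1)
next_obs = torch.randn(32, N_AGENTS, OBS_DIM)
target_q = rew + 0.99 * critic_tgt(next_obs, actor_tgt(next_obs))
loss = nn.functional.mse_loss(critic(obs, act), target_q.detach())
opt_critic.zero_grad(); loss.backward(); opt_critic.step()
```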


Author(s):  
Zhenhai Gao ◽  
Xiangtong Yan ◽  
Fei Gao ◽  
Lei He

Decision-making is one of the key parts of research on longitudinal autonomous driving, and accounting for the behaviour of human drivers when designing autonomous driving decision-making strategies is a current research hotspot. In longitudinal autonomous driving, traditional rule-based decision-making strategies are difficult to apply to complex scenarios, and current reinforcement learning and deep reinforcement learning methods rely on reward functions designed around safety, comfort, and economy; the resulting decision strategies still differ considerably from those of human drivers. To address these problems, this paper uses driver behaviour data to design the reward function of a deep reinforcement learning algorithm by fitting a BP neural network, and applies the DQN and DDPG algorithms to establish two driver-like longitudinal autonomous driving decision-making models. Simulation experiments compare the decision-making behaviour of the two models with the driver curve. The results show that both algorithms can realize driver-like decision-making, and that the DDPG algorithm is more consistent with human driver behaviour than the DQN algorithm and therefore performs better.
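
The reward-shaping step can be sketched roughly as follows: a small feed-forward (BP) network is fit to recorded driver behaviour and then reused as the reward function inside the RL training loop. The feature set, data, and network size here are invented for illustration and are not the paper's actual setup.

```python
import torch
import torch.nn as nn

# Hypothetical driver data: columns could be ego speed, gap to the lead
# vehicle, and relative speed; the target is a driver "score" for that state.
driver_states = torch.rand(500, 3)
driver_scores = torch.rand(500, 1)

# Small back-propagation (BP) network fitted to the driver data.
reward_net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

for _ in range(200):                      # supervised fit to the driver data
    pred = reward_net(driver_states)
    loss = nn.functional.mse_loss(pred, driver_scores)
    opt.zero_grad(); loss.backward(); opt.step()

def reward(state):
    """Driver-like reward used inside the DQN/DDPG loop instead of a hand-tuned one."""
    with torch.no_grad():
        return reward_net(torch.as_tensor(state, dtype=torch.float32)).item()

print(reward([0.5, 0.3, 0.1]))            # example query for one state
```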


2021 ◽  
Author(s):  
Monja P. Neuser ◽  
Franziska Kräutlein ◽  
Anne Kühnel ◽  
Vanessa Teckentrup ◽  
Jennifer Svaldi ◽  
...  

Reinforcement learning is a core facet of motivation, and alterations in it have been associated with various mental disorders. To build better models of individual learning, repeated measurement of value-based decision-making is crucial. However, the focus on lab-based assessment of reward learning has limited the number of measurements, and the test-retest reliability of many decision-related parameters is therefore unknown. Here, we developed Influenca, an open-source cross-platform application that provides a novel reward learning task complemented by ecological momentary assessment (EMA) for repeated assessment over weeks. In this task, players have to identify the most effective medication by selecting the best option after integrating offered points with changing probabilities (following random Gaussian walks). Participants can complete up to 31 levels with 150 trials each. To encourage replay on their preferred device, in-game screens provide feedback on their progress. Using an initial validation sample of 127 players (2904 runs), we found that reinforcement learning parameters such as the learning rate and reward sensitivity show low to medium intra-class correlations (ICC: 0.22-0.52), indicating substantial within- and between-subject variance. Notably, state items showed ICCs comparable to those of the reinforcement learning parameters. To conclude, our innovative and openly customizable app framework provides a gamified task that optimizes repeated assessment of reward learning to better quantify intra- and inter-individual differences in value-based decision-making over time.
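
A minimal sketch of the kind of learner whose parameters are estimated per run is shown below: a delta-rule model with a learning rate and a reward-sensitivity (inverse temperature) parameter in a two-option task whose reward probabilities drift as Gaussian random walks, following the task description. The exact parameterization used in Influenca may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_run(alpha=0.3, beta=5.0, n_trials=150):
    """One simulated run: delta-rule values, softmax choices, drifting rewards."""
    p = np.array([0.5, 0.5])          # reward probabilities, drifting over time
    q = np.zeros(2)                   # learned option values
    choices, rewards = [], []
    for _ in range(n_trials):
        logits = beta * q             # reward sensitivity scales value differences
        prob_right = 1.0 / (1.0 + np.exp(-(logits[1] - logits[0])))
        c = int(rng.random() < prob_right)
        r = float(rng.random() < p[c])
        q[c] += alpha * (r - q[c])    # delta-rule update with learning rate alpha
        p = np.clip(p + rng.normal(0, 0.05, 2), 0.05, 0.95)  # Gaussian random walk
        choices.append(c); rewards.append(r)
    return np.array(choices), np.array(rewards)

choices, rewards = simulate_run()
print("mean reward:", rewards.mean())
```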


2018 ◽  
Author(s):  
Joanne C. Van Slooten ◽  
Sara Jahfari ◽  
Tomas Knapen ◽  
Jan Theeuwes

Pupil responses have been used to track cognitive processes during decision-making. Studies have shown that in these cases the pupil reflects the joint activation of many cortical and subcortical brain regions, including those traditionally implicated in value-based learning. However, how the pupil tracks value-based decisions and reinforcement learning is unknown. We combined a reinforcement learning task with a computational model to study pupil responses during value-based decisions and decision evaluations. We found that the pupil closely tracks reinforcement learning both across trials and across participants. Prior to choice, the pupil dilated as a function of trial-by-trial fluctuations in value beliefs. After feedback, early dilation scaled with value uncertainty, whereas later constriction scaled with reward prediction errors. Our computational approach systematically implicates the pupil in value-based decisions and the subsequent processing of violated value beliefs. These dissociable influences provide an exciting possibility to non-invasively study ongoing reinforcement learning in the pupil.
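
The analysis pipeline can be illustrated roughly as follows: a simple delta-rule learner yields, for each trial, a pre-choice value belief, an uncertainty signal at feedback, and a reward prediction error, which can then be used as regressors for the pupil time course. The study's actual model is more elaborate, and the uncertainty proxy and task parameters below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n_trials = 0.2, 200
q = np.zeros(2)                                  # option values in [0, 1]
regressors = []

for _ in range(n_trials):
    c = int(rng.random() < 0.5)                  # random choices, for illustration
    value_belief = q[c]                          # pre-choice belief about the chosen option
    r = float(rng.random() < (0.7 if c == 1 else 0.3))
    rpe = r - q[c]                               # reward prediction error at feedback
    uncertainty = q[c] * (1 - q[c])              # Bernoulli-variance proxy for value uncertainty
    q[c] += alpha * rpe                          # delta-rule update
    regressors.append((value_belief, uncertainty, rpe))

regressors = np.array(regressors)                # columns: belief, uncertainty, RPE
print(regressors.mean(axis=0))
```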


Information ◽  
2019 ◽  
Vol 10 (11) ◽  
pp. 341 ◽  
Author(s):  
Hu ◽  
Xu

Multi-robot confrontation on physics-based simulators is a complex and time-consuming task, but simulators are required to evaluate the performance of advanced algorithms. Recently, a few advanced algorithms have been able to handle considerably complex situations in the robot confrontation system when agents face multiple opponents. Meanwhile, the current confrontation decision-making system suffers from difficulties in optimization and generalization. In this paper, fuzzy reinforcement learning (RL) and curriculum transfer learning are applied to micromanagement in the robot confrontation system. Firstly, an improved Q-learning algorithm in the semi-Markov decision process is designed to train the agent, and an efficient RL model is defined to avoid the curse of dimensionality. Secondly, a multi-agent RL algorithm with parameter sharing is proposed to train the agents. We use a neural network with adaptive momentum acceleration as a function approximator to estimate the state-action function. Then, fuzzy logic is used to regulate the learning rate of RL. Thirdly, a curriculum transfer learning method is used to extend the RL model to more difficult scenarios, which ensures the generalization of the decision-making system. The experimental results show that the proposed method is effective.
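
The fuzzy regulation of the learning rate can be sketched as below: a tabular Q-learner whose step size is blended between a low and a high value according to a simple membership function of the current TD-error magnitude (large errors push the rate up, small errors pull it down). The membership function, rate bounds, and toy task are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 25, 4, 0.95
Q = np.zeros((n_states, n_actions))

def fuzzy_learning_rate(td_error, low=0.05, high=0.5):
    """Blend a low and a high rate by a triangular-style membership value in [0, 1]."""
    membership = min(abs(td_error), 1.0)
    return (1 - membership) * low + membership * high

state = 0
for step in range(2000):
    # epsilon-greedy action selection
    action = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[state]))
    next_state = (state + action + 1) % n_states           # toy transition dynamics
    reward = 1.0 if next_state == n_states - 1 else 0.0    # toy sparse reward
    td_error = reward + gamma * Q[next_state].max() - Q[state, action]
    Q[state, action] += fuzzy_learning_rate(td_error) * td_error
    state = next_state

print("max Q-value:", Q.max())
```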


2020 ◽  
Author(s):  
Joana Carvalheiro ◽  
Vasco A. Conceição ◽  
Ana Mesquita ◽  
Ana Seara-Cardoso

Acute stress is ubiquitous in everyday life, but the extent to which it affects how people learn from the outcomes of their choices is still poorly understood. Here, we investigate how acute stress impacts reward and punishment learning in men using a reinforcement-learning task. Sixty-two male participants performed the task under stress and control conditions. We observed that acute stress impaired participants' choice performance towards monetary gains, but not losses. To unravel the mechanism(s) underlying this impairment, we fitted a reinforcement-learning model to participants' trial-by-trial choices. Computational modeling indicated that under acute stress participants learned more slowly from positive prediction errors (when outcomes were better than expected), consistent with stress-induced dopamine disruptions. Such mechanistic understanding of how acute stress impairs reward learning is particularly important given the pervasiveness of stress in daily life and the impact that stress can have on wellbeing and mental health.
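
The mechanistic account can be illustrated with a model that uses separate learning rates for positive and negative prediction errors, so that acute stress appears as a selective reduction of the positive-error rate. The parameter values below are purely illustrative and are not the fitted estimates from the study.

```python
def update_value(q, reward, alpha_pos=0.3, alpha_neg=0.3):
    """Return the updated value after one outcome, using asymmetric learning rates."""
    rpe = reward - q                       # reward prediction error
    alpha = alpha_pos if rpe > 0 else alpha_neg
    return q + alpha * rpe

# Under stress, learning from better-than-expected outcomes slows down:
q_control = update_value(0.5, 1.0, alpha_pos=0.30)   # control condition -> 0.65
q_stress  = update_value(0.5, 1.0, alpha_pos=0.10)   # stress: smaller alpha_pos -> 0.55
print(q_control, q_stress)
```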


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0244434
Author(s):  
Jianhong Zhu ◽  
Junya Hashimoto ◽  
Kentaro Katahira ◽  
Makoto Hirakawa ◽  
Takashi Nakao

The value learning process has been investigated using decision-making tasks with a correct answer specified by the external environment (externally guided decision-making, EDM). In EDM, people are required to adjust their choices based on feedback, and the learning process is generally explained by the reinforcement learning (RL) model. In addition to EDM, value is also learned through internally guided decision-making (IDM), such as preference judgment, in which no correct answer defined by external circumstances is available. In IDM, the value of the chosen item is believed to increase and that of the rejected item to decrease (choice-induced preference change; CIPC). An RL-based model called the choice-based learning (CBL) model has been proposed to describe CIPC, in which the values of chosen and/or rejected items are updated as if one's own choice were the correct answer. However, the validity of the CBL model has not been confirmed by fitting it to IDM behavioral data. The present study examines the CBL model in IDM. We conducted simulations and a preference judgment task with novel contour shapes, and applied computational model analyses to the behavioral data. The results showed that the CBL model in which both the chosen and the rejected item's values are updated fit the IDM behavioral data better than the other candidate models. Although previous studies using subjective preference ratings have repeatedly reported changes in the value of only the chosen or only the rejected item, we demonstrate for the first time, based solely on IDM choice behavior and computational model analyses, that the values of both items change.
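
The core CBL update can be sketched as follows: after each preference choice, the chosen item's value moves toward 1 and the rejected item's value toward 0, as if the choice itself were corrective feedback. The learning rates, softmax temperature, and number of items below are illustrative assumptions rather than the study's fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(4)

def cbl_update(values, chosen, rejected, alpha_c=0.2, alpha_r=0.2):
    """Update both the chosen and the rejected item's value after an IDM choice."""
    values = values.copy()
    values[chosen]   += alpha_c * (1.0 - values[chosen])    # chosen treated as "correct"
    values[rejected] += alpha_r * (0.0 - values[rejected])  # rejected treated as "incorrect"
    return values

def choose(values, pair, beta=3.0):
    """Softmax preference choice between two items; returns (chosen, rejected)."""
    a, b = pair
    p_a = 1.0 / (1.0 + np.exp(-beta * (values[a] - values[b])))
    return (a, b) if rng.random() < p_a else (b, a)

values = np.full(6, 0.5)                      # six novel shapes, neutral starting values
for _ in range(100):
    pair = rng.choice(6, size=2, replace=False)
    chosen, rejected = choose(values, pair)
    values = cbl_update(values, chosen, rejected)

print("final item values:", np.round(values, 2))
```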



Author(s):  
Syed Ihtesham Hussain Shah ◽  
Antonio Coronato

Reinforcement Learning (RL) methods provide a solution for decision-making problems under uncertainty: an agent finds a suitable policy, guided by a reward function, by interacting with a dynamic environment. However, for complex and large problems it is very difficult to specify and tune the reward function. Inverse Reinforcement Learning (IRL) can mitigate this problem by learning the reward function from expert demonstrations. This work exploits an IRL method named the Max-Margin Algorithm (MMA) to learn the reward function for a robotic navigation problem. The learned reward function explains the demonstrated (expert) policy better than all other policies. Results show that the method converges well and that the reward functions it learns represent expert behavior more efficiently.
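
A simplified sketch of the max-margin idea is given below, assuming a reward linear in state features, R(s) = w·phi(s), and taking the weight vector along the direction that separates the expert's discounted feature expectations from a candidate policy's. The full MMA iterates this against successively better policies; that loop, and the toy features and trajectories here, are omitted or invented for illustration.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.95):
    """Discounted average feature counts over a set of state trajectories."""
    mu = np.zeros(phi(trajectories[0][0]).shape)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

phi = lambda s: np.array([s[0], s[1], 1.0])          # toy 2-D state features plus bias
expert_trajs = [[(0.9, 0.1), (0.8, 0.2)], [(1.0, 0.0)]]   # hypothetical expert demonstrations
policy_trajs = [[(0.2, 0.7), (0.3, 0.6)], [(0.1, 0.9)]]   # hypothetical candidate policy

mu_expert = feature_expectations(expert_trajs, phi)
mu_policy = feature_expectations(policy_trajs, phi)

# Max-margin direction: weights point from the candidate policy's feature
# expectations toward the expert's, so the expert scores strictly higher.
w = mu_expert - mu_policy
w /= np.linalg.norm(w)
print("learned reward weights:", w, "margin:", w @ (mu_expert - mu_policy))
```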

