Evolution of Reinforcement Learning in Uncertain Environments: A Simple Explanation for Complex Foraging Behaviors

2002 ◽  
Vol 10 (1) ◽  
pp. 5-24 ◽  
Author(s):  
Yael Niv ◽  
Daphna Joel ◽  
Isaac Meilijson ◽  
Eytan Ruppin

Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. Using evolutionary computation techniques, we evolve (near-)optimal neuronal learning rules in a simple neural network model of reinforcement learning in bumblebees foraging for nectar. The resulting neural networks exhibit efficient reinforcement learning, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented choice strategies of risk aversion and probability matching. Additionally, risk aversion is shown to emerge even when bees are evolved in a completely risk-less environment. In contrast to existing theories in economics and game theory, risk-averse behavior is shown to be a direct consequence of (near-)optimal reinforcement learning, without requiring additional assumptions such as the existence of a nonlinear subjective utility function for rewards. Our results are corroborated by a rigorous mathematical analysis, and their robustness in real-world situations is supported by experiments with a mobile robot. Thus we provide a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching.
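
As an illustration of the probability-matching result, here is a minimal sketch, not the authors' evolved neural-network model: a delta-rule forager whose matching-style choice rule and parameters are assumptions made for illustration. Its choice proportions come to track the relative reward rates of two flower types.

```python
import random

# Minimal two-flower bandit sketch (illustrative only; not the authors'
# evolved neural-network model). The learner tracks the expected nectar
# reward of each flower type and chooses in proportion to those
# estimates, which yields probability-matching-like behavior.

def forage(p_reward=(0.8, 0.3), alpha=0.1, trials=5000, seed=0):
    rng = random.Random(seed)
    value = [0.5, 0.5]                      # initial reward estimates
    counts = [0, 0]
    for _ in range(trials):
        p_first = value[0] / (value[0] + value[1])  # matching choice rule
        choice = 0 if rng.random() < p_first else 1
        reward = 1.0 if rng.random() < p_reward[choice] else 0.0
        value[choice] += alpha * (reward - value[choice])  # delta rule
        counts[choice] += 1
    return [c / trials for c in counts]

print(forage())  # choice proportions roughly match relative reward rates
```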

2017 ◽  
Author(s):  
Anqi Liu ◽  
Cheuk Yin Mo ◽  
Mark Endel Paddrik ◽  
Steve Y. Yang

2019 ◽  
Author(s):  
A. Wiehler ◽  
K. Chakroun ◽  
J. Peters

Gambling disorder is a behavioral addiction associated with impairments in decision-making and reduced behavioral flexibility. Decision-making in volatile environments requires a flexible trade-off between exploitation of options with high expected values and exploration of novel options to adapt to changing reward contingencies. This classical problem is known as the exploration-exploitation dilemma. We hypothesized gambling disorder to be associated with a specific reduction in directed (uncertainty-based) exploration compared to healthy controls, accompanied by changes in brain activity in a fronto-parietal exploration-related network.

Twenty-three frequent gamblers and nineteen matched controls performed a classical four-armed bandit task during functional magnetic resonance imaging. Computational modeling revealed that choice behavior in both groups contained signatures of directed exploration, random exploration, and perseveration. Gamblers showed a specific reduction in directed exploration, while random exploration and perseveration were similar between groups.

Neuroimaging revealed no evidence for group differences in neural representations of expected value and reward prediction errors. Likewise, our hypothesis of attenuated fronto-parietal exploration effects in gambling disorder was not supported. However, during directed exploration, gamblers showed reduced parietal and substantia nigra / ventral tegmental area activity. Cross-validated classification analyses revealed that connectivity in an exploration-related network was predictive of clinical status, suggesting alterations in network dynamics in gambling disorder.

In sum, we show that reduced flexibility during reinforcement learning in volatile environments in gamblers is attributable to a reduction in directed exploration rather than an increase in perseveration. Neuroimaging findings suggest that patterns of network connectivity might be more diagnostic of gambling disorder than univariate value and prediction error effects. We provide a computational account of flexibility impairments in gamblers during reinforcement learning that might arise as a consequence of dopaminergic dysregulation in this disorder.
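
The three behavioral signatures the modeling identifies can be sketched in a single choice rule. The following is a hedged illustration, not the paper's fitted model: the uncertainty bonus phi implements directed exploration, the softmax temperature beta implements random exploration, and the stickiness bonus rho implements perseveration; all names and values are assumptions.

```python
import math, random

# Illustrative bandit choice rule combining the three signatures above:
# a directed-exploration uncertainty bonus (phi), random exploration via
# a softmax temperature (beta), and a perseveration bonus (rho) for
# repeating the previous choice. Names and values are assumptions.

def choose(values, uncertainties, last_choice,
           beta=3.0, phi=1.0, rho=0.5, rng=random.Random(0)):
    utilities = [v + phi * u + (rho if a == last_choice else 0.0)
                 for a, (v, u) in enumerate(zip(values, uncertainties))]
    weights = [math.exp(beta * x) for x in utilities]
    draw, acc = rng.random() * sum(weights), 0.0
    for a, w in enumerate(weights):             # sample from the softmax
        acc += w
        if draw <= acc:
            return a
    return len(weights) - 1

# Example: arm 2 has high uncertainty, so it attracts directed
# exploration even though its value estimate is the lowest.
print(choose([0.6, 0.5, 0.3, 0.4], [0.1, 0.1, 0.8, 0.1], last_choice=0))
```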


The paper presents a deep learning model for playing computer games from high-level information using reinforcement learning. The games have restricted action spaces (such as snakes, catcher, air-bandit, and so on). The implementation proceeds in three parts. The first part uses a simple neural network; the second uses a Deep Q-network; and the third, to further increase the accuracy and speed of the algorithm, uses a convolutional neural network for image processing, with fully connected layers estimating the probability of each action from the extracted features and Q-learning applied to determine the best possible move. The results are then analysed and compared to provide an overview of the improvements of each method.
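
For concreteness, the tabular Q-learning update that the deeper models build on can be sketched as follows (the paper's second and third parts replace the table with a deep Q-network and a CNN; states, actions, and parameters here are placeholders):

```python
import random
from collections import defaultdict

# Hedged sketch of tabular Q-learning, the update underlying the models
# above. States, actions, and hyperparameters are placeholders.

Q = defaultdict(lambda: defaultdict(float))      # Q[state][action] -> value

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One temporal-difference step toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def epsilon_greedy(s, actions, eps=0.1, rng=random.Random(0)):
    """Pick a random action with probability eps, else the best move."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[s][a])
```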


2014 ◽  
Vol 6 (1) ◽  
pp. 65-85 ◽  
Author(s):  
Xinjun Mao ◽  
Menggao Dong ◽  
Haibin Zhu

Development of self-adaptive systems situated in open and uncertain environments is a great challenge in software engineering because of the unpredictability of environment changes and the variety of possible self-adaptations. Explicitly specifying expected changes and the corresponding self-adaptations at design time, an approach often adopted by developers, is largely ineffective. This paper presents an agent-based approach that combines two-layer self-adaptation mechanisms with reinforcement learning to support the development and operation of self-adaptive systems. The approach models self-adaptive systems as multi-agent organizations and enables each agent to make self-adaptation decisions by learning at run-time and at different levels. The proposed self-adaptation mechanisms, based on organization metaphors, enable self-adaptation at two layers: a fine-grained behavior level and a coarse-grained organization level. Corresponding reinforcement learning algorithms are designed and integrated with the two-layer self-adaptation mechanisms. The paper further details development technologies for building self-adaptive systems with this approach, including an extended software architecture for self-adaptation, an implementation framework, and a development process. A case study and experimental evaluations illustrate the effectiveness of the proposed approach.
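
The two-layer idea can be sketched as follows. This is an illustrative assumption of how the two learners might be organized, not the paper's API: one Q-learner selects fine-grained behaviors within the agent's current role, and a second decides coarse-grained organization-level adaptations such as role changes.

```python
import random
from collections import defaultdict

# Illustrative two-layer learning agent (names and structure are
# assumptions, not the paper's API). Behavior-level and organization-
# level decisions are learned with simple one-step Q-value updates.

class TwoLayerAgent:
    def __init__(self, roles, behaviors, alpha=0.1, eps=0.1, seed=0):
        self.roles, self.behaviors = roles, behaviors
        self.alpha, self.eps = alpha, eps
        self.rng = random.Random(seed)
        self.beh_q = defaultdict(float)   # (state, role, behavior) -> value
        self.role_q = defaultdict(float)  # (state, role) -> value

    def _choose(self, options, score):
        if self.rng.random() < self.eps:          # occasional exploration
            return self.rng.choice(options)
        return max(options, key=score)

    def act(self, state, role):                   # fine-grain behavior level
        return self._choose(self.behaviors,
                            lambda b: self.beh_q[(state, role, b)])

    def adapt(self, state):                       # coarse-grain organization level
        return self._choose(self.roles,
                            lambda r: self.role_q[(state, r)])

    def learn(self, table, key, reward):          # one-step Q update
        table[key] += self.alpha * (reward - table[key])
```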


2007 ◽  
Vol 19 (6) ◽  
pp. 1468-1502 ◽  
Author(s):  
Răzvan V. Florian

The persistent modification of synaptic efficacy as a function of the relative timing of pre- and postsynaptic spikes is a phenomenon known as spike-timing-dependent plasticity (STDP). Here we show that the modulation of STDP by a global reward signal leads to reinforcement learning. We first analytically derive learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity, by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons. These rules have several features common to plasticity mechanisms experimentally found in the brain. We then demonstrate in simulations of networks of integrate-and-fire neurons the efficacy of two simple learning rules involving modulated STDP. One rule is a direct extension of the standard STDP model (modulated STDP), and the other involves an eligibility trace stored at each synapse that keeps a decaying memory of recent pre- and postsynaptic spike pairings (modulated STDP with eligibility trace). This latter rule permits learning even if the reward signal is delayed. The proposed rules are able to solve the XOR problem with both rate-coded and temporally coded input and to learn a target output firing-rate pattern. These learning rules are biologically plausible, can be used to train generic artificial spiking neural networks regardless of the neuron model used, and suggest experimental investigation of the existence of reward-modulated STDP in animals.
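
The second rule, modulated STDP with an eligibility trace, can be sketched as follows. Time constants and gains are illustrative assumptions, not the paper's derived values: each synapse accumulates a decaying trace of recent spike pairings, and a global reward signal gates the conversion of that trace into a weight change, which is what allows learning with delayed reward.

```python
import math

# Sketch of reward-modulated STDP with an eligibility trace. All
# constants below are illustrative assumptions, not fitted values.

TAU_PLUS, TAU_MINUS = 20.0, 20.0   # STDP time constants (ms)
A_PLUS, A_MINUS = 1.0, 1.0         # potentiation / depression gains
TAU_E = 200.0                      # eligibility-trace decay (ms)

def stdp_kernel(dt):
    """Classical STDP window: dt = t_post - t_pre (ms)."""
    if dt >= 0:
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    return -A_MINUS * math.exp(dt / TAU_MINUS)

def step(eligibility, weight, spike_pairs, reward, dt_step=1.0, lr=0.01):
    """One simulation step: decay the trace, add new spike pairings,
    and let the (possibly delayed) reward gate the weight change."""
    eligibility *= math.exp(-dt_step / TAU_E)
    for t_pre, t_post in spike_pairs:
        eligibility += stdp_kernel(t_post - t_pre)
    weight += lr * reward * eligibility
    return eligibility, weight
```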


2020 ◽  
Vol 12 (22) ◽  
pp. 3789 ◽
Author(s):  
Bo Li ◽  
Zhigang Gan ◽  
Daqing Chen ◽  
Dyachenko Sergey Aleksandrovich

This paper combines deep reinforcement learning (DRL) with meta-learning and proposes a novel approach, named meta twin delayed deep deterministic policy gradient (Meta-TD3), to realize the control of an unmanned aerial vehicle (UAV), allowing a UAV to quickly track a target in an environment where the target's motion is uncertain. This approach can be applied to a variety of scenarios, such as wildlife protection, emergency aid, and remote sensing. We consider a multi-task experience replay buffer to provide data for the multi-task learning of the DRL algorithm, and we combine meta-learning to develop a multi-task reinforcement learning update method that ensures the generalization capability of reinforcement learning. Compared with the state-of-the-art algorithms, namely the deep deterministic policy gradient (DDPG) and twin delayed deep deterministic policy gradient (TD3), experimental results show that the Meta-TD3 algorithm achieves a substantial improvement in both convergence value and convergence rate. In the UAV target-tracking problem, Meta-TD3 requires only a few training steps to enable a UAV to adapt quickly to a new target movement mode and maintain better tracking effectiveness.
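
The meta-update over a multi-task replay buffer can be sketched schematically. The following is an assumption-laden illustration rather than the paper's exact algorithm: TD3's critic and actor losses are abstracted into a single inner update, each task is a target-motion mode with its own buffer, and the outer step is Reptile-style.

```python
import random

# Minimal, assumption-laden sketch of the meta-update idea (not the
# paper's exact algorithm): inner_update stands in for one TD3 gradient
# step, and each task's replay buffer stores precomputed gradients.

def inner_update(params, grads, lr=0.01):
    # Stand-in for one TD3 critic/actor gradient step on a mini-batch.
    return [p - lr * g for p, g in zip(params, grads)]

def meta_train(params, task_buffers, meta_lr=0.1, inner_steps=5,
               iters=100, rng=random.Random(0)):
    for _ in range(iters):
        buffer = rng.choice(task_buffers)        # pick one task's buffer
        adapted = list(params)
        for _ in range(inner_steps):
            grads = rng.choice(buffer)           # sample stored experience
            adapted = inner_update(adapted, grads)
        # Reptile-style meta-step toward the task-adapted parameters.
        params = [p + meta_lr * (a - p) for p, a in zip(params, adapted)]
    return params
```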

