Reinforcement Learning for Multiple HAPS/UAV Coordination: Impact of Exploration–Exploitation Dilemma on Convergence

Author(s):  
Ogbonnaya Anicho
Philip B. Charlesworth
Gurvinder S. Baicher
Atulya K. Nagar

2019
Author(s):
Erik J Peterson
Timothy D Verstynen

The exploration-exploitation dilemma is one of the fundamental problems in reinforcement learning and is generally regarded as mathematically intractable. In this paper we prove that the key to a tractable solution is an unintuitive one: to explore without considering reward value at all. We redefine exploration as having no objective but learning itself. Through theory and experiments we show that this view leads to an exact, deterministic solution to the dilemma, based on the well-known win-stay, lose-switch strategy from game theory. The solution rests on our conjecture that information and reward are equally valuable for survival. Besides offering a mathematical answer, this view appears more robust than traditional approaches because it succeeds in difficult conditions where rewards are sparse, deceptive, or non-stationary.
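
Read computationally, the proposal amounts to running two single-minded policies, one that values only information and one that values only reward, and arbitrating between them deterministically. The following is a minimal sketch of that idea on a toy Bernoulli bandit; the information measure (entropy reduction of each arm's estimate), the bandit itself, and all names are illustrative assumptions, not the authors' implementation.

    # Sketch (assumed, not the paper's code): a win-stay, lose-switch arbiter
    # between a pure-exploration policy that values only information gain and a
    # pure-exploitation policy that values only reward.
    import math
    import random

    random.seed(0)
    N_ARMS = 3
    TRUE_P = [0.2, 0.5, 0.8]                    # hidden Bernoulli reward probabilities
    counts = [[1, 1] for _ in range(N_ARMS)]    # Beta(1,1) pseudo-counts per arm
    values = [0.0] * N_ARMS                     # running mean reward per arm
    pulls = [0] * N_ARMS

    def entropy(a, b):
        p = a / (a + b)
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))

    def info_gain(arm, outcome):
        """Entropy reduction of the arm's estimate after observing one outcome."""
        a, b = counts[arm]
        return abs(entropy(a, b) - entropy(a + outcome, b + (1 - outcome)))

    mode = "explore"                            # start curious
    for t in range(2000):
        if mode == "explore":
            # pure exploration: pick the arm whose estimate is most uncertain
            arm = max(range(N_ARMS), key=lambda k: entropy(*counts[k]))
        else:
            # pure exploitation: pick the arm with the best empirical reward
            arm = max(range(N_ARMS), key=lambda k: values[k])
        r = 1 if random.random() < TRUE_P[arm] else 0
        g = info_gain(arm, r)
        counts[arm][0] += r
        counts[arm][1] += 1 - r
        pulls[arm] += 1
        values[arm] += (r - values[arm]) / pulls[arm]
        # win-stay, lose-switch: stay with whichever signal "won" on this step
        mode = "explore" if g > r else "exploit"

    print("pulls:", pulls, "value estimates:", [round(v, 2) for v in values])

As information gain decays with experience, the arbiter naturally hands control to exploitation, which is in the spirit of the deterministic solution the abstract describes.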


Author(s):  
Todd M. Gureckis ◽  
Bradley C. Love

Reinforcement learning (RL) refers to the scientific study of how animals and machines adapt their behavior in order to maximize reward. The history of RL research can be traced to early work in psychology on instrumental learning behavior. However, the modern field of RL is a highly interdisciplinary area that lies at the intersection of ideas in computer science, machine learning, psychology, and neuroscience. This chapter summarizes the key mathematical ideas underlying the field, including the exploration/exploitation dilemma, temporal-difference (TD) learning, Q-learning, and model-based versus model-free learning. In addition, it provides a broad survey of open questions in psychology and neuroscience.
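
For readers unfamiliar with the formal ideas listed above, the core of model-free RL fits in a few lines of code: an epsilon-greedy choice (the exploration/exploitation dilemma) plus a temporal-difference update of action values (Q-learning). The sketch below uses a tiny five-state chain environment chosen purely for illustration; it is standard textbook Q-learning, not code from the chapter.

    # Tabular Q-learning with epsilon-greedy exploration on a small chain task.
    import random

    random.seed(1)
    N_STATES, N_ACTIONS = 5, 2                  # states 0..4; action 1 moves right
    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

    def step(s, a):
        """Move left (a=0) or right (a=1); reward 1 only on reaching the last state."""
        s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
        done = s_next == N_STATES - 1
        return s_next, (1.0 if done else 0.0), done

    for episode in range(200):
        s, done = 0, False
        while not done:
            if random.random() < EPSILON:
                a = random.randrange(N_ACTIONS)                     # explore
            else:
                a = max(range(N_ACTIONS), key=lambda k: Q[s][k])    # exploit
            s_next, r, done = step(s, a)
            # temporal-difference (Q-learning) update toward the bootstrapped target
            target = r + (0.0 if done else GAMMA * max(Q[s_next]))
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s_next

    print([round(max(row), 2) for row in Q])    # learned values rise toward the goal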


2002
Vol 10 (1)
pp. 5-24
Author(s):
Yael Niv
Daphna Joel
Isaac Meilijson
Eytan Ruppin

Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. Using evolutionary computation techniques we evolve (near-)optimal neuronal learning rules in a simple neural network model of reinforcement learning in bumblebees foraging for nectar. The resulting neural networks exhibit efficient reinforcement learning, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented choice strategies of risk aversion and probability matching. Additionally, risk aversion is shown to emerge even when bees are evolved in a completely risk-less environment. In contrast to existing theories in economics and game theory, risk-averse behavior is shown to be a direct consequence of (near-)optimal reinforcement learning, without requiring additional assumptions such as the existence of a nonlinear subjective utility function for rewards. Our results are corroborated by a rigorous mathematical analysis, and their robustness in real-world situations is supported by experiments in a mobile robot. Thus we provide a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching.
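
The probability-matching result can be illustrated with a far simpler stand-in than the evolved networks in the paper: a single reward-modulated weight per flower type, with choice proportional to the weights. The rule and parameters below are assumptions for illustration only, not the evolved synaptic dynamics.

    # Illustrative two-flower foraging bandit; weights track reward rates and
    # proportional choice then yields probability-matching-like behaviour.
    import random

    random.seed(2)
    P_REWARD = {"blue": 0.8, "yellow": 0.2}     # nectar probability per flower type
    w = {"blue": 0.5, "yellow": 0.5}            # "synaptic" weights driving choice
    ETA = 0.05                                  # plasticity rate
    chosen = {"blue": 0, "yellow": 0}

    def choose():
        total = w["blue"] + w["yellow"]
        return "blue" if random.random() < w["blue"] / total else "yellow"

    for trial in range(5000):
        flower = choose()
        reward = 1.0 if random.random() < P_REWARD[flower] else 0.0
        # reward-modulated update: pull the chosen weight toward the outcome
        w[flower] += ETA * (reward - w[flower])
        chosen[flower] += 1

    print("fraction of blue choices:", round(chosen["blue"] / sum(chosen.values()), 2))

Because each weight converges to its flower's reward probability and choices are proportional to the weights, the bee ends up choosing blue on roughly 80% of trials rather than always, which is the matching behaviour the abstract refers to.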


Author(s):  
Qiming Fu
Quan Liu
Shan Zhong
Heng Luo
Hongjie Wu
...  

In reinforcement learning (RL), the exploration/exploitation (E/E) dilemma is a crucial issue: the agent must balance exploring the environment to find more profitable actions against exploiting the empirically best action for the current state. We focus on the single-trajectory RL problem, in which an agent interacts with a partially unknown MDP over a single trajectory, and address the E/E dilemma in this setting. Given the reward function, we seek a good E/E strategy for MDPs drawn from some MDP distribution. This is achieved by selecting, from a large set of candidate strategies, the one with the best mean performance over the MDP distribution, using single trajectories drawn from many MDPs. In this paper we make the following contributions: (1) we describe a strategy-selection algorithm based on a formula set and polynomial functions; (2) we provide a theoretical and experimental regret analysis of the learned strategy under a given MDP distribution; and (3) we compare these methods experimentally with a state-of-the-art Bayesian RL method.
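
The selection step described above can be pictured with a small stand-in experiment: treat each candidate E/E strategy as an index formula, score every candidate by its mean single-trajectory return over problems sampled from a distribution, and keep the best. The sketch below uses UCB-style indices over Bernoulli bandits purely as a placeholder for the paper's formula-set and polynomial strategies.

    # Offline strategy selection over a problem distribution (illustrative only).
    import math
    import random

    random.seed(3)

    def run_strategy(c, arm_probs, horizon=300):
        """One single-trajectory run of an index strategy with exploration weight c."""
        n = len(arm_probs)
        pulls, means, total = [0] * n, [0.0] * n, 0.0
        for t in range(1, horizon + 1):
            idx = [means[k] + c * math.sqrt(math.log(t) / pulls[k]) if pulls[k]
                   else float("inf") for k in range(n)]
            a = max(range(n), key=lambda k: idx[k])
            r = 1.0 if random.random() < arm_probs[a] else 0.0
            pulls[a] += 1
            means[a] += (r - means[a]) / pulls[a]
            total += r
        return total

    candidates = [0.0, 0.1, 0.5, 1.0, 2.0]      # candidate exploration weights
    problems = [[random.random() for _ in range(5)] for _ in range(200)]  # sampled problems
    mean_return = {c: sum(run_strategy(c, p) for p in problems) / len(problems)
                   for c in candidates}
    print("selected strategy:", max(mean_return, key=mean_return.get))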


2011
Vol 23 (7)
pp. 1587-1596
Author(s):
Marieke Jepma
Sander Nieuwenhuis

The adaptive regulation of the balance between exploitation and exploration is critical for the optimization of behavioral performance. Animal research and computational modeling have suggested that changes in exploitative versus exploratory control state in response to changes in task utility are mediated by the neuromodulatory locus coeruleus–norepinephrine (LC–NE) system. Recent studies have suggested that utility-driven changes in control state correlate with pupil diameter, and that pupil diameter can be used as an indirect marker of LC activity. We measured participants' pupil diameter while they performed a gambling task with a gradually changing payoff structure. Each choice in this task can be classified as exploitative or exploratory using a computational model of reinforcement learning. We examined the relationship between pupil diameter, task utility, and choice strategy (exploitation vs. exploration), and found that (i) exploratory choices were preceded by a larger baseline pupil diameter than exploitative choices; (ii) individual differences in baseline pupil diameter were predictive of an individual's tendency to explore; and (iii) changes in pupil diameter surrounding the transition between exploitative and exploratory choices correlated with changes in task utility. These findings provide novel evidence that pupil diameter correlates closely with control state, and are consistent with a role for the LC–NE system in the regulation of the exploration–exploitation trade-off in humans.
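
The model-based classification of choices can be made concrete with a short sketch: track each option's expected payoff with a delta rule and label a choice exploitative when it picks the currently highest-valued option, exploratory otherwise. The learning rate and the toy trial data are assumptions; this is the generic approach, not the study's fitted model.

    # Label choices as exploitative or exploratory under a simple delta-rule model.
    ALPHA = 0.3                                 # assumed learning rate

    def label_choices(choices, rewards, n_options=4):
        """choices: chosen option indices per trial; rewards: obtained payoffs."""
        Q = [0.0] * n_options
        labels = []
        for choice, reward in zip(choices, rewards):
            greedy = max(range(n_options), key=lambda k: Q[k])
            labels.append("exploit" if choice == greedy else "explore")
            Q[choice] += ALPHA * (reward - Q[choice])    # update the chosen option
        return labels

    # toy usage on a 4-option gambling task
    print(label_choices([0, 0, 2, 0, 3], [50, 60, 20, 55, 80]))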


2017
Author(s):
Tommy C. Blanchard
Samuel J. Gershman

Balancing exploration and exploitation is a fundamental problem in reinforcement learning. Previous neuroimaging studies of the exploration-exploitation dilemma could not completely disentangle these two processes, making it difficult to unambiguously identify their neural signatures. We overcome this problem using a task in which subjects can either observe (pure exploration) or bet (pure exploitation). Insula and dorsal anterior cingulate cortex showed significantly greater activity on observe trials compared to bet trials, suggesting that these regions play a role in driving exploration. A model-based analysis of task performance suggested that subjects chose to observe until a critical evidence threshold was reached. We observed a neural signature of this evidence accumulation process in ventromedial prefrontal cortex. These findings support theories positing an important role for anterior cingulate cortex in exploration, while also providing a new perspective on the roles of insula and ventromedial prefrontal cortex.

Significance Statement: Sitting down at a familiar restaurant, you may choose to order an old favorite or sample a new dish. In reinforcement learning theory, this is known as the exploration-exploitation dilemma. The optimal solution is known to be intractable; therefore, humans must use heuristic strategies. Behavioral studies have revealed several candidate strategies, but identifying the neural mechanisms underlying these strategies is complicated because exploration and exploitation are not perfectly dissociable in standard tasks. Using an "observe or bet" task, we identify for the first time pure neural correlates of exploration and exploitation in the human brain.
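
The evidence-threshold account lends itself to a very short sketch: accumulate log-odds evidence about the hidden state while observing, and switch to betting once the evidence magnitude crosses a criterion. The generative probabilities and threshold below are assumptions for illustration, not fitted values from the study.

    # Observe (pure exploration) until a log-odds criterion is reached, then bet.
    import math
    import random

    random.seed(4)
    P_HEADS = 0.7                               # hidden bias of the observed process
    THRESHOLD = 2.0                             # critical evidence before betting
    LLR = math.log(P_HEADS / (1 - P_HEADS))     # evidence contributed per observation

    evidence, observes = 0.0, 0
    while abs(evidence) < THRESHOLD:            # observe phase: information, no payoff
        outcome = random.random() < P_HEADS
        evidence += LLR if outcome else -LLR
        observes += 1
    bet = "heads" if evidence > 0 else "tails"  # bet phase: payoff, no new information
    print(f"observed {observes} times, then bet on {bet}")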


2020
Vol 10 (21)
pp. 7462
Author(s):
Jesús Enrique Sierra-García
Matilde Santos

In this work, a pitch controller for a wind turbine (WT), inspired by reinforcement learning (RL), is designed and implemented. The control system consists of a state estimator, a reward strategy, a policy table, and a policy update algorithm. Novel reward strategies related to the energy deviation from the rated power are defined and designed to improve the efficiency of the WT. Two new categories of reward strategies are proposed: “only positive” (O-P) and “positive-negative” (P-N) rewards. The relationship of these categories to the exploration-exploitation dilemma, the use of ϵ-greedy methods, and learning convergence is also discussed and linked to the WT control problem. In addition, an extensive analysis of the influence of the different rewards on controller performance and learning speed is carried out. The controller is compared with a proportional-integral-derivative (PID) regulator for the same small wind turbine and obtains better results. The simulations show how the P-N rewards improve the performance of the controller, stabilize the output power around the rated power, and reduce the error over time.
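
The tabular structure described above (state estimator, policy table, reward strategy, update rule) can be sketched compactly. In the snippet below the turbine model, the discretization, and the exact P-N reward constants are placeholders, and a standard epsilon-greedy temporal-difference update stands in for the paper's policy update algorithm.

    # Assumed sketch: epsilon-greedy tabular control of pitch around rated power
    # with a positive-negative (P-N) style reward on the power deviation.
    import random

    random.seed(5)
    RATED_POWER = 5.0                           # kW, illustrative small turbine
    PITCH_ACTIONS = [-1.0, 0.0, 1.0]            # pitch increments in degrees
    N_STATES = 7                                # discretized power-error bins
    ALPHA, GAMMA, EPSILON = 0.2, 0.9, 0.1
    Q = [[0.0] * len(PITCH_ACTIONS) for _ in range(N_STATES)]

    def state_of(power):
        """State estimator stand-in: bin the deviation from rated power."""
        err = max(-3.0, min(3.0, power - RATED_POWER))
        return int(round(err)) + 3              # maps [-3, 3] kW onto bins 0..6

    def pn_reward(before, after):
        """P-N reward: positive if the deviation shrank, negative penalty otherwise."""
        d0, d1 = abs(before - RATED_POWER), abs(after - RATED_POWER)
        return max(0.0, 1.0 - d1) if d1 < d0 else -d1

    def plant(power, pitch_delta, gust):
        """Toy turbine: pitching up sheds power, pitching down captures more."""
        return power - 0.5 * pitch_delta + gust

    power = 4.0
    for step in range(5000):
        s = state_of(power)
        if random.random() < EPSILON:
            a = random.randrange(len(PITCH_ACTIONS))                    # explore
        else:
            a = max(range(len(PITCH_ACTIONS)), key=lambda k: Q[s][k])   # exploit
        new_power = plant(power, PITCH_ACTIONS[a], random.uniform(-0.3, 0.3))
        r = pn_reward(power, new_power)
        s_next = state_of(new_power)
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])       # table update
        power = new_power

    print("output power after learning:", round(power, 2), "kW")

Swapping pn_reward for an only-positive variant (e.g. returning max(0.0, 1.0 - d1) in all cases) gives a rough, illustrative analogue of the O-P versus P-N comparison the abstract discusses.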

