Stochasticity, Nonlinear Value Functions, and Update Rules in Learning Aesthetic Biases

2021 ◽  
Vol 15 ◽  
Author(s):  
Norberto M. Grzywacz

A theoretical framework for the reinforcement learning of aesthetic biases was recently proposed based on brain circuitries revealed by neuroimaging. A model grounded in that framework accounted for interesting features of human aesthetic biases, including individuality, cultural predispositions, stochastic dynamics of learning and aesthetic biases, and the peak-shift effect. However, despite the success in explaining these features, a potential weakness was the linearity of the value function used to predict reward, meaning that the learning process assumed a linear relationship between reward and sensory stimuli. Linearity is common in reinforcement learning in neuroscience, but it can be problematic because neural mechanisms and the dependence of reward on sensory stimuli are typically nonlinear. Here, we analyze learning performance with models that include optimal nonlinear value functions. We also compare updating the free parameters of the value functions with the delta rule, which neuroscience models use frequently, versus updating with a new Phi rule that takes the structure of the nonlinearities into account. Our computer simulations showed that optimal nonlinear value functions reduced learning errors when the reward models were nonlinear, and the new Phi rule reduced these errors further. These improvements were accompanied by a straightening of the trajectories of the vector of free parameters in its phase space, meaning that the process became more efficient at learning to predict reward. Surprisingly, however, this improved efficiency had a complex relationship with the rate of learning. Finally, the stochasticity arising from the probabilistic sampling of sensory stimuli, rewards, and motivations helped the learning process narrow the range of free parameters to nearly optimal outcomes. We therefore suggest that value functions and update rules optimized for social and ecological constraints are ideal for learning aesthetic biases.
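A minimal sketch of the delta rule the abstract contrasts with the Phi rule, applied to a parameterized value function that predicts reward from sensory features. The sigmoidal nonlinearity, learning rate, and toy reward model are assumptions for illustration, not the paper's model, and the Phi rule itself is not reproduced here.

```python
import numpy as np

def value(w, x, nonlinear=True):
    """Predicted reward for stimulus features x under parameters w.
    The sigmoid is an assumed nonlinearity, not the paper's exact form."""
    z = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z)) if nonlinear else z

def delta_rule_update(w, x, reward, alpha=0.05, nonlinear=True):
    """Delta rule: move parameters along the gradient of the prediction,
    scaled by the reward-prediction error."""
    v = value(w, x, nonlinear)
    delta = reward - v                                   # reward-prediction error
    grad = v * (1.0 - v) * x if nonlinear else x         # dV/dw (sigmoid vs. linear)
    return w + alpha * delta * grad

# Example: learn to predict a hypothetical nonlinear reward from random stimuli
rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(5000):
    x = rng.uniform(0.0, 1.0, size=3)
    r = 1.0 / (1.0 + np.exp(-(x @ np.array([2.0, -1.0, 0.5]))))  # assumed reward model
    w = delta_rule_update(w, x, r)
print(w)
```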

2020 ◽  
Vol 8 (6) ◽  
pp. 5251-5255

Exploiting the efficiency and stability of dynamic crowds, this paper proposes a hybrid, multi-agent crowd simulation algorithm that focuses mainly on identifying the crowd to be simulated. An efficient measurement for both static and dynamic crowd simulation is applied in tracking and transportation applications. The proposed Hybrid Agent Reinforcement Learning (HARL) algorithm combines the off-policy value function of Q-learning with the on-policy value function of SARSA, and is used for a dynamic crowd-evacuation scenario. The HARL algorithm maintains multiple value functions and combines the policy value functions derived from the multiple agents to improve performance. In addition, the efficiency of the HARL algorithm is demonstrated over varied crowd sizes. Two kinds of reinforcement learning applications, crowd tracking and transportation monitoring, are used to represent the different crowd sizes.
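A minimal sketch of the core idea of blending an off-policy Q-learning target with an on-policy SARSA target in a tabular setting. The mixing weight and update form are assumptions for illustration; the paper does not specify this exact combination rule.

```python
import numpy as np

def hybrid_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95, mix=0.5):
    """Blend the off-policy (Q-learning) and on-policy (SARSA) targets.
    `mix` is a hypothetical weighting between the two bootstrapped targets."""
    q_learning_target = r + gamma * np.max(Q[s_next])   # off-policy: greedy bootstrap
    sarsa_target = r + gamma * Q[s_next, a_next]         # on-policy: action actually taken
    target = mix * q_learning_target + (1.0 - mix) * sarsa_target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical usage with 10 states and 4 actions
Q = np.zeros((10, 4))
Q = hybrid_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=3)
```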


2021 ◽  
pp. 105971232199942
Author(s):  
Daniel Graves ◽  
Johannes Günther ◽  
Jun Luo

General value functions (GVFs) in the reinforcement learning (RL) literature are long-term predictive summaries of the outcomes of agents following specific policies in the environment. Affordances as perceived action possibilities with specific valence may be cast into predicted policy-relative goodness and modeled as GVFs. A systematic explication of this connection shows that GVFs and especially their deep-learning embodiments (1) realize affordance prediction as a form of direct perception, (2) illuminate the fundamental connection between action and perception in affordance, and (3) offer a scalable way to learn affordances using RL methods. Through an extensive review of existing literature on GVF applications and representative affordance research in robotics, we demonstrate that GVFs provide the right framework for learning affordances in real-world applications. In addition, we highlight a few new avenues of research opened up by the perspective of “affordance as GVF,” including using GVFs for orchestrating complex behaviors.
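A minimal sketch of a general value function learned with TD(0): the prediction target is the discounted sum of an arbitrary cumulant rather than the reward, with a state-dependent continuation function in place of a fixed gamma. The linear features, cumulant, and "distance to obstacle" example are assumptions for illustration, not taken from the reviewed applications.

```python
import numpy as np

def gvf_td_update(w, phi, phi_next, cumulant, continuation, alpha=0.1):
    """One TD(0) update of a linear GVF: predicts the (continuation-discounted)
    sum of `cumulant` along the agent's experience."""
    prediction = w @ phi
    target = cumulant + continuation * (w @ phi_next)
    td_error = target - prediction
    return w + alpha * td_error * phi

# Hypothetical example: predict an affordance-like signal such as clearance to an obstacle
w = np.zeros(4)
phi = np.array([1.0, 0.2, 0.0, 0.5])
phi_next = np.array([1.0, 0.1, 0.3, 0.4])
w = gvf_td_update(w, phi, phi_next, cumulant=0.8, continuation=0.9)
```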


2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
Amin Mousavi ◽  
Babak Nadjar Araabi ◽  
Majid Nili Ahmadabadi

This paper discusses the notion of context transfer in reinforcement learning tasks. Context transfer, as defined in this paper, implies knowledge transfer between source and target tasks that share the same environment dynamics and reward function but have different state or action spaces. In other words, the agents learn the same task while using different sensors and actuators. This requires the existence of an underlying common Markov decision process (MDP) to which all the agents’ MDPs can be mapped. This is formulated in terms of the notion of MDP homomorphism. The learning framework is Q-learning. To transfer the knowledge between these tasks, the feature space is used as a translator and is expressed as a partial mapping between the state-action spaces of different tasks. The Q-values learned during the learning process of the source tasks are mapped to the sets of Q-values for the target task. These transferred Q-values are merged together and used to initialize the learning process of the target task. An interval-based approach is used to represent and merge the knowledge of the source tasks. Empirical results show that the transferred initialization can be beneficial to the learning process of the target task.
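A minimal sketch of the transfer step: source Q-values are pushed through a partial state-action mapping (standing in for the shared feature space) and merged to initialize the target Q-table. Merging here is a simple average; the paper's interval-based representation is not reproduced. All names and data are hypothetical.

```python
from collections import defaultdict

def transfer_q_values(source_qs, mappings):
    """Initialize a target Q-table from several source Q-tables.
    `mappings[i]` partially maps target (state, action) pairs to
    (state, action) pairs of source task i."""
    collected = defaultdict(list)
    for source_q, mapping in zip(source_qs, mappings):
        for target_sa, source_sa in mapping.items():
            if source_sa in source_q:
                collected[target_sa].append(source_q[source_sa])
    # Merge the transferred values; averaging is a simplification of the interval merge.
    return {sa: sum(vals) / len(vals) for sa, vals in collected.items()}

# Hypothetical usage with two source tasks
source_qs = [{("s0", "left"): 1.2}, {("s0", "left"): 0.8}]
mappings = [{("t0", "a0"): ("s0", "left")}, {("t0", "a0"): ("s0", "left")}]
print(transfer_q_values(source_qs, mappings))  # {('t0', 'a0'): 1.0}
```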


2020 ◽  
Vol 34 (04) ◽  
pp. 3741-3748
Author(s):  
Kristopher De Asis ◽  
Alan Chan ◽  
Silviu Pitis ◽  
Richard Sutton ◽  
Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
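A minimal tabular sketch of the fixed-horizon bootstrapping idea described above: the horizon-h value of a state is updated toward the reward plus the horizon-(h-1) value of the next state, so no value function ever bootstraps from itself. The state indexing and step size are placeholders.

```python
import numpy as np

def fixed_horizon_td_update(V, s, r, s_next, alpha=0.1):
    """V has shape (H+1, n_states); V[0] is identically zero.
    Each horizon h bootstraps only from horizon h-1."""
    H = V.shape[0] - 1
    for h in range(1, H + 1):
        target = r + V[h - 1, s_next]        # predicted sum of the next h rewards
        V[h, s] += alpha * (target - V[h, s])
    return V

# Example: 5 states, horizon 3, one hypothetical transition s=0 -> s=1 with reward 1
V = np.zeros((4, 5))
V = fixed_horizon_td_update(V, s=0, r=1.0, s_next=1)
```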


2020 ◽  
Vol 34 (04) ◽  
pp. 5948-5955
Author(s):  
Tian Tan ◽  
Zhihan Xiong ◽  
Vikranth R. Dwaracherla

It is well known that quantifying uncertainty in the action-value estimates is crucial for efficient exploration in reinforcement learning. Ensemble sampling offers a relatively computationally tractable way of doing this using randomized value functions. However, it still requires a huge amount of computational resources for complex problems. In this paper, we present an alternative, computationally efficient way to induce exploration using index sampling. We use an indexed value function to represent uncertainty in our action-value estimates. We first present an algorithm to learn a parameterized indexed value function through a distributional version of temporal difference learning in a tabular setting and prove its regret bound. Then, from a computational point of view, we propose a dual-network architecture, Parameterized Indexed Networks (PINs), comprising one mean network and one uncertainty network, to learn the indexed value function. Finally, we show the efficacy of PINs through computational experiments.
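A minimal sketch of acting with an indexed value function: a random index is sampled, held fixed, and used to perturb the action-value estimates so that greedy action selection explores. The mean-plus-index-scaled-uncertainty parameterization is an assumption inspired by the mean/uncertainty networks described above, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def indexed_q(mean, uncertainty, z):
    """Indexed action values: mean estimate plus index-scaled uncertainty.
    `z` is the sampled index, held fixed for the duration of an episode."""
    return mean + uncertainty * z

def select_action(mean, uncertainty, z):
    return int(np.argmax(indexed_q(mean, uncertainty, z)))

# Hypothetical 3-action example: one index sample per episode drives exploration
mean = np.array([0.2, 0.5, 0.1])
uncertainty = np.array([0.4, 0.05, 0.6])
z = rng.standard_normal()   # index sampled at the start of the episode
a = select_action(mean, uncertainty, z)
```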


2018 ◽  
Author(s):  
Venkata S Aditya Tarigoppula ◽  
John S Choi ◽  
John P Hessburg ◽  
David B McNiel ◽  
Brandi T Marsh ◽  
...  

Temporal difference reinforcement learning (TDRL) accurately models associative learning observed in animals, where they learn to associate outcome-predicting environmental states, termed conditioned stimuli (CS), with the value of outcomes, such as rewards, termed unconditioned stimuli (US). A component of TDRL is the value function, which captures the expected cumulative future reward from a given state. The value function can be modified by changes in the animal's knowledge, such as by the predictability of its environment. Here we show that primary motor cortical (M1) neurodynamics reflect a TD learning process, encoding a state value function and reward prediction error in line with TDRL. M1 responds to the delivery of reward and shifts its value-related response earlier in a trial, becoming predictive of an expected reward when the reward is predictable due to a CS. This is observed in tasks performed manually or observed passively, as well as in tasks without an explicit CS predicting reward but simply with a predictable temporal structure, that is, a predictable environment. M1 also encodes the expected reward value associated with a set of CS in a multiple-reward-level CS-US task. Here we extend the Microstimulus TDRL model, reported to accurately capture RL-related dopaminergic activity, to account for M1 reward-related neural activity in a multitude of tasks.

Significance statement: There is a great deal of agreement between aspects of temporal difference reinforcement learning (TDRL) models and neural activity in dopaminergic brain centers. Dopamine is known to be necessary for sensorimotor-learning-induced synaptic plasticity in the motor cortex (M1), and thus one might expect to see the hallmarks of TDRL in M1, which we show here in the form of a state value function and reward prediction error. We see these hallmarks even when a conditioned stimulus is not available but the environment is predictable, during manual tasks with agency as well as observational tasks without agency. This information has implications for autonomously updating brain-machine interfaces, as we and others have proposed and published.
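A minimal sketch of the two TD quantities the abstract refers to, a state value function and a reward-prediction error, in a generic tabular TD(0) form. This is an illustration only, not the Microstimulus TDRL model used in the paper; the three-state CS-to-reward trial structure is hypothetical.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """Tabular TD(0): the reward-prediction error `delta` drives the value update."""
    delta = r + gamma * V[s_next] - V[s]   # reward-prediction error
    V[s] += alpha * delta
    return V, delta

# Hypothetical trial: a CS at state 0 precedes a reward delivered on the 1 -> 2 transition
V = np.zeros(3)
for _ in range(200):
    V, _ = td0_update(V, s=1, r=1.0, s_next=2)
    V, _ = td0_update(V, s=0, r=0.0, s_next=1)
print(V)  # the learned value becomes predictive at the CS state (state 0)
```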


2000 ◽  
Vol 13 ◽  
pp. 227-303 ◽  
Author(s):  
T. G. Dietterich

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q-learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
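A minimal sketch of the additive MAXQ decomposition: the value of invoking a child subtask from a parent is the child's own value plus a completion value for finishing the parent afterward, and a composite node's value is the maximum over its children. The tiny two-level hierarchy, state names, and numbers below are hypothetical.

```python
def maxq_value(node, s, V_primitive, C, children):
    """MAXQ decomposition: Q(parent, s, child) = V(child, s) + C(parent, s, child),
    and V(node, s) = max over children of Q(node, s, child) for composite nodes."""
    if node not in children:                       # primitive action
        return V_primitive[(node, s)]
    return max(
        maxq_value(child, s, V_primitive, C, children) + C[(node, s, child)]
        for child in children[node]
    )

# Hypothetical hierarchy: Root -> {Navigate, Pickup}, both primitive here
children = {"Root": ["Navigate", "Pickup"]}
V_primitive = {("Navigate", "s0"): -1.0, ("Pickup", "s0"): 5.0}
C = {("Root", "s0", "Navigate"): 4.0, ("Root", "s0", "Pickup"): 0.0}
print(maxq_value("Root", "s0", V_primitive, C, children))  # 5.0
```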


2021 ◽  
Vol 35 (2) ◽  
Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Deep reinforcement learning methods have achieved significant successes in complex decision-making problems. However, they traditionally rely on well-designed extrinsic rewards, which limits their applicability to many real-world tasks where rewards are naturally sparse. While cloning behaviors provided by an expert is a promising approach to the exploration problem, learning from a fixed set of demonstrations may be impracticable due to lack of state coverage or distribution mismatch, which arises when the learner's goal deviates from the demonstrated behaviors. Moreover, we are interested in learning how to reach a wide range of goals from the same set of demonstrations. In this work we propose a novel goal-conditioned method that leverages very small sets of goal-driven demonstrations to massively accelerate the learning process. Crucially, we introduce the concept of active goal-driven demonstrations to query the demonstrator only in hard-to-learn and uncertain regions of the state space. We further present a strategy for prioritizing the sampling of goals where the disagreement between the expert and the policy is maximized. We evaluate our method on a variety of benchmark environments from the MuJoCo domain. Experimental results show that our method outperforms prior imitation learning approaches in most of the tasks in terms of exploration efficiency and average scores.
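A minimal sketch of the active-querying idea: demonstrations are requested only for the goals where the expert's and the current policy's suggested actions disagree most. The L2 disagreement measure, interfaces, and budget are assumptions for illustration, not the paper's exact prioritization criterion.

```python
import numpy as np

def select_goals_to_query(goals, policy_action, expert_action, budget=2):
    """Rank candidate goals by expert/policy disagreement and return the
    `budget` goals with the largest disagreement (hypothetical L2 measure)."""
    scores = [np.linalg.norm(expert_action(g) - policy_action(g)) for g in goals]
    ranked = np.argsort(scores)[::-1]
    return [goals[i] for i in ranked[:budget]]

# Hypothetical usage with toy 1-D action suggestions per goal
goals = [0, 1, 2, 3]
policy_action = lambda g: np.array([0.0])
expert_action = lambda g: np.array([float(g)])
print(select_goals_to_query(goals, policy_action, expert_action))  # [3, 2]
```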


Author(s):  
Yugo Hayashi

Research on collaborative learning has revealed that peer-collaboration explanation activities facilitate reflection and metacognition and that establishing common ground and successful coordination are keys to realizing effective knowledge-sharing in collaborative learning tasks. Studies on computer-supported collaborative learning have investigated how awareness tools can facilitate coordination within a group and how the use of external facilitation scripts can elicit elaborated knowledge during collaboration. However, the separate and joint effects of these tools on the nature of the collaborative process and performance have rarely been investigated. This study investigates how two facilitation methods—coordination support via learner gaze-awareness feedback and metacognitive suggestion provision via a pedagogical conversational agent (PCA)—are able to enhance the learning process and learning gains. Eighty participants, organized into dyads, were enrolled in a 2 × 2 between-subject study. The first and second factors were the presence of real-time gaze feedback (no vs. visible gaze) and that of a suggestion-providing PCA (no vs. visible agent), respectively. Two evaluation methods were used: namely, dialog analysis of the collaborative process and evaluation of learning gains. The real-time gaze feedback and PCA suggestions facilitated the coordination process, while gaze was relatively more effective in improving the learning gains. Learners in the Gaze-feedback condition achieved superior learning gains upon receiving PCA suggestions. A successful coordination/high learning performance correlation was noted solely for learners receiving visible gaze feedback and PCA suggestions simultaneously (visible gaze/visible agent). This finding has the potential to yield improved collaborative processes and learning gains through integration of these two methods as well as contributing towards design principles for collaborative-learning support systems more generally.

