Affordance as general value function: a computational model

2021
pp. 105971232199942
Author(s):
Daniel Graves
Johannes Günther
Jun Luo

General value functions (GVFs) in the reinforcement learning (RL) literature are long-term predictive summaries of the outcomes of agents following specific policies in the environment. Affordances, as perceived action possibilities with specific valence, may be cast as predictions of policy-relative goodness and modeled as GVFs. A systematic explication of this connection shows that GVFs and especially their deep-learning embodiments (1) realize affordance prediction as a form of direct perception, (2) illuminate the fundamental connection between action and perception in affordance, and (3) offer a scalable way to learn affordances using RL methods. Through an extensive review of existing literature on GVF applications and representative affordance research in robotics, we demonstrate that GVFs provide the right framework for learning affordances in real-world applications. In addition, we highlight a few new avenues of research opened up by the perspective of “affordance as GVF,” including using GVFs for orchestrating complex behaviors.
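
For concreteness, a GVF can be read as an ordinary value function whose "reward" is an arbitrary predicted signal (a cumulant) and whose discount is a state-dependent continuation probability. The on-policy TD(0) sketch below is a minimal illustration of that reading; the gym-style environment interface and the function names are assumptions for illustration, not code from the paper.

import numpy as np

def gvf_td0(env, policy, cumulant, continuation, n_states,
            alpha=0.1, episodes=500):
    # Tabular TD(0) estimate of a GVF: the expected discounted sum of a
    # cumulant signal under `policy`, with state-dependent continuation.
    v = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, _, done, _ = env.step(a)     # the environment reward is ignored
            c = cumulant(s, a, s_next)           # the signal being predicted
            g = 0.0 if done else continuation(s_next)
            v[s] += alpha * (c + g * v[s_next] - v[s])
            s = s_next
    return v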

Author(s):
Nicholay Topin
Manuela Veloso

Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, O(|F|² · |tr samples|). By applying our method to a family of domains, we show that our method scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains.
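
As a rough illustration of the idea (not the authors' exact abstraction procedure, which is based on feature importance), the sketch below groups observed states into abstract states by binning their learned values and then estimates a Markov chain over those abstract states from the recorded transitions; all names are assumptions.

from collections import defaultdict
import numpy as np

def abstracted_policy_graph(transitions, value_fn, n_bins=10):
    # transitions: list of (state, next_state) pairs observed under the policy.
    # value_fn: the learned state-value function used to group states.
    values = np.array([value_fn(s) for s, _ in transitions])
    edges = np.linspace(values.min(), values.max(), n_bins + 1)

    def abstract(s):                       # map a concrete state to an abstract state
        return int(np.clip(np.digitize(value_fn(s), edges) - 1, 0, n_bins - 1))

    counts = defaultdict(lambda: defaultdict(int))
    for s, s_next in transitions:
        counts[abstract(s)][abstract(s_next)] += 1

    graph = {}                             # Markov chain over abstract states
    for a, row in counts.items():
        total = sum(row.values())
        graph[a] = {b: c / total for b, c in row.items()}
    return graph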


Robotics
2021
Vol 10 (1)
pp. 22
Author(s):
Rongrong Liu
Florent Nageotte
Philippe Zanne
Michel de Mathelin
Birgitta Dresp-Langley

Deep learning has provided new ways of manipulating, processing, and analyzing data. It can sometimes achieve results comparable to, or even surpassing, human expert performance, and has become a source of inspiration in the era of artificial intelligence. Reinforcement learning, another subfield of machine learning, seeks an optimal behavior strategy through interactions with the environment. Combining deep learning and reinforcement learning makes it possible to address critical issues of dimensionality and scalability in tasks with sparse reward signals, such as robotic manipulation and control, that neither method can resolve on its own. In this paper, we review recent significant progress in deep reinforcement learning algorithms that tackle problems central to robotic manipulation control, such as sample efficiency and generalization. Despite these continuous improvements, the challenge of learning robust and versatile manipulation skills for robots with deep reinforcement learning remains far from resolved for real-world applications.
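
To make the combination concrete, the sketch below shows the core of a value-based deep RL agent: a neural network stands in for the tabular action-value function so that high-dimensional observations can be handled, and it is trained against a temporal-difference target. The architecture and hyperparameters are illustrative assumptions, not any specific algorithm from the surveyed papers.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return self.net(obs)                     # one Q-value per discrete action

def td_loss(q_net, target_net, batch, gamma=0.99):
    # batch: (obs, action, reward, next_obs, done) tensors; done is 0/1 floats.
    obs, action, reward, next_obs, done = batch
    q = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * target_net(next_obs).max(1).values
    return nn.functional.mse_loss(q, target)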


2020
Vol 8 (6)
pp. 5251-5255

Exploiting the efficiency and stability of dynamic crowd models, this paper proposes a hybrid, multi-agent crowd simulation algorithm that focuses on identifying the crowd to simulate. An efficient measurement for both static and dynamic crowd simulation is applied in tracking and transportation applications. The proposed Hybrid Agent Reinforcement Learning (HARL) algorithm combines the off-policy value function of Q-learning with the on-policy value function of SARSA and is applied to a dynamic crowd evacuation scenario. HARL maintains multiple value functions and combines the policy value functions derived from the multiple agents to improve performance. In addition, the efficiency of the HARL algorithm is demonstrated across varied crowd sizes. Two kinds of reinforcement learning applications, crowd tracking and transportation monitoring, are used to predict the crowd sizes.
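
The abstract does not spell out the combination rule, so the sketch below only illustrates the general idea of keeping both an off-policy (Q-learning) and an on-policy (SARSA) estimate and blending them for action selection; the weighting scheme and class names are assumptions, not the HARL rule from the paper.

import numpy as np

class HybridAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, eps=0.1, w=0.5):
        self.q_off = np.zeros((n_states, n_actions))   # Q-learning estimate
        self.q_on = np.zeros((n_states, n_actions))    # SARSA estimate
        self.alpha, self.gamma, self.eps, self.w = alpha, gamma, eps, w

    def act(self, s):
        q = self.w * self.q_off[s] + (1 - self.w) * self.q_on[s]  # blended values
        if np.random.rand() < self.eps:
            return np.random.randint(q.size)
        return int(np.argmax(q))

    def update(self, s, a, r, s_next, a_next, done):
        # Off-policy target bootstraps from the greedy action ...
        tgt_off = r + (0.0 if done else self.gamma * self.q_off[s_next].max())
        # ... while the on-policy target bootstraps from the action actually taken.
        tgt_on = r + (0.0 if done else self.gamma * self.q_on[s_next, a_next])
        self.q_off[s, a] += self.alpha * (tgt_off - self.q_off[s, a])
        self.q_on[s, a] += self.alpha * (tgt_on - self.q_on[s, a])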


2021
Vol 15
Author(s):  
Norberto M. Grzywacz

A theoretical framework for the reinforcement learning of aesthetic biases was recently proposed based on brain circuitries revealed by neuroimaging. A model grounded in that framework accounted for interesting features of human aesthetic biases, including individuality, cultural predispositions, stochastic dynamics of learning and aesthetic biases, and the peak-shift effect. Despite the success in explaining these features, a potential weakness was the linearity of the value function used to predict reward: the learning process employed a value function that assumed a linear relationship between reward and sensory stimuli. Linearity is common in reinforcement learning in neuroscience, but it can be problematic because neural mechanisms and the dependence of reward on sensory stimuli are typically nonlinear. Here, we analyze the learning performance of models that include optimal nonlinear value functions. We also compare updating the free parameters of the value functions with the delta rule, which neuroscience models use frequently, versus updating them with a new Phi rule that considers the structure of the nonlinearities. Our computer simulations showed that optimal nonlinear value functions reduced learning errors when the reward models were nonlinear; the new Phi rule likewise reduced these errors. These improvements were accompanied by a straightening of the trajectory of the vector of free parameters in its phase space, meaning that the process became more efficient at learning to predict reward. Surprisingly, however, this improved efficiency had a complex relationship with the rate of learning. Finally, the stochasticity arising from the probabilistic sampling of sensory stimuli, rewards, and motivations helped the learning process narrow the range of free parameters to nearly optimal outcomes. Therefore, we suggest that value functions and update rules optimized for social and ecological constraints are ideal for learning aesthetic biases.
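
For reference, the standard delta rule updates the free parameters in proportion to the reward-prediction error times the stimulus, regardless of the value function's nonlinearity. The sketch below applies it to a sigmoidal value function chosen purely for illustration; the paper's Phi rule is not reproduced here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rule_step(w, stimulus, reward, lr=0.05):
    v = sigmoid(np.dot(w, stimulus))   # nonlinear prediction of reward, V(s; w)
    error = reward - v                 # reward-prediction error
    # Plain delta rule: the update ignores the structure of the nonlinearity.
    return w + lr * error * stimulus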


2020
Vol 34 (04)
pp. 3741-3748
Author(s):
Kristopher De Asis
Alan Chan
Silviu Pitis
Richard Sutton
Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
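
A tabular sketch of the fixed-horizon recursion is given below: the horizon-h estimate bootstraps only from the horizon-(h−1) estimate, with the horizon-0 values fixed at zero, so no value function ever bootstraps from itself. The gym-style environment interface and function names are assumptions for illustration.

import numpy as np

def fixed_horizon_td(env, policy, n_states, H, alpha=0.1, gamma=1.0, episodes=500):
    v = np.zeros((H + 1, n_states))            # v[0] is identically zero by definition
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            for h in range(1, H + 1):
                # The horizon-h estimate bootstraps only from the horizon-(h-1) estimate.
                target = r + (0.0 if done else gamma * v[h - 1, s_next])
                v[h, s] += alpha * (target - v[h, s])
            s = s_next
    return v                                   # gamma=1.0 gives the undiscounted fixed-horizon return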


2020
Vol 34 (04)
pp. 5948-5955
Author(s):
Tian Tan
Zhihan Xiong
Vikranth R. Dwaracherla

It is well known that quantifying uncertainty in the action-value estimates is crucial for efficient exploration in reinforcement learning. Ensemble sampling offers a relatively tractable way of doing this using randomized value functions, but it still requires substantial computational resources for complex problems. In this paper, we present an alternative, computationally efficient way to induce exploration using index sampling. We use an indexed value function to represent uncertainty in our action-value estimates. We first present an algorithm to learn a parameterized indexed value function through a distributional version of temporal difference learning in a tabular setting and prove its regret bound. Then, from a computational point of view, we propose a dual-network architecture, Parameterized Indexed Networks (PINs), comprising one mean network and one uncertainty network, to learn the indexed value function. Finally, we show the efficacy of PINs through computational experiments.
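
As a rough illustration of acting with an indexed value function, the sketch below forms a sampled action-value estimate from a mean term and an uncertainty term scaled by a random index, then acts greedily with respect to it. The additive form and all names are assumptions rather than the exact PIN parameterization.

import numpy as np

def indexed_action(mean_q, uncertainty_q, state, rng):
    # mean_q, uncertainty_q: arrays of shape (n_states, n_actions).
    z = rng.standard_normal()                    # one sampled index drives exploration
    q_indexed = mean_q[state] + z * uncertainty_q[state]
    return int(np.argmax(q_indexed))

# Example use: rng = np.random.default_rng(0); a = indexed_action(mean_q, unc_q, s, rng)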


2000
Vol 13
pp. 227-303
Author(s):
T. G. Dietterich

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics (as a subroutine hierarchy) and a declarative semantics (as a representation of the value function of a hierarchical policy). MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
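
The core of the decomposition can be sketched as the recursion below, in which the value of invoking subtask a inside parent task i splits into the value of completing a itself plus a completion term for finishing i afterwards. The dictionary-based tables standing in for learned value and completion functions are assumptions for illustration.

def maxq_q(i, s, a, v_primitive, completion, children):
    # Q(i, s, a) = V(a, s) + C(i, s, a): value of doing subtask a within task i.
    return maxq_v(a, s, v_primitive, completion, children) + completion[(i, s, a)]

def maxq_v(i, s, v_primitive, completion, children):
    if i not in children:                        # primitive action: expected one-step reward
        return v_primitive[(i, s)]
    # Composite subtask: value of its best child action under the hierarchy.
    return max(maxq_q(i, s, a, v_primitive, completion, children) for a in children[i])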

