Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning

It is well known that quantifying uncertainty in action-value estimates is crucial for efficient exploration in reinforcement learning. Ensemble sampling offers a relatively tractable way of doing this using randomized value functions, but it still requires substantial computational resources for complex problems. In this paper, we present an alternative, computationally efficient way to induce exploration using index sampling. We use an indexed value function to represent uncertainty in our action-value estimates. We first present an algorithm that learns a parameterized indexed value function through a distributional version of temporal-difference learning in a tabular setting, and we prove a regret bound for it. Then, from a computational point of view, we propose a dual-network architecture, Parameterized Indexed Networks (PINs), comprising one mean network and one uncertainty network to learn the indexed value function. Finally, we show the efficacy of PINs through computational experiments.
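The abstract does not give the functional form of the indexed value function, so the following is only a minimal sketch under stated assumptions: the value is a mean estimate plus a random index times an uncertainty estimate, the index is a standard Gaussian drawn once per episode, and the agent acts greedily for that index. The names `indexed_q_values`, `greedy_action`, `mean` and `spread` are hypothetical.

```python
import random

def indexed_q_values(mean, spread, z):
    """Return Q_z(a) = mean[a] + z * spread[a] for each action a."""
    return [m + z * s for m, s in zip(mean, spread)]

def greedy_action(q_values):
    """Index of the largest value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical estimates for one state with two actions.
mean = [1.0, 0.8]      # output of an assumed "mean" model
spread = [0.05, 0.6]   # output of an assumed "uncertainty" model

rng = random.Random(0)
z = rng.gauss(0.0, 1.0)  # one index per episode (Gaussian is an assumption)
action = greedy_action(indexed_q_values(mean, spread, z))
```

With an index of zero the agent picks the action with the higher mean; a sufficiently positive index favours the action whose estimate is more uncertain, which is how a single sampled index can drive exploration.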


A Hybrid Algorithm in Reinforcement Learning for Crowd Simulation

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f9187.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 5251-5255

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Crowd Simulation ◽

Value Functions ◽

Q Learning ◽

Efficient Measurement ◽

Multi Agent ◽

Hybrid Agent ◽

Multiple Value ◽

Transportation Applications

Exploiting the efficiency and stability of Dynamic Crowd, the paper proposes a hybrid multi-agent crowd simulation algorithm that focuses mainly on identifying the crowd to simulate. An efficient measurement for both static and dynamic crowd simulation is applied in tracking and transportation applications. The proposed Hybrid Agent Reinforcement Learning (HARL) algorithm combines the off-policy value function of Q-learning with the on-policy value function of SARSA, and it is used in a dynamic crowd evacuation scenario. HARL maintains multiple value functions and combines the policy value functions derived from the multiple agents to improve performance. In addition, the efficiency of HARL is demonstrated for varied crowd sizes. Two kinds of reinforcement learning applications, tracking and transportation monitoring, are used to simulate the crowd sizes.
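The abstract does not say how HARL merges the two value functions. As an illustration only, the sketch below shows the standard Q-learning (off-policy) and SARSA (on-policy) targets and mixes them with a hypothetical weight `beta`; the mixing rule and all function names are assumptions, not the authors' method.

```python
def q_learning_target(reward, next_q, gamma):
    """Off-policy target: bootstrap from the best next action."""
    return reward + gamma * max(next_q)

def sarsa_target(reward, next_q, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * next_q[next_action]

def blended_update(q_sa, reward, next_q, next_action,
                   alpha=0.5, gamma=0.9, beta=0.5):
    """Move q_sa toward a convex mix of the two targets (assumed rule)."""
    target = (beta * q_learning_target(reward, next_q, gamma)
              + (1.0 - beta) * sarsa_target(reward, next_q, next_action, gamma))
    return q_sa + alpha * (target - q_sa)

# Next-state action values and the action the behaviour policy chose there.
updated = blended_update(0.0, reward=0.0, next_q=[1.0, 2.0], next_action=0,
                         alpha=1.0, gamma=0.5, beta=0.5)
```

Setting `beta` to 1 recovers plain Q-learning and setting it to 0 recovers plain SARSA.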


Stochasticity, Nonlinear Value Functions, and Update Rules in Learning Aesthetic Biases

Frontiers in Human Neuroscience ◽

10.3389/fnhum.2021.639081 ◽

2021 ◽

Vol 15 ◽

Author(s):

Norberto M. Grzywacz

Keyword(s):

Reinforcement Learning ◽

Learning Process ◽

Value Function ◽

Stochastic Dynamics ◽

Peak Shift ◽

Learning Performance ◽

Value Functions ◽

Sensory Stimuli ◽

Free Parameters ◽

Update Rules

A theoretical framework for the reinforcement learning of aesthetic biases was recently proposed based on brain circuitries revealed by neuroimaging. A model grounded in that framework accounted for interesting features of human aesthetic biases. These features included individuality, cultural predispositions, stochastic dynamics of learning and aesthetic biases, and the peak-shift effect. However, despite the success in explaining these features, a potential weakness was the linearity of the value function used to predict reward. This linearity meant that the learning process employed a value function that assumed a linear relationship between reward and sensory stimuli. Linearity is common in reinforcement learning in neuroscience. However, linearity can be problematic because neural mechanisms and the dependence of reward on sensory stimuli are typically nonlinear. Here, we analyze the learning performance with models including optimal nonlinear value functions. We also compare updating the free parameters of the value functions with the delta rule, which neuroscience models use frequently, vs. updating with a new Phi rule that considers the structure of the nonlinearities. Our computer simulations showed that optimal nonlinear value functions resulted in improvements in learning errors when the reward models were nonlinear. Similarly, the new Phi rule led to improvements in these errors. These improvements were accompanied by the straightening of the trajectories of the vector of free parameters in its phase space. This straightening meant that the process became more efficient in learning the prediction of reward. Surprisingly, however, this improved efficiency had a complex relationship with the rate of learning. Finally, the stochasticity arising from the probabilistic sampling of sensory stimuli, rewards, and motivations helped the learning process narrow the range of free parameters to nearly optimal outcomes. Therefore, we suggest that value functions and update rules optimized for social and ecological constraints are ideal for learning aesthetic biases.
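For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of the delta rule applied to a linear value function. The stimulus vector, reward and learning rate are made-up illustrative values, and the Phi rule is not shown because the abstract does not define it.

```python
def predict(weights, stimulus):
    """Linear value function: predicted reward is the dot product w . x."""
    return sum(w * x for w, x in zip(weights, stimulus))

def delta_rule_step(weights, stimulus, reward, lr=0.1):
    """Shift each weight in proportion to the prediction error and its input."""
    error = reward - predict(weights, stimulus)
    return [w + lr * error * x for w, x in zip(weights, stimulus)]

# Hypothetical single stimulus that is always followed by a reward of 1.0.
stimulus, reward = [1.0, 2.0], 1.0
weights = [0.0, 0.0]
for _ in range(50):
    weights = delta_rule_step(weights, stimulus, reward)
```

In this toy setting each step multiplies the prediction error by a constant factor below one, so the prediction approaches the reward geometrically.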


Reinforcement learning in the environment where optimal action value function is partly discontinuous

2016 55th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE) ◽

10.1109/sice.2016.7749277 ◽

2016 ◽

Author(s):

Shingo Shibusawa ◽

Takeshi Shibuya

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Optimal Action ◽

Action Value


Reinforcement Learning for Optimizing Driving Policies on Cruising Taxis Services

Sustainability ◽

10.3390/su12218883 ◽

2020 ◽

Vol 12 (21) ◽

pp. 8883

Author(s):

Kun Jin ◽

Wei Wang ◽

Xuedong Hua ◽

Wei Zhou

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

State Action ◽

Future Reward ◽

Long Run ◽

Markov Decision ◽

Action Value ◽

Data Expansion ◽

Taking Action ◽

The Value Function

As a key element of urban transportation, taxi services provide significant convenience and comfort for residents' travel. In practice, however, they are not very efficient. Previous research mainly aimed to optimize policies through order dispatch on ride-hailing services, an approach that cannot be applied to cruising taxi services. This paper develops a reinforcement learning (RL) framework to optimize driving policies for cruising taxi services. First, we formulate drivers' behaviours as a Markov decision process (MDP), considering the long-run influence of each action. The RL framework uses dynamic programming and data expansion to calculate the state-action value function. Following the value function, drivers can determine the best choice and quantify the expected future reward at a particular state. Using historical order data from Chengdu, we analyse the spatial distribution of the value function and demonstrate how the model can optimize driving policies. Finally, a realistic simulation of the on-demand platform was built. Compared with other benchmark methods, the new model performs better at increasing total revenue and answer rate and at decreasing waiting time, with relative changes of up to 4.8%, 6.2% and −27.27%, respectively.
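To illustrate what computing a state-action value function by dynamic programming looks like, the sketch below runs Q-value iteration on a made-up two-zone problem. The zones, actions, rewards and deterministic transitions are hypothetical and are not taken from the Chengdu data; the paper's data expansion step is not modelled.

```python
def q_value_iteration(transitions, gamma=0.9, sweeps=200):
    """transitions maps (state, action) -> (reward, next_state)."""
    q = {sa: 0.0 for sa in transitions}
    for _ in range(sweeps):
        new_q = {}
        for (state, action), (reward, next_state) in transitions.items():
            # Bellman optimality backup: reward plus discounted best next value.
            best_next = max(v for (s, _), v in q.items() if s == next_state)
            new_q[(state, action)] = reward + gamma * best_next
        q = new_q
    return q

# Hypothetical zones A and B; staying in B earns more per step than staying in A.
transitions = {
    ("A", "stay"): (1.0, "A"),
    ("A", "move"): (0.0, "B"),
    ("B", "stay"): (2.0, "B"),
    ("B", "move"): (0.0, "A"),
}
q = q_value_iteration(transitions)
best_in_a = max(("stay", "move"), key=lambda a: q[("A", a)])
```

Here moving from A to B earns nothing immediately but has the higher long-run value, which is the kind of trade-off a state-action value function makes explicit.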


Affordance as general value function: a computational model

Adaptive Behavior ◽

10.1177/1059712321999421 ◽

2021 ◽

pp. 105971232199942

Author(s):

Daniel Graves ◽

Johannes Günther ◽

Jun Luo

Keyword(s):

Deep Learning ◽

Reinforcement Learning ◽

Value Function ◽

Direct Perception ◽

Value Functions ◽

Extensive Review ◽

Action And Perception ◽

Real World Applications ◽

The Right

General value functions (GVFs) in the reinforcement learning (RL) literature are long-term predictive summaries of the outcomes of agents following specific policies in the environment. Affordances as perceived action possibilities with specific valence may be cast into predicted policy-relative goodness and modeled as GVFs. A systematic explication of this connection shows that GVFs and especially their deep-learning embodiments (1) realize affordance prediction as a form of direct perception, (2) illuminate the fundamental connection between action and perception in affordance, and (3) offer a scalable way to learn affordances using RL methods. Through an extensive review of existing literature on GVF applications and representative affordance research in robotics, we demonstrate that GVFs provide the right framework for learning affordances in real-world applications. In addition, we highlight a few new avenues of research opened up by the perspective of “affordance as GVF,” including using GVFs for orchestrating complex behaviors.


Determinantal Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014659 ◽

2019 ◽

Vol 33 ◽

pp. 4659-4666

Author(s):

Takayuki Osogami ◽

Rudy Raymond

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Positive Semidefinite ◽

Positive Semidefinite Matrix ◽

The Matrix ◽

Multi Agent ◽

Action Value ◽

The Individual ◽

Partially Observable ◽

Semidefinite Matrix

We study reinforcement learning for controlling multiple agents in a collaborative manner. In some of those tasks, it is not enough for the individual agents to take relevant actions; those actions should also be diverse. We propose using the determinant of a positive semidefinite matrix to approximate the action-value function in reinforcement learning, where we learn the matrix so that it represents the relevance and diversity of the actions. Experimental results show that the proposed approach allows the agents to learn a nearly optimal policy approximately ten times faster than baseline approaches in benchmark tasks of multi-agent reinforcement learning. The proposed approach is also shown to achieve performance that cannot be achieved with conventional approaches in a partially observable environment with an exponentially large action space.
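A small sketch may help show why a determinant can encode both relevance and diversity. It assumes each agent's action is represented by an embedding vector and that the joint value is the determinant of the Gram matrix of the chosen vectors, which is always positive semidefinite. This parameterization is an assumption made for illustration; the abstract does not specify how the matrix is built or learned.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def joint_value(v1, v2):
    """Determinant of the 2x2 Gram matrix of two action embeddings."""
    g11, g12, g22 = dot(v1, v1), dot(v1, v2), dot(v2, v2)
    return g11 * g22 - g12 * g12

# Hypothetical embeddings: vector length stands for relevance, angle for diversity.
diverse = joint_value([1.0, 0.0], [0.0, 1.0])    # orthogonal actions
redundant = joint_value([1.0, 0.0], [1.0, 0.0])  # identical actions
stronger = joint_value([2.0, 0.0], [0.0, 1.0])   # one action more relevant
```

The determinant grows with the lengths of the vectors and shrinks as they become more aligned, so identical actions score zero no matter how relevant each is on its own.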


A solution for the Elevators Group Dispatch by Multiagent Reinforcement Learning

10.5753/eniac.2019.9322 ◽

2019 ◽

Author(s):

Jordão Memória ◽

José Maia

Keyword(s):

Reinforcement Learning ◽

Function Approximation ◽

Value Function ◽

The State ◽

Evaluation Function ◽

State Action ◽

Traffic Pattern ◽

Multiagent Reinforcement Learning ◽

Multi Agent ◽

Action Value

In this work, a model and an algorithm based on multiagent reinforcement learning are developed for the problem of elevator group dispatch. The main advantage is that, together with function approximation, this multi-agent solution reduces the state space, allowing complex states to be addressed with a synthesizing evaluation function. Each elevator is considered an agent that has to decide between two actions: answer or ignore the new call. Over a number of iterations, the agents learn the weights of an evaluation function that approximates the state-action value function. The performance of the solution, measured as average waiting time (AWT) and shown for varying traffic patterns, flows of people, numbers of elevators and numbers of floors, is comparable to other current proposals reported in the literature.


1A1-M14 Reinforcement Learning in Continuous State and Action Spaces : Action Value Functions Expressed by Sigmoid Neural Networks and CMAC(Evolution and Learning for Robotics)

The Proceedings of JSME annual Conference on Robotics and Mechatronics (Robomec) ◽

10.1299/jsmermd.2011._1a1-m14_1 ◽

2011 ◽

Vol 2011 (0) ◽

pp. _1A1-M14_1-_1A1-M14_4

Author(s):

Kazuaki YAMADA

Keyword(s):

Neural Networks ◽

Reinforcement Learning ◽

Value Functions ◽

Continuous State ◽

Action Value ◽

Action Spaces


Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5784 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3741-3748

Author(s):

Kristopher De Asis ◽

Alan Chan ◽

Silviu Pitis ◽

Richard Sutton ◽

Daniel Graves

Keyword(s):

Reinforcement Learning ◽

Function Approximation ◽

Value Function ◽

Temporal Difference ◽

Value Functions ◽

Difference Methods ◽

Td Methods ◽

The Stability ◽

The Value Function ◽

Temporal Difference Methods

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
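The bootstrapping scheme described above can be sketched in tabular form: the estimate for horizon h is moved toward the observed reward plus the estimate for horizon h−1 at the next state, and the horizon-0 value is fixed at zero. The function name, the step size and the single-state example are illustrative assumptions; rewards are summed without discounting, as in the abstract's description.

```python
def fixed_horizon_td_update(values, state, reward, next_state, alpha=0.1):
    """values[h][s] estimates the sum of the next h rewards from state s.

    values[0] is never updated and stays at zero.
    """
    for h in range(len(values) - 1, 0, -1):
        # Horizon h bootstraps from horizon h-1, never from itself.
        target = reward + values[h - 1][next_state]
        values[h][state] += alpha * (target - values[h][state])

# Hypothetical one-state loop that pays a reward of 1.0 on every step.
max_horizon = 3
values = [[0.0] for _ in range(max_horizon + 1)]
for _ in range(200):
    fixed_horizon_td_update(values, state=0, reward=1.0, next_state=0, alpha=0.5)
```

In this toy loop the horizon-h estimate approaches h, the sum of the next h rewards.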


Ensemble Network Architecture for Deep Reinforcement Learning

Mathematical Problems in Engineering ◽

10.1155/2018/2129393 ◽

2018 ◽

Vol 2018 ◽

pp. 1-6 ◽

Cited By ~ 6

Author(s):

Xi-liang Chen ◽

Lei Cao ◽

Chen-xi Li ◽

Zhi-xiong Xu ◽

Jun Lai

Keyword(s):

Reinforcement Learning ◽

Network Architecture ◽

Function Approximation ◽

Value Function ◽

Learning Algorithm ◽

Approximation Error ◽

Value Function Approximation ◽

Value Evaluation ◽

Target Values ◽

Classical Control

The popular deep Q-learning algorithm is known to be unstable because of fluctuations in the Q-value and the overestimation of action values under certain conditions. These issues tend to adversely affect its performance. In this paper, we develop an ensemble network architecture for deep reinforcement learning that is based on value function approximation. The temporal ensemble stabilizes the training process by reducing the variance of the target approximation error, and the ensemble of target values reduces overestimation and improves performance by estimating more accurate Q-values. Our results show that this architecture leads to statistically significantly better value evaluation and to more stable and better performance on several classical control tasks in the OpenAI Gym environment.
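The abstract does not state how the ensemble of target values is formed. One common way to build such a target, shown here purely as an assumed sketch, is to average the next-state action values across several target estimators before taking the maximum. The function name and the numbers are hypothetical.

```python
def ensemble_target(reward, next_q_per_member, gamma=0.99, done=False):
    """TD target built from the ensemble-averaged next-state action values."""
    if done:
        return reward  # no bootstrapping from a terminal state
    n_members = len(next_q_per_member)
    n_actions = len(next_q_per_member[0])
    averaged = [sum(member[a] for member in next_q_per_member) / n_members
                for a in range(n_actions)]
    return reward + gamma * max(averaged)

# Two hypothetical estimators that each overrate a different action.
members = [[1.0, 3.0], [3.0, 1.0]]
averaged_target = ensemble_target(0.0, members, gamma=0.5)
single_target = ensemble_target(0.0, members[:1], gamma=0.5)
```

Because the two estimators overrate different actions, averaging before the maximum gives a lower target than either estimator alone, which is the intuition behind reduced overestimation.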
