Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning

2020 ◽  
Vol 34 (04) ◽  
pp. 5948-5955
Author(s):  
Tian Tan ◽  
Zhihan Xiong ◽  
Vikranth R. Dwaracherla

It is well known that quantifying uncertainty in the action-value estimates is crucial for efficient exploration in reinforcement learning. Ensemble sampling offers a relatively computationally tractable way of doing this using randomized value functions. However, it still requires a huge amount of computational resources for complex problems. In this paper, we present an alternative, computationally efficient way to induce exploration using index sampling. We use an indexed value function to represent uncertainty in our action-value estimates. We first present an algorithm to learn a parameterized indexed value function through a distributional version of temporal difference learning in a tabular setting and prove its regret bound. Then, from a computational point of view, we propose a dual-network architecture, Parameterized Indexed Networks (PINs), comprising one mean network and one uncertainty network to learn the indexed value function. Finally, we show the efficacy of PINs through computational experiments.
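To make the idea of an indexed value function concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the indexed value is the mean estimate plus a random index `z` times the uncertainty estimate, with `z` drawn from a standard normal. The function names and the numbers are hypothetical and are not taken from the paper.

```python
import random

def indexed_q(mean, uncertainty, z):
    """Indexed action values under the assumed form: mean + z * uncertainty."""
    return [m + z * u for m, u in zip(mean, uncertainty)]

def greedy_action(mean, uncertainty, z):
    """Act greedily with respect to the values indexed by z."""
    q = indexed_q(mean, uncertainty, z)
    return max(range(len(q)), key=q.__getitem__)

def sample_index(rng):
    """Assumption: the index is a standard normal draw."""
    return rng.gauss(0.0, 1.0)

rng = random.Random(0)
mean = [1.0, 0.5]         # hypothetical outputs of a mean network (two actions)
uncertainty = [0.0, 1.0]  # hypothetical outputs of an uncertainty network
z = sample_index(rng)
action = greedy_action(mean, uncertainty, z)
```

In this toy setup the second action has a lower mean but higher uncertainty, so it is chosen only when the sampled index is large enough. That is how sampling an index can induce exploration without maintaining an ensemble of value functions.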

2020 ◽  
Vol 8 (6) ◽  
pp. 5251-5255

Exploiting the efficiency and stability of Dynamic Crowd, the paper proposes a hybrid multi-agent crowd simulation algorithm that focuses mainly on identifying the crowd to simulate. An efficient measurement for both static and dynamic crowd simulation is applied in tracking and transportation applications. The proposed Hybrid Agent Reinforcement Learning (HARL) algorithm combines the off-policy value function of Q-learning with the on-policy value function of SARSA and is used for a dynamic crowd evacuation scenario. The HARL algorithm computes multiple value functions and combines the policy value functions derived from the multiple agents to improve performance. In addition, the efficiency of the HARL algorithm is demonstrated across varied crowd sizes. Two kinds of reinforcement learning applications, tracking and transportation monitoring, are used to simulate the crowd sizes.
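The abstract does not say how the two value functions are combined. The sketch below shows the standard Q-learning and SARSA bootstrap targets and one hypothetical way to blend them with a mixing weight `mix`. The blending rule and all parameter values are assumptions for illustration, not the HARL algorithm itself.

```python
def q_learning_target(reward, next_q, gamma):
    """Off-policy target: bootstrap from the best next action."""
    return reward + gamma * max(next_q)

def sarsa_target(reward, next_q, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * next_q[next_action]

def hybrid_update(q_sa, reward, next_q, next_action,
                  alpha=0.5, gamma=0.5, mix=0.5):
    """Move q_sa toward a weighted blend of both targets (mix is hypothetical)."""
    blended = (mix * q_learning_target(reward, next_q, gamma)
               + (1.0 - mix) * sarsa_target(reward, next_q, next_action, gamma))
    return q_sa + alpha * (blended - q_sa)

next_q = [2.0, 4.0]  # hypothetical next-state action values
updated = hybrid_update(0.0, reward=1.0, next_q=next_q, next_action=0)
```

With `mix=1.0` the update reduces to Q-learning, and with `mix=0.0` it reduces to SARSA.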


2021 ◽  
Vol 15 ◽  
Author(s):  
Norberto M. Grzywacz

A theoretical framework for the reinforcement learning of aesthetic biases was recently proposed based on brain circuitries revealed by neuroimaging. A model grounded in that framework accounted for interesting features of human aesthetic biases. These features included individuality, cultural predispositions, stochastic dynamics of learning and aesthetic biases, and the peak-shift effect. However, despite the success in explaining these features, a potential weakness was the linearity of the value function used to predict reward. This linearity meant that the learning process employed a value function that assumed a linear relationship between reward and sensory stimuli. Linearity is common in reinforcement learning in neuroscience. However, linearity can be problematic because neural mechanisms and the dependence of reward on sensory stimuli are typically nonlinear. Here, we analyze the learning performance with models including optimal nonlinear value functions. We also compare updating the free parameters of the value functions with the delta rule, which neuroscience models use frequently, vs. updating with a new Phi rule that considers the structure of the nonlinearities. Our computer simulations showed that optimal nonlinear value functions resulted in improvements of learning errors when the reward models were nonlinear. Similarly, the new Phi rule led to improvements in these errors. These improvements were accompanied by the straightening of the trajectories of the vector of free parameters in its phase space. This straightening meant that the process became more efficient in learning the prediction of reward. Surprisingly, however, this improved efficiency had a complex relationship with the rate of learning. Finally, the stochasticity arising from the probabilistic sampling of sensory stimuli, rewards, and motivations helped the learning process narrow the range of free parameters to nearly optimal outcomes. Therefore, we suggest that value functions and update rules optimized for social and ecological constraints are ideal for learning aesthetic biases.
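For readers unfamiliar with the delta rule mentioned above, the following is a minimal sketch of it applied to a linear value function. The stimulus vector, reward, and learning rate are made-up illustrative values. The Phi rule is not shown because the abstract does not define it.

```python
def predict(weights, stimulus):
    """Linear value function: predicted reward is a weighted sum of the stimulus."""
    return sum(w * x for w, x in zip(weights, stimulus))

def delta_rule_step(weights, stimulus, reward, lr):
    """One delta-rule update: shift weights along the stimulus, scaled by the error."""
    error = reward - predict(weights, stimulus)
    return [w + lr * error * x for w, x in zip(weights, stimulus)]

weights = [0.0, 0.0]
stimulus = [1.0, 2.0]  # hypothetical sensory features
reward = 5.0           # hypothetical observed reward
new_weights = delta_rule_step(weights, stimulus, reward, lr=0.25)
```

Because the update direction is always the stimulus vector itself, the rule takes no account of any nonlinearity in the reward model. This is the limitation the abstract's nonlinear alternatives are meant to address.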


2020 ◽  
Vol 12 (21) ◽  
pp. 8883
Author(s):  
Kun Jin ◽  
Wei Wang ◽  
Xuedong Hua ◽  
Wei Zhou

As a key element of urban transportation, taxi services provide significant convenience and comfort for residents’ travel. In practice, however, they have not been very efficient. Previous research mainly aimed to optimize order dispatch policies for ride-hailing services, which cannot be applied to cruising taxi services. This paper develops a reinforcement learning (RL) framework to optimize driving policies for cruising taxi services. First, we formulate the drivers’ behaviours as a Markov decision process (MDP), considering the long-run influence of each action. An RL framework using dynamic programming and data expansion is employed to calculate the state-action value function. Following the value function, drivers can determine the best choice and quantify the expected future reward at a particular state. Using historical order data from Chengdu, we analyse the spatial distribution of the function values and demonstrate how the model can optimize driving policies. Finally, a realistic simulation of the on-demand platform is built. Compared with other benchmark methods, the results verify that the new model performs better in increasing total revenue and answer rate and in decreasing waiting time, with relative changes of at most 4.8%, 6.2% and −27.27%, respectively.
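As a rough illustration of computing a state-action value function by dynamic programming, the sketch below runs value iteration on a toy deterministic MDP with two zones. The zones, rewards, and discount factor are invented for the example. The paper's actual state space, data expansion step, and horizon are not described in the abstract.

```python
def q_value_iteration(transitions, gamma, sweeps):
    """transitions[s][a] = (reward, next_state) for a deterministic toy MDP."""
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(sweeps):
        new_q = {}
        for s, acts in transitions.items():
            new_q[s] = {}
            for a, (reward, nxt) in acts.items():
                # Bellman backup: immediate reward plus discounted best next value.
                new_q[s][a] = reward + gamma * max(q[nxt].values())
        q = new_q
    return q

# Hypothetical zones: cruising downtown earns more per step than in the suburb.
toy_mdp = {
    "suburb": {"stay": (0.5, "suburb"), "move": (0.0, "downtown")},
    "downtown": {"stay": (2.0, "downtown"), "move": (0.0, "suburb")},
}
q = q_value_iteration(toy_mdp, gamma=0.5, sweeps=60)
best_in_suburb = max(q["suburb"], key=q["suburb"].get)
```

In this toy case a driver in the suburb does better by moving downtown despite earning nothing on the way, which mirrors the abstract's point about accounting for the long-run consequences of an action.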


2021 ◽  
pp. 105971232199942
Author(s):  
Daniel Graves ◽  
Johannes Günther ◽  
Jun Luo

General value functions (GVFs) in the reinforcement learning (RL) literature are long-term predictive summaries of the outcomes of agents following specific policies in the environment. Affordances as perceived action possibilities with specific valence may be cast into predicted policy-relative goodness and modeled as GVFs. A systematic explication of this connection shows that GVFs and especially their deep-learning embodiments (1) realize affordance prediction as a form of direct perception, (2) illuminate the fundamental connection between action and perception in affordance, and (3) offer a scalable way to learn affordances using RL methods. Through an extensive review of existing literature on GVF applications and representative affordance research in robotics, we demonstrate that GVFs provide the right framework for learning affordances in real-world applications. In addition, we highlight a few new avenues of research opened up by the perspective of “affordance as GVF,” including using GVFs for orchestrating complex behaviors.


Author(s):  
Takayuki Osogami ◽  
Rudy Raymond

We study reinforcement learning for controlling multiple agents in a collaborative manner. In some of those tasks, it is not sufficient for the individual agents to take relevant actions; those actions should also be diverse. We propose the approach of using the determinant of a positive semidefinite matrix to approximate the action-value function in reinforcement learning, where we learn the matrix in such a way that it represents the relevance and diversity of the actions. Experimental results show that the proposed approach allows the agents to learn a nearly optimal policy approximately ten times faster than baseline approaches in benchmark tasks of multi-agent reinforcement learning. The proposed approach is also shown to achieve performance that cannot be achieved with conventional approaches in a partially observable environment with an exponentially large action space.
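The following sketch illustrates why a determinant can capture both relevance and diversity. It assumes each agent's action has a feature vector and scores a joint action by the determinant of the Gram matrix of those vectors, which is positive semidefinite by construction. The feature vectors and function names are hypothetical, and the paper's learned matrix may be parameterized differently.

```python
def gram(vectors):
    """Gram matrix of the action feature vectors (positive semidefinite)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def joint_score(features):
    """Score a two-agent joint action by the determinant of its Gram matrix."""
    return det2(gram(features))

diverse = joint_score([[1.0, 0.0], [0.0, 1.0]])    # orthogonal actions
redundant = joint_score([[1.0, 0.0], [1.0, 0.0]])  # identical actions
relevant = joint_score([[2.0, 0.0], [0.0, 1.0]])   # one action with larger norm
```

Under this assumption, longer feature vectors (more relevant actions) raise the score and parallel vectors (redundant actions) drive it to zero.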


2019 ◽  
Author(s):  
Jordão Memória ◽  
José Maia

In this work, a model and an algorithm based on multi-agent reinforcement learning are developed for the problem of elevator group dispatch. The main advantage is that, along with function approximation, this multi-agent solution reduces the state space, allowing complex states to be addressed with a synthesizing evaluation function. Each elevator is considered an agent that has to decide between two actions: answer or ignore the new call. After some iterations, the agents learn the weights of an evaluation function that approximates the state-action value function. The performance of the solution, measured by average waiting time (AWT) under varying traffic patterns, flows of people, numbers of elevators and numbers of floors, is comparable to other current proposals reported in the literature.


2020 ◽  
Vol 34 (04) ◽  
pp. 3741-3748
Author(s):  
Kristopher De Asis ◽  
Alan Chan ◽  
Silviu Pitis ◽  
Richard Sutton ◽  
Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
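The bootstrapping scheme described above can be sketched in tabular form as follows. This is a minimal illustration on a made-up two-state chain with a reward of 1 on every step. The table layout, step size, and environment are assumptions for the example rather than the authors' experimental setup.

```python
def fixed_horizon_td_update(values, state, reward, next_state, alpha, gamma=1.0):
    """values[h][s] estimates the return over the next h steps; values[0] stays 0.

    Horizons are updated from longest to shortest so that each target reads the
    horizon h-1 estimate from before this transition was applied.
    """
    for h in range(len(values) - 1, 0, -1):
        target = reward + gamma * values[h - 1][next_state]
        values[h][state] += alpha * (target - values[h][state])
    return values

states = ["a", "b"]
max_horizon = 3
values = [{s: 0.0 for s in states} for _ in range(max_horizon + 1)]

# Toy chain: a -> b -> a -> ... with reward 1.0 on every transition.
state = "a"
for _ in range(10):
    next_state = "b" if state == "a" else "a"
    fixed_horizon_td_update(values, state, 1.0, next_state, alpha=1.0)
    state = next_state
```

In this toy chain with no discounting, the horizon-h value of every state should equal h, and no table ever bootstraps from itself.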


2018 ◽  
Vol 2018 ◽  
pp. 1-6 ◽  
Author(s):  
Xi-liang Chen ◽  
Lei Cao ◽  
Chen-xi Li ◽  
Zhi-xiong Xu ◽  
Jun Lai

The popular deep Q-learning algorithm is known to be unstable because of oscillation in the Q-values and overestimation of action values under certain conditions. These issues tend to adversely affect its performance. In this paper, we develop an ensemble network architecture for deep reinforcement learning that is based on value function approximation. The temporal ensemble stabilizes the training process by reducing the variance of the target approximation error, and the ensemble of target values reduces overestimation and improves performance by estimating more accurate Q-values. Our results show that this architecture leads to statistically significantly better value evaluation and to more stable and better performance on several classical control tasks in the OpenAI Gym environment.
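One common way to build an ensemble target is to average the next-state Q-values of several target estimators before taking the maximum. The sketch below shows that idea with plain lists standing in for network outputs. It is an assumption about the general technique and may differ from the exact architecture in the paper.

```python
def ensemble_target(reward, next_q_per_member, gamma, done=False):
    """TD target from the per-action mean of several members' next-state Q-values."""
    if done:
        return reward  # no bootstrapping past a terminal state
    n_members = len(next_q_per_member)
    n_actions = len(next_q_per_member[0])
    averaged = [
        sum(member[a] for member in next_q_per_member) / n_members
        for a in range(n_actions)
    ]
    return reward + gamma * max(averaged)

# Hypothetical outputs of two target estimators that disagree about the best action.
members = [[1.0, 3.0], [3.0, 1.0]]
target = ensemble_target(reward=1.0, next_q_per_member=members, gamma=0.5)
single_member_target = 1.0 + 0.5 * max(members[0])
```

Here each member alone would bootstrap from a value of 3.0, while the averaged estimate bootstraps from 2.0. The example shows how averaging before the max can damp overestimation caused by noisy individual estimates.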

