Gamma-Nets: Generalizing Value Estimation over Timescale

2020, Vol 34 (04), pp. 5717-5725
Author(s): Craig Sherstan, Shibhansh Dohare, James MacGlashan, Johannes Günther, Patrick M. Pilarski

Temporal abstraction is a key requirement for agents making decisions over long time horizons, a fundamental challenge in reinforcement learning. There are many reasons why value estimates at multiple timescales might be useful; recent work has shown that value estimates at different timescales can be the basis for creating more advanced discounting functions and for driving representation learning. Further, predictions at many different timescales serve to broaden an agent's model of its environment. One predictive approach of interest within an online learning setting is general value functions (GVFs), which represent models of an agent's world as a collection of predictive questions, each defined by a policy, a signal to be predicted, and a prediction timescale. In this paper we present Γ-nets, a method for generalizing value function estimation over timescale, allowing a given GVF to be trained and queried for arbitrary timescales so as to greatly increase the predictive ability and scalability of a GVF-based model. The key to our approach is to use timescale as one of the value estimator's inputs. As a result, the prediction target for any timescale is available at every timestep and we are free to train on any number of timescales. We first provide two demonstrations by 1) predicting a square wave and 2) predicting sensorimotor signals on a robot arm using a linear function approximator. Next, we empirically evaluate Γ-nets in the deep reinforcement learning setting using policy evaluation on a set of Atari video games. Our results show that Γ-nets can be effective for predicting arbitrary timescales, with only a small cost in accuracy as compared to learning estimators for fixed timescales. Γ-nets provide a method for accurately and compactly making predictions at many timescales without requiring a priori knowledge of the task, making them a valuable contribution to ongoing work on model-based planning, representation learning, and lifelong learning algorithms.
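The central idea in the abstract above, feeding the timescale to the value estimator as an input, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the authors' setup: the 8-state cyclic square wave, the polynomial features [1, γ, γ²], the step size, and the function names are all hypothetical. The sketch uses linear TD(0) and trains on several randomly drawn discounts at every timestep, which is possible because the bootstrapped target for any γ can be computed from the same transition.

```python
import random

N_STATES = 8  # hypothetical: one period of a square wave, one state per phase


def cumulant(state):
    """Square-wave signal: 1 on the first half of the period, 0 on the second."""
    return 1.0 if state < N_STATES // 2 else 0.0


def features(state, gamma):
    """One-hot state crossed with [1, gamma, gamma^2]: the timescale is an input."""
    phi = [0.0] * (3 * N_STATES)
    base = 3 * state
    phi[base], phi[base + 1], phi[base + 2] = 1.0, gamma, gamma * gamma
    return phi


def value(w, state, gamma):
    """Linear value estimate v(s, gamma) = w . phi(s, gamma)."""
    return sum(wi * xi for wi, xi in zip(w, features(state, gamma)))


def td_update(w, state, next_state, gamma, alpha):
    """One TD(0) step for a single timescale; updates w in place, returns the TD error."""
    target = cumulant(next_state) + gamma * value(w, next_state, gamma)
    delta = target - value(w, state, gamma)
    for i, xi in enumerate(features(state, gamma)):
        w[i] += alpha * delta * xi
    return delta


def train(steps=20000, alpha=0.05, gammas_per_step=4, seed=0):
    """Walk the cycle and, at each step, train on several randomly drawn timescales."""
    rng = random.Random(seed)
    w = [0.0] * (3 * N_STATES)
    state = 0
    for _ in range(steps):
        next_state = (state + 1) % N_STATES
        for _ in range(gammas_per_step):
            td_update(w, state, next_state, rng.uniform(0.0, 0.95), alpha)
        state = next_state
    return w
```

After training, `value(w, state, gamma)` can be queried for any discount in the sampled range, including values never drawn during training. How well it interpolates depends on the feature choice, which is the part this sketch simplifies most.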

Mathematics, 2020, Vol 8 (9), pp. 1479
Author(s): Francisco Martinez-Gil, Miguel Lozano, Ignacio García-Fernández, Pau Romero, Dolors Serra, ...

Reinforcement learning is one of the most promising machine learning techniques for obtaining intelligent behaviors for embodied agents in simulations. The output of the classic Temporal Difference family of Reinforcement Learning algorithms takes the form of a value function expressed as a numeric table or a function approximator. The learned behavior is then derived using a greedy policy with respect to this value function. Nevertheless, sometimes the learned policy does not meet expectations, and the task of authoring is difficult and unsafe because modifying one value or parameter in the learned value function has unpredictable consequences in the space of the policies it represents. This rules out direct manipulation of the learned value function as a method to modify the derived behaviors. In this paper, we propose the use of Inverse Reinforcement Learning to incorporate real behavior traces in the learning process to shape the learned behaviors, thus increasing their trustworthiness (in terms of conformance to reality). To do so, we adapt the Inverse Reinforcement Learning framework to the navigation problem domain. Specifically, we use Soft Q-learning, an algorithm based on the maximum causal entropy principle, with MARL-Ped (a Reinforcement Learning-based pedestrian simulator) to include information from trajectories of real pedestrians in the process of learning how to navigate inside a virtual 3D space that represents the real environment. A comparison with the behaviors learned using a classic Reinforcement Learning algorithm (Sarsa(λ)) shows that the Inverse Reinforcement Learning behaviors fit the real trajectories significantly better.


2021, Vol 2021, pp. 1-10
Author(s): Danling Dong, Libo Wu

At present, there is a serious disconnect between online and offline teaching in large-scale hybrid teaching recommendation platforms for English MOOCs. This is mainly due to the cold-start and matrix-sparsity problems of the recommendation algorithm, and to the difficulty of fully capturing a user's interests when only the user's rating is considered and the user's personalized evaluation is neglected. To solve these problems, this paper proposes using ideas from reinforcement learning together with user evaluation factors to build an online and offline hybrid English teaching recommendation platform. First, the idea of value function estimation in reinforcement learning is introduced, and the difference between user state value functions is used to replace the previous similarity calculation method, thus alleviating the matrix-sparsity problem. The learning rate is used to control the convergence speed of the weight vector in the user state value function to alleviate the cold-start problem. Second, by adding the learning of the user evaluation vector to the estimation of the state value function, the user's state value function can be estimated approximately and the discrimination degree of the target user can be reflected. Experimental results show that the proposed recommendation algorithm effectively alleviates the cold-start and matrix-sparsity problems of current collaborative filtering recommendation algorithms, captures the characteristics of users' interests in greater depth, and further improves the accuracy of rating prediction.


2016, Vol 75 (s1)
Author(s): Roberto Bertoni, Martino Bertoni, Giuseppe Morabito, Michela Rogora, Cristiana Callieri

Limnologists have long recognized that one of the goals of their discipline is to increase its predictive capability. In recent years, the role of prediction in applied ecology has grown, mainly due to man’s increased ability to change the biosphere. Such alterations often came with unplanned and noticeably negative side effects arising from a lack of proper attention to long-term consequences. Regression analysis of common limnological parameters has been successfully applied to develop predictive models relating the variability of limnological parameters to specific key causes. These approaches, though, are biased by the requirement of an a priori cause-relation assumption, which is often difficult to find in the complex, nonlinear relationships entangling ecological data. A set of quantitative tools that can help address current environmental challenges while avoiding such restrictions is currently being researched and developed within the framework of ecological informatics. One of these approaches, which attempts to model the relationship between a set of inputs and known outputs, is based on genetic algorithms and programming (GP). This stochastic optimization tool is based on the process of evolution in natural systems and was inspired by a direct analogy to sexual reproduction and Charles Darwin’s principle of natural selection. GP works through genetic algorithms that use selection and recombination operators to generate a population of equations. Thanks to a 25-year-long time series of regular limnological data, the deep, large, oligotrophic Lake Maggiore (Northern Italy) is the ideal case study to test the predictive ability of GP. Testing of GP on the multi-year data series of this lake has allowed us to verify the forecasting efficacy of the models emerging from GP application. In addition, this non-deterministic approach led to the discovery of non-obvious relationships between variables and enabled the formulation of new stochastic models.


2021, Vol 70, pp. 319-349
Author(s): Yongcan Cao, Huixin Zhan

Solving multi-objective optimization problems is important in various applications where users are interested in obtaining optimal policies subject to multiple (yet often conflicting) objectives. A typical approach to obtaining the optimal policies is to first construct a loss function based on the scalarization of individual objectives and then derive optimal policies that minimize the scalarized loss function. Albeit simple and efficient, the typical approach provides no insight into, or mechanism for, the optimization of multiple objectives because it cannot quantify the inter-objective relationship. To address the issue, we propose a new, efficient gradient-based multi-objective reinforcement learning approach that seeks to iteratively uncover the quantitative inter-objective relationship by finding a minimum-norm point in the convex hull of the set of multiple policy gradients when the impact of one objective on others is unknown a priori. In particular, we first propose a new PAOLS algorithm that integrates pruning and the approximate optimistic linear support algorithm to efficiently discover the weight-vector sets of multiple gradients that quantify the inter-objective relationship. Then we construct an actor and a multi-objective critic that can co-learn the policy and the multi-objective vector value function. Finally, the weight discovery process and the policy and vector value function learning process can be iteratively executed to yield stable weight-vector sets and policies. To validate the effectiveness of the proposed approach, we present a quantitative evaluation of the approach based on three case studies.
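The minimum-norm point in the convex hull of policy gradients, mentioned in the abstract above, has a closed form when there are only two gradients. The sketch below covers that two-objective special case only and is not the PAOLS algorithm; the function name and the plain-list representation of gradients are assumptions made for illustration.

```python
def min_norm_point(g1, g2):
    """Return (alpha, point) minimizing ||alpha*g1 + (1-alpha)*g2|| over alpha in [0, 1]."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        # Identical gradients: every convex combination is the same point.
        return 0.5, list(g1)
    # Unconstrained minimizer: alpha = (g2 - g1) . g2 / ||g1 - g2||^2
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / diff_sq
    # Clip so that the point stays inside the convex hull.
    alpha = min(1.0, max(0.0, alpha))
    point = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, point
```

For orthogonal unit gradients such as `[1.0, 0.0]` and `[0.0, 1.0]`, the weights come out equal. When one gradient is a longer copy of the other, the clipping step puts all the weight on the shorter one.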


Author(s): Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, Longbo Huang

Value function estimation, i.e., prediction, is an important task in reinforcement learning. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with the dynamic Boltzmann softmax (DBS) operator, which has good convergence properties in the settings of planning and learning. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function, which rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying the DBS operator, which outperforms DQN substantially in 40 out of 49 Atari games.
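To make the operator in the abstract above concrete, here is a minimal sketch of the Boltzmann softmax operator, together with a value iteration in which the inverse temperature grows with the iteration count. The schedule β_t = t, the two-state MDP, and all names are assumptions chosen for illustration. The abstract does not specify the schedule, so this should not be read as the paper's exact DBS definition.

```python
import math


def boltzmann_softmax(values, beta):
    """Boltzmann-weighted average of values; beta=0 gives the mean, large beta nears the max."""
    m = max(values)
    # Subtracting the max keeps every exponent <= 0, which avoids overflow.
    weights = [math.exp(beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical deterministic MDP with 2 states and 2 actions.
REWARD = [[0.0, 1.0], [0.0, 2.0]]   # REWARD[s][a]
NEXT_STATE = [[0, 1], [0, 1]]       # NEXT_STATE[s][a]


def dbs_value_iteration(gamma=0.5, iterations=200):
    """Q-value iteration whose backup uses Boltzmann softmax with beta_t = t (assumed)."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for t in range(1, iterations + 1):
        beta_t = float(t)
        q = [
            [
                REWARD[s][a] + gamma * boltzmann_softmax(q[NEXT_STATE[s][a]], beta_t)
                for a in range(2)
            ]
            for s in range(2)
        ]
    return q
```

In this toy MDP the optimal action values under the max operator are `[[1.5, 3.0], [1.5, 4.0]]`. Because β_t grows without bound, the softmax backup approaches the max backup and the iteration settles on those values, whereas a fixed small β would leave a persistent gap.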


2020
Author(s): Gabriel Moraes Barros, Esther Colombini

In robotics, the ultimate goal of reinforcement learning is to endow robots with the ability to learn, improve, adapt, and reproduce tasks with dynamically changing constraints based on exploration and autonomous learning. Reinforcement Learning (RL) aims at addressing this problem by enabling a robot to learn behaviors through trial and error. With RL, a neural network can be trained as a function approximator to directly map states to actuator commands, making any predefined control structure unnecessary for training. However, the knowledge required for these methods to converge is usually built from scratch. Learning may take a long time, not to mention that RL algorithms need a stated reward function, and it is not always trivial to define one. Often it is easier for a teacher, human or intelligent agent, to demonstrate the desired behavior or how to accomplish a given task. Humans and other animals have a natural ability to learn skills from observation, often from merely seeing these skills’ effects, without direct knowledge of the underlying actions. The same principle exists in Imitation Learning, a practical approach for autonomous systems to acquire control policies when an explicit reward function is unavailable, using supervision provided as demonstrations from an expert, typically a human operator. In this scenario, this work’s primary objective is to design an agent that can successfully imitate a previously acquired control policy using Imitation Learning. The chosen algorithm is GAIL, since we consider it the proper algorithm to tackle this problem by utilizing expert (state, action) trajectories. As reference expert trajectories, we implement the state-of-the-art on-policy and off-policy methods PPO and SAC. Results show that the learned policies for all three methods can solve the task of low-level control of a quadrotor and that all can generalize on the original tasks.

