Reward Shaping and Mixed Resolution Function Approximation

Author(s): Marek Grzes, Daniel Kudenko

A crucial trade-off is involved in the design process when function approximation is used in reinforcement learning. Ideally, the chosen representation should be able to approximate the value function as closely as possible. However, the more expressive the representation, the more training data is needed, because the space of candidate hypotheses is larger. A less expressive representation has a smaller hypothesis space, so a good candidate can be found faster. The core idea of this chapter is the use of mixed resolution function approximation: a less expressive function approximation provides useful guidance during learning, while a more expressive function approximation yields a final result of high quality. A major question is how to combine the two representations. Two approaches are proposed and evaluated empirically: the use of two resolutions within a single function approximation, and a more sophisticated algorithm based on reward shaping.
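As a concrete illustration of the shaping variant, the sketch below shows how a coarse value estimate can serve as the potential function that shapes the reward seen by a finer approximator. This is a minimal sketch, not the chapter's implementation: the one-dimensional state in [0, 1), the bin counts, and the TD(0) updates are assumptions made for brevity.

```python
import numpy as np

# Minimal sketch, not the chapter's implementation: a coarse state aggregation
# is learned alongside a fine one, and the coarse value estimate serves as the
# potential Phi(s) that shapes the reward seen by the fine approximator.
N_COARSE, N_FINE = 10, 100     # resolutions of the two approximators (assumed)
GAMMA, ALPHA = 0.99, 0.1

V_coarse = np.zeros(N_COARSE)  # less expressive: quick, rough guidance
V_fine = np.zeros(N_FINE)      # more expressive: the final, high-quality estimate

def bin_index(s, n):
    """Map a state s in [0, 1) to one of n aggregation bins."""
    return min(int(s * n), n - 1)

def update(s, r, s_next, done):
    c, c_next = bin_index(s, N_COARSE), bin_index(s_next, N_COARSE)
    f, f_next = bin_index(s, N_FINE), bin_index(s_next, N_FINE)

    # Potential-based shaping term F = gamma * Phi(s') - Phi(s), with Phi = V_coarse.
    phi_next = 0.0 if done else V_coarse[c_next]
    F = GAMMA * phi_next - V_coarse[c]

    # TD(0) update of the coarse approximation, learned in parallel.
    V_coarse[c] += ALPHA * (r + GAMMA * phi_next - V_coarse[c])

    # TD(0) update of the fine approximation, driven by the shaped reward r + F.
    v_next_fine = 0.0 if done else V_fine[f_next]
    V_fine[f] += ALPHA * (r + F + GAMMA * v_next_fine - V_fine[f])
```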

2016, Vol. 2016, pp. 1-15
Author(s): Shan Zhong, Quan Liu, QiMing Fu

To improve the convergence rate and sample efficiency, two learning methods, AC-HMLP and RAC-HMLP (AC-HMLP with l2-regularization), are proposed by combining an actor-critic algorithm with hierarchical model learning and planning. The hierarchical model consists of a local model and a global model, which are learned simultaneously with the value function and the policy and are approximated by local linear regression (LLR) and linear function approximation (LFA), respectively. Both models are used to generate samples for planning: the local model is used at each time step only if its state-prediction error does not exceed a threshold, while the global model is used at the end of each episode. Employing both models aims to improve sample efficiency and accelerate convergence of the whole algorithm by fully exploiting local and global information. Experimentally, AC-HMLP and RAC-HMLP are compared with three representative algorithms on two Reinforcement Learning (RL) benchmark problems. The results demonstrate that the proposed methods perform best in terms of convergence rate and sample efficiency.
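A self-contained sketch of the local model alone is given below: local linear regression over the K nearest stored transitions, together with the error-threshold gate that decides whether its prediction may be used for step-wise planning. This is not the authors' code; the toy dynamics, neighborhood size K, and threshold are assumptions, and the global LFA model and actor-critic updates are omitted.

```python
import numpy as np

# Sketch of the local model only (not the authors' code): local linear
# regression (LLR) over the K nearest stored transitions, plus the error
# threshold gate for step-wise planning. Toy dynamics, K, threshold assumed.
rng = np.random.default_rng(0)
K, ERR_THRESHOLD = 8, 0.05

# Replay of observed transitions s -> s', here from toy dynamics s' = 0.9 s + noise.
S = rng.uniform(-1, 1, size=(200, 2))
S_next = 0.9 * S + 0.01 * rng.standard_normal(S.shape)

def llr_predict(s):
    """Fit s' = W^T [s; 1] on the K nearest neighbours of s and predict s'."""
    nn = np.argsort(np.linalg.norm(S - s, axis=1))[:K]
    X = np.hstack([S[nn], np.ones((K, 1))])               # bias column
    W, *_ = np.linalg.lstsq(X, S_next[nn], rcond=None)
    return np.append(s, 1.0) @ W

# Gate: trust the local model for planning only while its one-step error is small.
s, s_next = S[0], S_next[0]
error = np.linalg.norm(llr_predict(s) - s_next)
print("local model trusted for planning:", error <= ERR_THRESHOLD)
```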


2013, Vol. 756-759, pp. 3967-3971
Author(s): Bo Yan Ren, Zheng Qin, Feng Fei Zhao

Linear value function approximation with binary features is important in Reinforcement Learning (RL) research. When updating the value function, it is necessary to generate a feature vector that contains the features to be updated. In high-dimensional domains, this generation process takes considerably more time, which significantly degrades the algorithm's performance. Hence, this paper introduces the Optional Feature Vector Generation (OFVG) algorithm, an improved method for generating feature vectors that can be combined with any online, value-based RL method that uses and expands binary features. The paper shows empirically that OFVG performs well in high-dimensional domains.
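The setting OFVG targets can be illustrated as follows; this is not the OFVG algorithm itself, which the abstract does not detail. With binary features, only the indices of the active features need to be generated, and the linear TD update then touches just those weights. The hash-based feature generator and the sizes below are placeholders.

```python
import numpy as np

# Illustration of the setting OFVG targets, not the OFVG algorithm itself.
N_FEATURES = 1_000_000         # high-dimensional binary feature space (assumed)
N_ACTIVE = 8                   # features active per state, e.g. one per tiling
ALPHA, GAMMA = 0.1, 0.99
w = np.zeros(N_FEATURES)

def active_features(state):
    """Return the indices of the few binary features that are on for `state`."""
    return [hash((state, k)) % N_FEATURES for k in range(N_ACTIVE)]

def td_update(s, r, s_next, done):
    phi, phi_next = active_features(s), active_features(s_next)
    v = w[phi].sum()
    v_next = 0.0 if done else w[phi_next].sum()
    delta = r + GAMMA * v_next - v
    w[phi] += ALPHA * delta    # only the active weights are touched
```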


Author(s): Paniz Behboudian, Yash Satsangi, Matthew E. Taylor, Anna Harutyunyan, Michael Bowling

Reinforcement learning (RL) is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward: by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such potential-based reward shaping (PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced dynamic potential-based advice (DPBA) was proposed to tackle this challenge by learning the potential function values as part of the learning process. However, this article demonstrates theoretically and empirically that, while DPBA can facilitate learning with good advice, it does in fact alter the optimal policy. We further show that when a correction term is added to "fix" DPBA, it no longer provides effective shaping with good advice. We then present a simple method called policy invariant explicit shaping (PIES) and show theoretically and empirically that PIES can use arbitrary advice, speed up learning, and leave the optimal policy unchanged.
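For reference, the policy-invariance guarantee discussed here comes from standard potential-based reward shaping, sketched below in tabular Q-learning. The toy chain potential and problem sizes are assumptions; DPBA and PIES themselves are not reproduced here.

```python
import numpy as np

# Tabular Q-learning with potential-based reward shaping (PBRS): the shaping
# term F = gamma * Phi(s') - Phi(s) is added to the environment reward, which
# provably leaves the optimal policy unchanged. Toy potential and sizes assumed.
N_STATES, N_ACTIONS = 20, 2
GAMMA, ALPHA = 0.99, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))
phi = np.linspace(0.0, 1.0, N_STATES)   # toy advice: states nearer the goal rank higher

def shaped_q_update(s, a, r, s_next, done):
    F = (0.0 if done else GAMMA * phi[s_next]) - phi[s]      # PBRS shaping term
    target = r + F + (0.0 if done else GAMMA * Q[s_next].max())
    Q[s, a] += ALPHA * (target - Q[s, a])
```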


Author(s): H. N. Mhaskar, S. V. Pereverzyev, M. D. van der Walt

The problem of real-time prediction of blood glucose (BG) levels based on the readings from a continuous glucose monitoring (CGM) device is of great importance in diabetes care, and has therefore attracted a lot of research in recent years, especially based on machine learning. An accurate prediction with a 30, 60, or 90 min prediction horizon has the potential of saving millions of dollars in emergency care costs. In this paper, we treat the problem as one of function approximation, where the value of the BG level at time t+h (where h is the prediction horizon) is considered to be an unknown function of d readings prior to time t. This unknown function may be supported, in particular, on some unknown submanifold of the d-dimensional Euclidean space. While manifold learning is classically done in a semi-supervised setting, where the entire data set has to be known in advance, we use recent ideas to achieve an accurate function approximation in a supervised setting; i.e., we construct a model for the target function. We use the state-of-the-art, clinically relevant PRED-EGA grid to evaluate our results, and demonstrate that, for a real-life dataset, our method performs better than a standard deep network, especially in hypoglycemic and hyperglycemic regimes. One noteworthy aspect of this work is that the training data and test data may come from different distributions.
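The supervised formulation can be sketched as a sliding-window construction, shown below. The synthetic series, the window length d = 7, and the 30-minute horizon are illustrative stand-ins, not the paper's data or settings.

```python
import numpy as np

# Sketch of the supervised formulation: from a CGM series sampled every
# 5 minutes, build pairs (d readings up to time t) -> (BG level at t + h).
SAMPLE_MIN = 5                               # CGM sampling interval (minutes)
d, horizon_min = 7, 30                       # assumed window length and horizon
h = horizon_min // SAMPLE_MIN                # prediction horizon in samples

t_axis = np.arange(1000)
cgm = 120 + 30 * np.sin(2 * np.pi * t_axis / 288)   # stand-in for real CGM readings (mg/dL)

X = np.stack([cgm[i - d:i] for i in range(d, len(cgm) - h)])      # d readings ending at time t
y = np.array([cgm[i - 1 + h] for i in range(d, len(cgm) - h)])    # target: BG level at t + h
print(X.shape, y.shape)   # (len(cgm) - d - h, d) and (len(cgm) - d - h,)
```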


2008, Vol. 17 (01), pp. 159-174
Author(s): Daniel Stronger, Peter Stone

In order for an autonomous agent to behave robustly in a variety of environments, it must have the ability to learn approximations to many different functions. The function approximator used by such an agent is subject to a number of constraints that may not apply in a traditional supervised learning setting. Many different function approximators exist and are appropriate for different problems. This paper proposes a set of criteria for function approximators for autonomous agents. Additionally, for those problems on which polynomial regression is a candidate technique, the paper presents an enhancement that meets these criteria. In particular, using polynomial regression typically requires a manual choice of the polynomial's degree, trading off between function accuracy and computational and memory efficiency. Polynomial Regression with Automated Degree (PRAD) is a novel function approximation method that uses training data to automatically identify an appropriate degree for the polynomial. PRAD is fully implemented. Empirical tests demonstrate its ability to efficiently and accurately approximate both a wide variety of synthetic functions and real-world data gathered by a mobile robot.
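A rough sketch of the underlying idea follows. It is not PRAD itself: the paper's degree-selection criterion is replaced here by simple held-out validation, and the synthetic cubic data is an assumption for illustration.

```python
import numpy as np

# Choose a polynomial degree automatically from the training data, here by
# held-out validation error rather than a manual choice (not the PRAD method).
def fit_poly_auto_degree(x, y, max_degree=10, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = max(1, int(val_fraction * len(x)))
    val, train = idx[:n_val], idx[n_val:]

    best_deg, best_err, best_coef = None, np.inf, None
    for deg in range(max_degree + 1):
        coef = np.polyfit(x[train], y[train], deg)
        err = np.mean((np.polyval(coef, x[val]) - y[val]) ** 2)
        if err < best_err:
            best_deg, best_err, best_coef = deg, err, coef
    return best_deg, best_coef

# Example: data from a noisy cubic; the selected degree should be near 3.
x = np.linspace(-1, 1, 200)
y = 2 * x**3 - x + 0.05 * np.random.default_rng(1).standard_normal(x.shape)
deg, coef = fit_poly_auto_degree(x, y)
print("selected degree:", deg)
```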


2021, Vol. 2021 (4), pp. 163-183
Author(s): Wenxiao Wang, Tianhao Wang, Lun Wang, Nanqing Luo, Pan Zhou, ...

Deep learning techniques have achieved remarkable performance on a wide range of tasks. However, when trained on privacy-sensitive datasets, the model parameters may expose private information in the training data. Prior attempts at differentially private training, although offering rigorous privacy guarantees, lead to much lower model performance than their non-private counterparts. Moreover, different runs of the same training algorithm produce models with large performance variance. To address these issues, we propose DPlis (Differentially Private Learning wIth Smoothing). The core idea of DPlis is to construct a smooth loss function that favors noise-resilient models lying in large flat regions of the loss landscape. We provide theoretical justification for the utility improvements of DPlis. Extensive experiments also demonstrate that DPlis can effectively boost model quality and training stability under a given privacy budget.
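The smoothing idea can be illustrated with a toy one-dimensional loss; this is not the DPlis implementation. The smoothed loss is the local average of the loss under Gaussian parameter perturbations, estimated by Monte Carlo, which inflates sharp minima while leaving flat ones nearly unchanged. The noise scale, sample count, and toy loss are assumptions chosen only to show the effect.

```python
import numpy as np

# Toy illustration of loss smoothing, not the DPlis implementation: the
# smoothed loss is E[L(w + u)], u ~ N(0, sigma^2 I), estimated by Monte Carlo.
def smoothed_loss(loss_fn, w, sigma=0.2, k=64, seed=0):
    rng = np.random.default_rng(seed)
    perturbed = w + sigma * rng.standard_normal((k, w.size))
    return np.mean([loss_fn(p) for p in perturbed])

# A sharp minimum at w = 0 (raw loss 0) and a flat minimum at w = 3 (raw loss 0.5).
loss = lambda w: min((10 * w[0]) ** 2, (w[0] - 3) ** 2 + 0.5)

print(smoothed_loss(loss, np.array([0.0])))   # sharp minimum is inflated far above the flat one
print(smoothed_loss(loss, np.array([3.0])))   # flat minimum stays close to its raw value
```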


2002, Vol. 11 (02), pp. 189-202
Author(s): Rudy Setiono, Arnulfo Azcarraga

Neural networks with a single hidden layer are known to be universal function approximators. However, due to the complexity of the network topology and the nonlinear transfer function used in computing the hidden unit activations, the predictions of a trained network are difficult to comprehend. On the other hand, predictions from a multiple linear regression equation are easy to understand but are not accurate when the underlying relationship between the input variables and the output variable is nonlinear. We have thus developed a method for multivariate function approximation that combines neural network learning, clustering, and multiple regression. This method generates a set of multiple linear regression equations using neural networks, where the number of regression equations is determined by clustering the weighted input variables. The predictions for samples of the same cluster are computed by the same regression equation. Experimental results on a number of real-world data sets demonstrate that this new method generates relatively few regression equations from the training data samples. Yet, because it draws on the universal function approximation capacity of neural networks, its predictive accuracy is high. The prediction errors are comparable to or lower than those achieved by existing function approximation methods.
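A rough sketch of such a pipeline is given below. It is not the paper's method: the synthetic data, the use of scikit-learn estimators, and the way inputs are weighted by the learned input-to-hidden weights are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Illustrative pipeline (not the paper's method): train a single-hidden-layer
# network, cluster the weighted inputs, then fit one regression per cluster.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.standard_normal(500)

# 1. Train a single-hidden-layer network on the data.
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000, random_state=0).fit(X, y)

# 2. Cluster the weighted input variables (inputs projected by the learned weights).
W = net.coefs_[0]                                  # input-to-hidden weights, shape (3, 10)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X @ W)

# 3. Fit one multiple linear regression equation per cluster.
labels = km.labels_
equations = [LinearRegression().fit(X[labels == c], y[labels == c]) for c in range(4)]

def predict(x_new):
    c = km.predict(x_new.reshape(1, -1) @ W)[0]    # assign the sample to a cluster
    return equations[c].predict(x_new.reshape(1, -1))[0]

print(predict(np.array([0.2, -0.5, 0.1])))
```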


2020, Vol. 34 (04), pp. 3741-3748
Author(s): Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
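A minimal tabular sketch of the bootstrapping structure follows; the state count, maximum horizon, and undiscounted return are illustrative choices, not the paper's settings.

```python
import numpy as np

# Fixed-horizon TD in tabular form: the value for horizon h bootstraps from
# the value for horizon h - 1, so no value function ever bootstraps from itself.
N_STATES, H = 20, 10
ALPHA = 0.1
V = np.zeros((H + 1, N_STATES))   # V[h, s]: expected sum of the next h rewards; V[0, :] stays 0

def fixed_horizon_update(s, r, s_next, done):
    """Update all horizons from a single transition (s, r, s_next)."""
    for h in range(1, H + 1):
        bootstrap = 0.0 if done else V[h - 1, s_next]
        V[h, s] += ALPHA * (r + bootstrap - V[h, s])
```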

