Reward Shaping and Mixed Resolution Function Approximation

Author(s): Marek Grzes, Daniel Kudenko

A crucial trade-off is involved in the design process when function approximation is used in reinforcement learning. Ideally, the chosen representation should be able to approximate the value function as closely as possible. However, the more expressive the representation, the more training data is needed, because the space of candidate hypotheses is larger. A less expressive representation has a smaller hypothesis space, so a good candidate can be found faster. The core idea of this chapter is the use of mixed resolution function approximation: a less expressive function approximation provides useful guidance during learning, while a more expressive function approximation yields a final result of high quality. A major question is how to combine the two representations. Two approaches are proposed and evaluated empirically: the use of two resolutions within a single function approximation, and a more sophisticated algorithm based on reward shaping.
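As a concrete illustration of the shaping variant, the sketch below shows how a coarse value estimate can serve as the potential function that shapes the reward seen by a finer approximator. This is a minimal sketch, not the chapter's implementation: the one-dimensional state in [0, 1), the bin counts, and the TD(0) updates are assumptions made for brevity.

```python
import numpy as np

# Minimal sketch, not the chapter's implementation: a coarse state aggregation
# is learned alongside a fine one, and the coarse value estimate serves as the
# potential Phi(s) that shapes the reward seen by the fine approximator.
N_COARSE, N_FINE = 10, 100     # resolutions of the two approximators (assumed)
GAMMA, ALPHA = 0.99, 0.1

V_coarse = np.zeros(N_COARSE)  # less expressive: quick, rough guidance
V_fine = np.zeros(N_FINE)      # more expressive: the final, high-quality estimate

def bin_index(s, n):
    """Map a state s in [0, 1) to one of n aggregation bins."""
    return min(int(s * n), n - 1)

def update(s, r, s_next, done):
    c, c_next = bin_index(s, N_COARSE), bin_index(s_next, N_COARSE)
    f, f_next = bin_index(s, N_FINE), bin_index(s_next, N_FINE)

    # Potential-based shaping term F = gamma * Phi(s') - Phi(s), with Phi = V_coarse.
    phi_next = 0.0 if done else V_coarse[c_next]
    F = GAMMA * phi_next - V_coarse[c]

    # TD(0) update of the coarse approximation, learned in parallel.
    V_coarse[c] += ALPHA * (r + GAMMA * phi_next - V_coarse[c])

    # TD(0) update of the fine approximation, driven by the shaped reward r + F.
    v_next_fine = 0.0 if done else V_fine[f_next]
    V_fine[f] += ALPHA * (r + F + GAMMA * v_next_fine - V_fine[f])
```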

2016, Vol. 2016, pp. 1-15
Author(s): Shan Zhong, Quan Liu, QiMing Fu

To improve the convergence rate and sample efficiency, two learning methods, AC-HMLP and RAC-HMLP (AC-HMLP with l2-regularization), are proposed by combining an actor-critic algorithm with hierarchical model learning and planning. The hierarchical model consists of a local model and a global model, which are learned simultaneously with the value function and the policy and are approximated by local linear regression (LLR) and linear function approximation (LFA), respectively. Both models are used to generate samples for planning: the local model is used at each time step only if its state-prediction error does not exceed a threshold, while the global model is used at the end of each episode. Employing both models aims to improve sample efficiency and accelerate convergence of the whole algorithm by fully exploiting local and global information. Experimentally, AC-HMLP and RAC-HMLP are compared with three representative algorithms on two Reinforcement Learning (RL) benchmark problems. The results demonstrate that the proposed methods perform best in terms of convergence rate and sample efficiency.
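A self-contained sketch of the local model alone is given below: local linear regression over the K nearest stored transitions, together with the error-threshold gate that decides whether its prediction may be used for step-wise planning. This is not the authors' code; the toy dynamics, neighborhood size K, and threshold are assumptions, and the global LFA model and actor-critic updates are omitted.

```python
import numpy as np

# Sketch of the local model only (not the authors' code): local linear
# regression (LLR) over the K nearest stored transitions, plus the error
# threshold gate for step-wise planning. Toy dynamics, K, threshold assumed.
rng = np.random.default_rng(0)
K, ERR_THRESHOLD = 8, 0.05

# Replay of observed transitions s -> s', here from toy dynamics s' = 0.9 s + noise.
S = rng.uniform(-1, 1, size=(200, 2))
S_next = 0.9 * S + 0.01 * rng.standard_normal(S.shape)

def llr_predict(s):
    """Fit s' = W^T [s; 1] on the K nearest neighbours of s and predict s'."""
    nn = np.argsort(np.linalg.norm(S - s, axis=1))[:K]
    X = np.hstack([S[nn], np.ones((K, 1))])               # bias column
    W, *_ = np.linalg.lstsq(X, S_next[nn], rcond=None)
    return np.append(s, 1.0) @ W

# Gate: trust the local model for planning only while its one-step error is small.
s, s_next = S[0], S_next[0]
error = np.linalg.norm(llr_predict(s) - s_next)
print("local model trusted for planning:", error <= ERR_THRESHOLD)
```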


2013, Vol. 756-759, pp. 3967-3971
Author(s): Bo Yan Ren, Zheng Qin, Feng Fei Zhao

Linear value function approximation with binary features is important in Reinforcement Learning (RL) research. When updating the value function, it is necessary to generate a feature vector that contains the features to be updated. In high-dimensional domains, this generation process takes considerably more time, which significantly degrades the algorithm's performance. Hence, this paper introduces the Optional Feature Vector Generation (OFVG) algorithm, an improved method for generating feature vectors that can be combined with any online, value-based RL method that uses and expands binary features. The paper shows empirically that OFVG performs well in high-dimensional domains.
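The setting OFVG targets can be illustrated as follows; this is not the OFVG algorithm itself, which the abstract does not detail. With binary features, only the indices of the active features need to be generated, and the linear TD update then touches just those weights. The hash-based feature generator and the sizes below are placeholders.

```python
import numpy as np

# Illustration of the setting OFVG targets, not the OFVG algorithm itself.
N_FEATURES = 1_000_000         # high-dimensional binary feature space (assumed)
N_ACTIVE = 8                   # features active per state, e.g. one per tiling
ALPHA, GAMMA = 0.1, 0.99
w = np.zeros(N_FEATURES)

def active_features(state):
    """Return the indices of the few binary features that are on for `state`."""
    return [hash((state, k)) % N_FEATURES for k in range(N_ACTIVE)]

def td_update(s, r, s_next, done):
    phi, phi_next = active_features(s), active_features(s_next)
    v = w[phi].sum()
    v_next = 0.0 if done else w[phi_next].sum()
    delta = r + GAMMA * v_next - v
    w[phi] += ALPHA * delta    # only the active weights are touched
```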


Author(s): Paniz Behboudian, Yash Satsangi, Matthew E. Taylor, Anna Harutyunyan, Michael Bowling

Reinforcement learning (RL) is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward: by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such potential-based reward shaping (PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced dynamic potential-based advice (DPBA) was proposed to tackle this challenge by learning the potential function values as part of the learning process. However, this article demonstrates theoretically and empirically that, while DPBA can facilitate learning with good advice, it does in fact alter the optimal policy. We further show that when a correction term is added to "fix" DPBA, it no longer provides effective shaping with good advice. We then present a simple method called policy invariant explicit shaping (PIES) and show theoretically and empirically that PIES can use arbitrary advice, speed up learning, and leave the optimal policy unchanged.
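For reference, the policy-invariance guarantee discussed here comes from standard potential-based reward shaping, sketched below in tabular Q-learning. The toy chain potential and problem sizes are assumptions; DPBA and PIES themselves are not reproduced here.

```python
import numpy as np

# Tabular Q-learning with potential-based reward shaping (PBRS): the shaping
# term F = gamma * Phi(s') - Phi(s) is added to the environment reward, which
# provably leaves the optimal policy unchanged. Toy potential and sizes assumed.
N_STATES, N_ACTIONS = 20, 2
GAMMA, ALPHA = 0.99, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))
phi = np.linspace(0.0, 1.0, N_STATES)   # toy advice: states nearer the goal rank higher

def shaped_q_update(s, a, r, s_next, done):
    F = (0.0 if done else GAMMA * phi[s_next]) - phi[s]      # PBRS shaping term
    target = r + F + (0.0 if done else GAMMA * Q[s_next].max())
    Q[s, a] += ALPHA * (target - Q[s, a])
```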


Author(s): H. N. Mhaskar, S. V. Pereverzyev, M. D. van der Walt

The problem of real-time prediction of blood glucose (BG) levels based on the readings from a continuous glucose monitoring (CGM) device is of great importance in diabetes care, and has therefore attracted a lot of research in recent years, especially based on machine learning. An accurate prediction with a 30, 60, or 90 min prediction horizon has the potential of saving millions of dollars in emergency care costs. In this paper, we treat the problem as one of function approximation, where the value of the BG level at time t+h (where h is the prediction horizon) is considered to be an unknown function of d readings prior to time t. This unknown function may be supported, in particular, on some unknown submanifold of the d-dimensional Euclidean space. While manifold learning is classically done in a semi-supervised setting, where the entire data set has to be known in advance, we use recent ideas to achieve an accurate function approximation in a supervised setting; i.e., we construct a model for the target function. We use the state-of-the-art, clinically relevant PRED-EGA grid to evaluate our results, and demonstrate that, for a real-life dataset, our method performs better than a standard deep network, especially in hypoglycemic and hyperglycemic regimes. One noteworthy aspect of this work is that the training data and test data may come from different distributions.
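The supervised formulation can be sketched as a sliding-window construction, shown below. The synthetic series, the window length d = 7, and the 30-minute horizon are illustrative stand-ins, not the paper's data or settings.

```python
import numpy as np

# Sketch of the supervised formulation: from a CGM series sampled every
# 5 minutes, build pairs (d readings up to time t) -> (BG level at t + h).
SAMPLE_MIN = 5                               # CGM sampling interval (minutes)
d, horizon_min = 7, 30                       # assumed window length and horizon
h = horizon_min // SAMPLE_MIN                # prediction horizon in samples

t_axis = np.arange(1000)
cgm = 120 + 30 * np.sin(2 * np.pi * t_axis / 288)   # stand-in for real CGM readings (mg/dL)

X = np.stack([cgm[i - d:i] for i in range(d, len(cgm) - h)])      # d readings ending at time t
y = np.array([cgm[i - 1 + h] for i in range(d, len(cgm) - h)])    # target: BG level at t + h
print(X.shape, y.shape)   # (len(cgm) - d - h, d) and (len(cgm) - d - h,)
```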


2008, Vol. 17 (01), pp. 159-174
Author(s): Daniel Stronger, Peter Stone

In order for an autonomous agent to behave robustly in a variety of environments, it must have the ability to learn approximations to many different functions. The function approximator used by such an agent is subject to a number of constraints that may not apply in a traditional supervised learning setting. Many different function approximators exist and are appropriate for different problems. This paper proposes a set of criteria for function approximators for autonomous agents. Additionally, for those problems on which polynomial regression is a candidate technique, the paper presents an enhancement that meets these criteria. In particular, using polynomial regression typically requires a manual choice of the polynomial's degree, trading off between function accuracy and computational and memory efficiency. Polynomial Regression with Automated Degree (PRAD) is a novel function approximation method that uses training data to automatically identify an appropriate degree for the polynomial. PRAD is fully implemented. Empirical tests demonstrate its ability to efficiently and accurately approximate both a wide variety of synthetic functions and real-world data gathered by a mobile robot.
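A rough sketch of the underlying idea follows. It is not PRAD itself: the paper's degree-selection criterion is replaced here by simple held-out validation, and the synthetic cubic data is an assumption for illustration.

```python
import numpy as np

# Choose a polynomial degree automatically from the training data, here by
# held-out validation error rather than a manual choice (not the PRAD method).
def fit_poly_auto_degree(x, y, max_degree=10, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = max(1, int(val_fraction * len(x)))
    val, train = idx[:n_val], idx[n_val:]

    best_deg, best_err, best_coef = None, np.inf, None
    for deg in range(max_degree + 1):
        coef = np.polyfit(x[train], y[train], deg)
        err = np.mean((np.polyval(coef, x[val]) - y[val]) ** 2)
        if err < best_err:
            best_deg, best_err, best_coef = deg, err, coef
    return best_deg, best_coef

# Example: data from a noisy cubic; the selected degree should be near 3.
x = np.linspace(-1, 1, 200)
y = 2 * x**3 - x + 0.05 * np.random.default_rng(1).standard_normal(x.shape)
deg, coef = fit_poly_auto_degree(x, y)
print("selected degree:", deg)
```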


2021, Vol. 2021 (4), pp. 163-183
Author(s): Wenxiao Wang, Tianhao Wang, Lun Wang, Nanqing Luo, Pan Zhou, ...

Deep learning techniques have achieved remarkable performance on a wide range of tasks. However, when trained on privacy-sensitive datasets, the model parameters may expose private information in the training data. Prior attempts at differentially private training, although offering rigorous privacy guarantees, lead to much lower model performance than their non-private counterparts. Moreover, different runs of the same training algorithm produce models with large performance variance. To address these issues, we propose DPlis (Differentially Private Learning wIth Smoothing). The core idea of DPlis is to construct a smooth loss function that favors noise-resilient models lying in large flat regions of the loss landscape. We provide theoretical justification for the utility improvements of DPlis. Extensive experiments also demonstrate that DPlis can effectively boost model quality and training stability under a given privacy budget.
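The smoothing idea can be illustrated with a toy one-dimensional loss; this is not the DPlis implementation. The smoothed loss is the local average of the loss under Gaussian parameter perturbations, estimated by Monte Carlo, which inflates sharp minima while leaving flat ones nearly unchanged. The noise scale, sample count, and toy loss are assumptions chosen only to show the effect.

```python
import numpy as np

# Toy illustration of loss smoothing, not the DPlis implementation: the
# smoothed loss is E[L(w + u)], u ~ N(0, sigma^2 I), estimated by Monte Carlo.
def smoothed_loss(loss_fn, w, sigma=0.2, k=64, seed=0):
    rng = np.random.default_rng(seed)
    perturbed = w + sigma * rng.standard_normal((k, w.size))
    return np.mean([loss_fn(p) for p in perturbed])

# A sharp minimum at w = 0 (raw loss 0) and a flat minimum at w = 3 (raw loss 0.5).
loss = lambda w: min((10 * w[0]) ** 2, (w[0] - 3) ** 2 + 0.5)

print(smoothed_loss(loss, np.array([0.0])))   # sharp minimum is inflated far above the flat one
print(smoothed_loss(loss, np.array([3.0])))   # flat minimum stays close to its raw value
```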


2002, Vol. 11 (02), pp. 189-202
Author(s): Rudy Setiono, Arnulfo Azcarraga

Neural networks with a single hidden layer are known to be universal function approximators. However, due to the complexity of the network topology and the nonlinear transfer function used in computing the hidden unit activations, the predictions of a trained network are difficult to comprehend. On the other hand, predictions from a multiple linear regression equation are easy to understand but are not accurate when the underlying relationship between the input variables and the output variable is nonlinear. We have thus developed a method for multivariate function approximation that combines neural network learning, clustering, and multiple regression. This method generates a set of multiple linear regression equations using neural networks, where the number of regression equations is determined by clustering the weighted input variables. The predictions for samples of the same cluster are computed by the same regression equation. Experimental results on a number of real-world data sets demonstrate that this new method generates relatively few regression equations from the training data samples. Yet, because it draws on the universal function approximation capacity of neural networks, its predictive accuracy is high. The prediction errors are comparable to or lower than those achieved by existing function approximation methods.
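A rough sketch of such a pipeline is given below. It is not the paper's method: the synthetic data, the use of scikit-learn estimators, and the way inputs are weighted by the learned input-to-hidden weights are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Illustrative pipeline (not the paper's method): train a single-hidden-layer
# network, cluster the weighted inputs, then fit one regression per cluster.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.standard_normal(500)

# 1. Train a single-hidden-layer network on the data.
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000, random_state=0).fit(X, y)

# 2. Cluster the weighted input variables (inputs projected by the learned weights).
W = net.coefs_[0]                                  # input-to-hidden weights, shape (3, 10)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X @ W)

# 3. Fit one multiple linear regression equation per cluster.
labels = km.labels_
equations = [LinearRegression().fit(X[labels == c], y[labels == c]) for c in range(4)]

def predict(x_new):
    c = km.predict(x_new.reshape(1, -1) @ W)[0]    # assign the sample to a cluster
    return equations[c].predict(x_new.reshape(1, -1))[0]

print(predict(np.array([0.2, -0.5, 0.1])))
```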


2020, Vol. 34 (04), pp. 3741-3748
Author(s): Kristopher De Asis, Alan Chan, Silviu Pitis, Richard Sutton, Daniel Graves

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h−1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as “the deadly triad”). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
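A minimal tabular sketch of the bootstrapping structure follows; the state count, maximum horizon, and undiscounted return are illustrative choices, not the paper's settings.

```python
import numpy as np

# Fixed-horizon TD in tabular form: the value for horizon h bootstraps from
# the value for horizon h - 1, so no value function ever bootstraps from itself.
N_STATES, H = 20, 10
ALPHA = 0.1
V = np.zeros((H + 1, N_STATES))   # V[h, s]: expected sum of the next h rewards; V[0, :] stays 0

def fixed_horizon_update(s, r, s_next, done):
    """Update all horizons from a single transition (s, r, s_next)."""
    for h in range(1, H + 1):
        bootstrap = 0.0 if done else V[h - 1, s_next]
        V[h, s] += ALPHA * (r + bootstrap - V[h, s])
```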

