Policy invariant explicit shaping: an efficient alternative to reward shaping

Author(s):  
Paniz Behboudian ◽  
Yash Satsangi ◽  
Matthew E. Taylor ◽  
Anna Harutyunyan ◽  
Michael Bowling

Abstract Reinforcement learning (RL) is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward; by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such potential-based reward shaping (PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced dynamic potential-based advice (DPBA) was proposed to tackle this challenge by predicting the potential function values as part of the learning process. However, this article demonstrates theoretically and empirically that, while DPBA can facilitate learning with good advice, it does in fact alter the optimal policy. We further show that when a correction term is added to "fix" DPBA, it no longer provides effective shaping with good advice. We then present a simple method called policy invariant explicit shaping (PIES) and show theoretically and empirically that PIES can use arbitrary advice, speed up learning, and leave the optimal policy unchanged.
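
For context, a minimal sketch of potential-based reward shaping, the mechanism PIES and DPBA build on: the shaping bonus F(s, s') = γΦ(s') − Φ(s) is added to the environment reward. The tabular Q-learning loop and the `env` interface (`reset`, `step`, `n_states`, `n_actions`, `sample_action`) are illustrative assumptions, not the PIES implementation.

```python
import numpy as np

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: add F(s, s') = gamma*phi(s') - phi(s) to the reward.
    Because F is a potential difference, the optimal policy is unchanged."""
    return r + gamma * phi(s_next) - phi(s)

def q_learning_with_shaping(env, phi, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning that learns from the shaped reward (illustrative env API)."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = shaped_reward(r, s, s_next, phi, gamma) + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```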

1986 ◽  
Vol 103 (3-4) ◽  
pp. 347-358 ◽  
Author(s):  
Hans G. Kaper ◽  
Man Kam Kwong

This article is concerned with the asymptotic behaviour of m(λ), the Titchmarsh-Weyl m-coefficient, for the singular eigenvalue equation y″ + (λ − q(x))y = 0 on [0, ∞), as λ → ∞ in a sector in the upper half of the complex plane. It is assumed that the potential function q is integrable near 0. A simplified proof is given of a result of Atkinson [7], who derived the first two terms in the asymptotic expansion of m(λ), and a sharper error bound is obtained. The proof is then generalised to derive subsequent terms in the asymptotic expansion. It is shown that the Titchmarsh-Weyl m-coefficient admits an asymptotic power series expansion if the potential function satisfies some smoothness condition. A simple method to compute the expansion coefficients is presented. The results for the first few coefficients agree with those given by Harris [9].
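
As a reminder of the object being expanded, the standard definition of the Titchmarsh-Weyl m-coefficient for this problem can be stated as follows (a textbook formulation under the usual initial conditions, not a reproduction of the article's argument):

```latex
% Eigenvalue equation on [0,\infty):  y'' + (\lambda - q(x))\,y = 0.
% Let \theta(\cdot,\lambda), \varphi(\cdot,\lambda) solve it with
%   \theta(0,\lambda) = 1,\ \theta'(0,\lambda) = 0, \qquad
%   \varphi(0,\lambda) = 0,\ \varphi'(0,\lambda) = 1.
% In the limit-point case there is a unique coefficient m(\lambda) such that
\[
  \psi(x,\lambda) \;=\; \theta(x,\lambda) + m(\lambda)\,\varphi(x,\lambda)
  \;\in\; L^{2}(0,\infty), \qquad \operatorname{Im}\lambda \neq 0 ,
\]
% and the article studies the behaviour of m(\lambda) as \lambda \to \infty
% in a sector of the upper half-plane.
```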


2016 ◽  
Vol 31 (1) ◽  
pp. 44-58 ◽  
Author(s):  
Sam Devlin ◽  
Daniel Kudenko

Abstract Recent theoretical results have justified the use of potential-based reward shaping as a way to improve the performance of multi-agent reinforcement learning (MARL). However, the question remains of how to generate a useful potential function. Previous research demonstrated the use of STRIPS operator knowledge to automatically generate a potential function for single-agent reinforcement learning. Following up on this work, we investigate the use of STRIPS planning knowledge in the context of MARL. Our results show that a potential function based on joint or individual plan knowledge can significantly improve MARL performance compared with no shaping. In addition, we investigate the limitations of individual plan knowledge as a source of reward shaping in cases where the combination of individual agent plans causes conflict.
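
A hedged sketch of how a plan-derived potential function is commonly constructed for shaping; the ordered list of plan-step predicates and the scaling constant `omega` are assumptions for illustration, not the authors' exact encoding.

```python
def make_plan_potential(plan_steps, omega=100.0):
    """Build a potential function from an ordered list of plan-step predicates.

    plan_steps: list of callables, each returning True if that plan step is
    already satisfied in the given state. The potential grows with the number of
    consecutive steps achieved, so shaping rewards progress along the plan.
    """
    def phi(state):
        progress = 0
        for step_satisfied in plan_steps:
            if step_satisfied(state):
                progress += 1
            else:
                break
        return omega * progress
    return phi

# The resulting phi is then used exactly as in potential-based reward shaping:
# F(s, s') = gamma * phi(s') - phi(s) is added to the environment reward.
```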


2020 ◽  
Author(s):  
Jack Scantlebury ◽  
Nathan Brown ◽  
Frank Von Delft ◽  
Charlotte M. Deane

Abstract Current deep learning methods for structure-based virtual screening take the structures of both the protein and the ligand as input but make little or no use of the protein structure when predicting ligand binding. Here we show how a relatively simple method of dataset augmentation forces such deep learning methods to take into account information from the protein. Models trained in this way are more generalisable (they make better predictions on protein-ligand complexes from a different distribution to the training data). They also assign more meaningful importance to the protein and ligand atoms involved in binding. Overall, our results show that dataset augmentation can help deep learning based virtual screening to learn physical interactions rather than dataset biases.
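
The abstract does not detail the augmentation itself; the sketch below shows one generic way to make models attend to the protein, pairing each ligand with mismatched pockets as synthetic negatives. This is an illustrative assumption, not necessarily the authors' procedure, and the `complexes` record format is invented for the example.

```python
import random

def augment_with_mismatched_pockets(complexes, negatives_per_ligand=1, seed=0):
    """Create synthetic non-binding examples by pairing each ligand with pockets
    drawn from other complexes, so a model cannot score binding from the ligand
    alone and must attend to the protein.

    complexes: list of dicts with keys 'protein', 'ligand', 'label' (invented format).
    """
    rng = random.Random(seed)
    augmented = list(complexes)
    pockets = [c["protein"] for c in complexes]
    for c in complexes:
        for _ in range(negatives_per_ligand):
            decoy = rng.choice(pockets)
            if decoy is c["protein"]:
                continue  # skip the ligand's own pocket
            augmented.append({"protein": decoy, "ligand": c["ligand"], "label": 0})
    return augmented
```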


Author(s):  
Vanajakshi Puttaswamy Gowda ◽  
Mathivanan Murugavelu ◽  
Senthil Kumaran Thangamuthu

Continuous speech segmentation and its recognition play an important role in natural language processing. Context-based segmentation of continuous Kannada speech depends on the context, grammar, and semantic rules of the Kannada language, and extracting significant features of the Kannada speech signal for a recognition system remains an interesting challenge for researchers. The proposed method is divided into two parts. The first part segments the continuous Kannada speech signal with respect to context by computing the average short-term energy and the spectral centroid coefficients of the speech signal within a specified window; the resulting segments are meaningful across different scenarios and have low segmentation error. The second part performs speech recognition by extracting a small number of Mel-frequency cepstral coefficients and using vector quantization with a small number of codebooks. Recognition is based entirely on a threshold value; setting this threshold is challenging, but a simple method is used to achieve a good recognition rate. Experimental results show more efficient and effective segmentation, with a higher recognition rate than existing methods for continuous context-based Kannada speech with different male and female accents, while using minimal feature dimensions for the training data.
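
A minimal sketch of the two frame-level features the segmentation stage relies on (average short-term energy and spectral centroid); the frame length, hop size, window, and thresholding rule are illustrative assumptions.

```python
import numpy as np

def frame_features(signal, sr, frame_len=0.025, hop=0.010):
    """Compute per-frame average short-term energy and spectral centroid."""
    n = int(frame_len * sr)
    h = int(hop * sr)
    energies, centroids = [], []
    for start in range(0, len(signal) - n + 1, h):
        frame = signal[start:start + n] * np.hamming(n)
        energies.append(np.mean(frame ** 2))            # average short-term energy
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, d=1.0 / sr)
        centroids.append((freqs * spectrum).sum() / (spectrum.sum() + 1e-12))
    return np.array(energies), np.array(centroids)

def speech_mask(energies, centroids, e_thr, c_thr):
    """Mark a frame as speech when both features exceed their thresholds;
    contiguous speech frames form one segment, gaps mark segment boundaries."""
    return (energies > e_thr) & (centroids > c_thr)
```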


1991 ◽  
Vol 46 (4) ◽  
pp. 357-362 ◽  
Author(s):  
Bernd M. Rode ◽  
Saiful M. Islam

Abstract Monte Carlo simulations for a Cu2+ ion in infinitely dilute aqueous solution were performed on the basis of a simple pair potential function, leading to a first-shell coordination number of 8, in contrast to experimental data. A simple method was therefore introduced which allows the direct construction of a pair potential containing the most relevant 3-body interactions by means of a correction for the nearest-neighbour ligands in the ion's first hydration shell. This procedure leads to much improved results, without a significant increase in computational effort during potential construction and simulation.
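
The functional forms are not given in the abstract; the sketch below only illustrates the bookkeeping of adding a first-shell (nearest-neighbour) correction on top of a plain pair-potential sum, with the individual energy terms left as placeholder callables.

```python
def total_interaction_energy(distances, pair_potential, shell_correction, n_shell):
    """Illustrative energy bookkeeping for an ion surrounded by water ligands.

    distances:        ion-ligand distances for all water molecules.
    pair_potential:   callable r -> two-body ion-water energy (placeholder).
    shell_correction: callable r -> correction mimicking three-body effects,
                      applied only to the n_shell nearest ligands (placeholder).
    """
    ordered = sorted(distances)
    energy = sum(pair_potential(r) for r in ordered)
    # Correct only the nearest-neighbour (first hydration shell) interactions.
    energy += sum(shell_correction(r) for r in ordered[:n_shell])
    return energy
```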


2005 ◽  
Vol 83 (6) ◽  
pp. 653-660 ◽  
Author(s):  
Quan Liu ◽  
Li-rong Chen

A useful and simple method for studying the melting temperature Tm of ion compounds has been developed by using analyses originally due to diffusional force theory, incorporating Pandey's formulation and Harrison's potential function. The calculated values of Tm for a wide range of compounds of types IA–VII (alkali halide), IIA–VI (alkaline-earth chalcogenide), and IA–VI (alkali chalcogenide) are found to agree fairly well with experimental values for Tm and to be superior to results from previous approaches involving similar methods. PACS Nos.: 64.70.Dv, 67.80.Gb


2021 ◽  
Author(s):  
Ryan Stewart ◽  
Majdi R. Abou Najm ◽  
Simone Di Prima ◽  
Laurent Lassabatere

Water repellency occurs in soils under a wide spectrum of conditions. Soil water repellency can originate from the deposition of resinous materials and exudates from vegetation, vaporization and condensation of organic compounds during fires, or the presence of anthropogenic-derived chemicals like petroleum products, wastewater or other urban contaminants. Its effects on soils range from mild to severe, and it often leads to hydrophobic conditions that can significantly impact the infiltration response with effects extending to the watershed scale. Those effects are often time-dependent, making it a challenge to simulate infiltration behaviors of water-repellent soils using standard infiltration models. Here, we introduce a single rate-constant parameter (α_WR) and propose a simple correction term (1 − e^(−α_WR·t)) to modify models for infiltration rate. This term starts with a value of zero at the beginning of the infiltration experiment (t = 0) and asymptotically approaches 1 as time increases, thus simulating a decreasing effect of soil water repellency through time. The correction term can be added to any infiltration model (one-, two-, or three-dimensional) and will account for the water repellency effect. Results from 165 infiltration experiments from different ecosystems and a wide range of water repellency effects validated the effectiveness of this simple method to characterize water repellency in infiltration models. Tested with the simple two-term infiltration equation developed by Philip, we obtained consistent and substantial error reductions, particularly for more repellent soils. Furthermore, results revealed that soils that were burned during a wildfire had smaller α_WR values compared to unburned controls, thus indicating that the magnitude of α_WR may have a physical basis.
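
A small sketch of how the correction factor combines with Philip's two-term infiltration rate equation i(t) = S/(2√t) + A; the sorptivity S, the steady-state term A, and the fitting step are assumptions of the example, while the (1 − e^(−α_WR·t)) factor follows the abstract.

```python
import numpy as np

def philip_rate(t, S, A):
    """Philip's two-term infiltration rate: i(t) = S / (2*sqrt(t)) + A, for t > 0."""
    t = np.asarray(t, dtype=float)
    return S / (2.0 * np.sqrt(t)) + A

def water_repellent_rate(t, S, A, alpha_wr):
    """Infiltration rate multiplied by the repellency correction (1 - exp(-alpha_wr*t)),
    which is 0 at t = 0 and tends to 1 as the repellency effect fades."""
    t = np.asarray(t, dtype=float)
    return (1.0 - np.exp(-alpha_wr * t)) * philip_rate(t, S, A)

# alpha_wr (together with S and A) could then be fitted to measured infiltration
# rates, e.g. with scipy.optimize.curve_fit.
```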


Author(s):  
Marek Grzes ◽  
Daniel Kudenko

A crucial trade-off is involved in the design process when function approximation is used in reinforcement learning. Ideally, the chosen representation should allow the value function to be approximated as closely as possible. However, the more expressive the representation, the more training data is needed, because the space of candidate hypotheses is larger. A less expressive representation has a smaller hypothesis space, so a good candidate can be found faster. The core idea of this chapter is the use of a mixed resolution function approximation, that is, the use of a less expressive function approximation to provide useful guidance during learning, and the use of a more expressive function approximation to obtain a final result of high quality. A major question is how to combine the two representations. Two approaches are proposed and evaluated empirically: the use of two resolutions in one function approximation, and a more sophisticated algorithm with the application of reward shaping.
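
A hedged sketch of the second approach mentioned here: treating the coarse (less expressive) approximator's value estimate as a potential function that shapes the reward seen by the fine-resolution learner. The function names and the default discount are assumptions, not the chapter's exact algorithm.

```python
def coarse_potential_shaping(coarse_value, gamma=0.99):
    """Turn a coarse-resolution value estimate into a shaping signal.

    F(s, s') = gamma * V_coarse(s') - V_coarse(s) guides the fine-resolution
    learner while its own (more expressive) approximation is still poor; being
    potential-based, it does not change the optimal policy.
    """
    def F(s, s_next):
        return gamma * coarse_value(s_next) - coarse_value(s)
    return F
```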


2016 ◽  
Vol 31 (1) ◽  
pp. 31-43 ◽  
Author(s):  
Kyriakos Efthymiadis ◽  
Sam Devlin ◽  
Daniel Kudenko

Abstract Reward shaping has been shown to significantly improve an agent's performance in reinforcement learning. Plan-based reward shaping is a successful approach in which a STRIPS plan is used in order to guide the agent to the optimal behaviour. However, if the provided knowledge is wrong, it has been shown that the agent will take longer to learn the optimal policy; previously, in some cases, it was better to ignore all prior knowledge despite it being only partially incorrect. This paper introduces a novel use of knowledge revision to overcome incorrect domain knowledge when provided to an agent receiving plan-based reward shaping. Empirical results show that an agent using this method can outperform the previous agent receiving plan-based reward shaping without knowledge revision.


2019 ◽  
Vol 34 ◽  
Author(s):  
Mao Li ◽  
Tim Brys ◽  
Daniel Kudenko

Abstract One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping can help to resolve the aforementioned issue of sparse reward by incorporating an expert’s domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstration to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that significantly further speeds up the learning. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Good quality decisions, according to a Monte Carlo estimation, will be kept in the queue, while poorer decisions will be rejected. The queue is then used as demonstration to speed up RL via reward shaping. A human expert’s demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art approaches in RLfD in both domains.
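
A simplified sketch of the introspection mechanism described above: decisions are scored by a Monte Carlo return estimate and only the best are retained in a bounded priority queue, later usable as demonstrations for reward shaping. The capacity, tie-breaking, and interface are illustrative assumptions.

```python
import heapq
import itertools

class IntrospectionQueue:
    """Keep the highest-return (state, action) decisions seen during learning."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._heap = []                      # min-heap of (return, tiebreak, state, action)
        self._tiebreak = itertools.count()   # avoids comparing states when returns tie

    def record(self, mc_return, state, action):
        """Insert a decision scored by its Monte Carlo return; drop the worst if full."""
        entry = (mc_return, next(self._tiebreak), state, action)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif mc_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)

    def demonstrations(self):
        """Retained (state, action) pairs, usable as demonstrations for reward shaping."""
        return [(s, a) for _, _, s, a in self._heap]
```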

