Policy invariant explicit shaping: an efficient alternative to reward shaping

Author(s):  
Paniz Behboudian ◽  
Yash Satsangi ◽  
Matthew E. Taylor ◽  
Anna Harutyunyan ◽  
Michael Bowling

Abstract Reinforcement learning (RL) is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward; by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such potential-based reward shaping (PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced dynamic potential-based advice (DPBA) was proposed to tackle this challenge by predicting the potential function values as part of the learning process. However, this article demonstrates theoretically and empirically that, while DPBA can facilitate learning with good advice, it does in fact alter the optimal policy. We further show that when a correction term is added to "fix" DPBA, it no longer provides effective shaping with good advice. We then present a simple method called policy invariant explicit shaping (PIES) and show theoretically and empirically that PIES can use arbitrary advice, speed up learning, and leave the optimal policy unchanged.
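
For context, a minimal sketch of potential-based reward shaping, the mechanism PIES and DPBA build on: the shaping bonus F(s, s') = γΦ(s') − Φ(s) is added to the environment reward. The tabular Q-learning loop and the `env` interface (`reset`, `step`, `n_states`, `n_actions`, `sample_action`) are illustrative assumptions, not the PIES implementation.

```python
import numpy as np

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: add F(s, s') = gamma*phi(s') - phi(s) to the reward.
    Because F is a potential difference, the optimal policy is unchanged."""
    return r + gamma * phi(s_next) - phi(s)

def q_learning_with_shaping(env, phi, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning that learns from the shaped reward (illustrative env API)."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = shaped_reward(r, s, s_next, phi, gamma) + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```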

1986 ◽  
Vol 103 (3-4) ◽  
pp. 347-358 ◽  
Author(s):  
Hans G. Kaper ◽  
Man Kam Kwong

This article is concerned with the asymptotic behaviour of m(λ), the Titchmarsh-Weyl m-coefficient, for the singular eigenvalue equation y″ + (λ − q(x))y = 0 on [0, ∞), as λ → ∞ in a sector in the upper half of the complex plane. It is assumed that the potential function q is integrable near 0. A simplified proof is given of a result of Atkinson [7], who derived the first two terms in the asymptotic expansion of m(λ), and a sharper error bound is obtained. The proof is then generalised to derive subsequent terms in the asymptotic expansion. It is shown that the Titchmarsh-Weyl m-coefficient admits an asymptotic power series expansion if the potential function satisfies some smoothness condition. A simple method to compute the expansion coefficients is presented. The results for the first few coefficients agree with those given by Harris [9].
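
As a reminder of the object being expanded, the standard definition of the Titchmarsh-Weyl m-coefficient for this problem can be stated as follows (a textbook formulation under the usual initial conditions, not a reproduction of the article's argument):

```latex
% Eigenvalue equation on [0,\infty):  y'' + (\lambda - q(x))\,y = 0.
% Let \theta(\cdot,\lambda), \varphi(\cdot,\lambda) solve it with
%   \theta(0,\lambda) = 1,\ \theta'(0,\lambda) = 0, \qquad
%   \varphi(0,\lambda) = 0,\ \varphi'(0,\lambda) = 1.
% In the limit-point case there is a unique coefficient m(\lambda) such that
\[
  \psi(x,\lambda) \;=\; \theta(x,\lambda) + m(\lambda)\,\varphi(x,\lambda)
  \;\in\; L^{2}(0,\infty), \qquad \operatorname{Im}\lambda \neq 0 ,
\]
% and the article studies the behaviour of m(\lambda) as \lambda \to \infty
% in a sector of the upper half-plane.
```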


2016 ◽  
Vol 31 (1) ◽  
pp. 44-58 ◽  
Author(s):  
Sam Devlin ◽  
Daniel Kudenko

Abstract Recent theoretical results have justified the use of potential-based reward shaping as a way to improve the performance of multi-agent reinforcement learning (MARL). However, the question remains of how to generate a useful potential function. Previous research demonstrated the use of STRIPS operator knowledge to automatically generate a potential function for single-agent reinforcement learning. Following up on this work, we investigate the use of STRIPS planning knowledge in the context of MARL. Our results show that a potential function based on joint or individual plan knowledge can significantly improve MARL performance compared with no shaping. In addition, we investigate the limitations of individual plan knowledge as a source of reward shaping in cases where the combination of individual agent plans causes conflict.
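
A hedged sketch of how a plan-derived potential function is commonly constructed for shaping; the ordered list of plan-step predicates and the scaling constant `omega` are assumptions for illustration, not the authors' exact encoding.

```python
def make_plan_potential(plan_steps, omega=100.0):
    """Build a potential function from an ordered list of plan-step predicates.

    plan_steps: list of callables, each returning True if that plan step is
    already satisfied in the given state. The potential grows with the number of
    consecutive steps achieved, so shaping rewards progress along the plan.
    """
    def phi(state):
        progress = 0
        for step_satisfied in plan_steps:
            if step_satisfied(state):
                progress += 1
            else:
                break
        return omega * progress
    return phi

# The resulting phi is then used exactly as in potential-based reward shaping:
# F(s, s') = gamma * phi(s') - phi(s) is added to the environment reward.
```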


2020 ◽  
Author(s):  
Jack Scantlebury ◽  
Nathan Brown ◽  
Frank Von Delft ◽  
Charlotte M. Deane

Abstract Current deep learning methods for structure-based virtual screening take the structures of both the protein and the ligand as input but make little or no use of the protein structure when predicting ligand binding. Here we show how a relatively simple method of dataset augmentation forces such deep learning methods to take into account information from the protein. Models trained in this way are more generalisable (they make better predictions on protein-ligand complexes from a different distribution to the training data). They also assign more meaningful importance to the protein and ligand atoms involved in binding. Overall, our results show that dataset augmentation can help deep learning based virtual screening to learn physical interactions rather than dataset biases.
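
The abstract does not detail the augmentation itself; the sketch below shows one generic way to make models attend to the protein, pairing each ligand with mismatched pockets as synthetic negatives. This is an illustrative assumption, not necessarily the authors' procedure, and the `complexes` record format is invented for the example.

```python
import random

def augment_with_mismatched_pockets(complexes, negatives_per_ligand=1, seed=0):
    """Create synthetic non-binding examples by pairing each ligand with pockets
    drawn from other complexes, so a model cannot score binding from the ligand
    alone and must attend to the protein.

    complexes: list of dicts with keys 'protein', 'ligand', 'label' (invented format).
    """
    rng = random.Random(seed)
    augmented = list(complexes)
    pockets = [c["protein"] for c in complexes]
    for c in complexes:
        for _ in range(negatives_per_ligand):
            decoy = rng.choice(pockets)
            if decoy is c["protein"]:
                continue  # skip the ligand's own pocket
            augmented.append({"protein": decoy, "ligand": c["ligand"], "label": 0})
    return augmented
```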


Author(s):  
Vanajakshi Puttaswamy Gowda ◽  
Mathivanan Murugavelu ◽  
Senthil Kumaran Thangamuthu

Continuous speech segmentation and its recognition play an important role in natural language processing. Context-based segmentation of continuous Kannada speech depends on the context, grammar, and semantic rules of the Kannada language, and extracting significant features of the Kannada speech signal for a recognition system remains an interesting challenge for researchers. The proposed method is divided into two parts. The first part segments the continuous Kannada speech signal with respect to context by computing the average short-term energy and the spectral centroid coefficients of the speech signal within a specified window; the resulting segments are meaningful across different scenarios and have low segmentation error. The second part performs speech recognition by extracting a small number of Mel-frequency cepstral coefficients and using vector quantization with a small number of codebooks. Recognition is based entirely on a threshold value; setting this threshold is challenging, but a simple method is used to achieve a good recognition rate. Experimental results show more efficient and effective segmentation, with a higher recognition rate than existing methods for continuous context-based Kannada speech with different male and female accents, while using minimal feature dimensions for the training data.
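
A minimal sketch of the two frame-level features the segmentation stage relies on (average short-term energy and spectral centroid); the frame length, hop size, window, and thresholding rule are illustrative assumptions.

```python
import numpy as np

def frame_features(signal, sr, frame_len=0.025, hop=0.010):
    """Compute per-frame average short-term energy and spectral centroid."""
    n = int(frame_len * sr)
    h = int(hop * sr)
    energies, centroids = [], []
    for start in range(0, len(signal) - n + 1, h):
        frame = signal[start:start + n] * np.hamming(n)
        energies.append(np.mean(frame ** 2))            # average short-term energy
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, d=1.0 / sr)
        centroids.append((freqs * spectrum).sum() / (spectrum.sum() + 1e-12))
    return np.array(energies), np.array(centroids)

def speech_mask(energies, centroids, e_thr, c_thr):
    """Mark a frame as speech when both features exceed their thresholds;
    contiguous speech frames form one segment, gaps mark segment boundaries."""
    return (energies > e_thr) & (centroids > c_thr)
```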


1991 ◽  
Vol 46 (4) ◽  
pp. 357-362 ◽  
Author(s):  
Bernd M. Rode ◽  
Saiful M. Islam

Abstract Monte Carlo simulations for a Cu2+ ion in infinitely dilute aqueous solution were performed on the basis of a simple pair potential function, leading to a first-shell coordination number of 8, in contrast to experimental data. A simple method was therefore introduced which allows the direct construction of a pair potential containing the most relevant 3-body interactions by means of a correction for the nearest-neighbour ligands in the ion's first hydration shell. This procedure leads to much improved results, without a significant increase in computational effort during potential construction and simulation.
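
The functional forms are not given in the abstract; the sketch below only illustrates the bookkeeping of adding a first-shell (nearest-neighbour) correction on top of a plain pair-potential sum, with the individual energy terms left as placeholder callables.

```python
def total_interaction_energy(distances, pair_potential, shell_correction, n_shell):
    """Illustrative energy bookkeeping for an ion surrounded by water ligands.

    distances:        ion-ligand distances for all water molecules.
    pair_potential:   callable r -> two-body ion-water energy (placeholder).
    shell_correction: callable r -> correction mimicking three-body effects,
                      applied only to the n_shell nearest ligands (placeholder).
    """
    ordered = sorted(distances)
    energy = sum(pair_potential(r) for r in ordered)
    # Correct only the nearest-neighbour (first hydration shell) interactions.
    energy += sum(shell_correction(r) for r in ordered[:n_shell])
    return energy
```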


2005 ◽  
Vol 83 (6) ◽  
pp. 653-660 ◽  
Author(s):  
Quan Liu ◽  
Li-rong Chen

A useful and simple method for studying the melting temperature Tm of ion compounds has been developed by using analyses originally due to diffusional force theory, incorporating Pandey's formulation and Harrison's potential function. The calculated values of Tm for a wide range of compounds of types IA–VII (alkali halide), IIA–VI (alkaline-earth chalcogenide), and IA–VI (alkali chalcogenide) are found to agree fairly well with experimental values for Tm and to be superior to results from previous approaches involving similar methods. PACS Nos.: 64.70.Dv, 67.80.Gb


2021 ◽  
Author(s):  
Ryan Stewart ◽  
Majdi R. Abou Najm ◽  
Simone Di Prima ◽  
Laurent Lassabatere

Water repellency occurs in soils under a wide spectrum of conditions. Soil water repellency can originate from the deposition of resinous materials and exudates from vegetation, vaporization and condensation of organic compounds during fires, or the presence of anthropogenic-derived chemicals like petroleum products, wastewater or other urban contaminants. Its effects on soils range from mild to severe, and it often leads to hydrophobic conditions that can significantly impact the infiltration response with effects extending to the watershed scale. Those effects are often time-dependent, making it a challenge to simulate infiltration behaviors of water-repellent soils using standard infiltration models. Here, we introduce a single rate-constant parameter (α_WR) and propose a simple correction term (1 − e^(−α_WR·t)) to modify models for infiltration rate. This term starts with a value of zero at the beginning of the infiltration experiment (t = 0) and asymptotically approaches 1 as time increases, thus simulating a decreasing effect of soil water repellency through time. The correction term can be added to any infiltration model (one-, two-, or three-dimensional) and will account for the water repellency effect. Results from 165 infiltration experiments from different ecosystems and a wide range of water repellency effects validated the effectiveness of this simple method to characterize water repellency in infiltration models. Tested with the simple two-term infiltration equation developed by Philip, we obtained consistent and substantial error reductions, particularly for more repellent soils. Furthermore, results revealed that soils that were burned during a wildfire had smaller α_WR values compared to unburned controls, thus indicating that the magnitude of α_WR may have a physical basis.
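
A small sketch of how the correction factor combines with Philip's two-term infiltration rate equation i(t) = S/(2√t) + A; the sorptivity S, the steady-state term A, and the fitting step are assumptions of the example, while the (1 − e^(−α_WR·t)) factor follows the abstract.

```python
import numpy as np

def philip_rate(t, S, A):
    """Philip's two-term infiltration rate: i(t) = S / (2*sqrt(t)) + A, for t > 0."""
    t = np.asarray(t, dtype=float)
    return S / (2.0 * np.sqrt(t)) + A

def water_repellent_rate(t, S, A, alpha_wr):
    """Infiltration rate multiplied by the repellency correction (1 - exp(-alpha_wr*t)),
    which is 0 at t = 0 and tends to 1 as the repellency effect fades."""
    t = np.asarray(t, dtype=float)
    return (1.0 - np.exp(-alpha_wr * t)) * philip_rate(t, S, A)

# alpha_wr (together with S and A) could then be fitted to measured infiltration
# rates, e.g. with scipy.optimize.curve_fit.
```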


Author(s):  
Marek Grzes ◽  
Daniel Kudenko

A crucial trade-off is involved in the design process when function approximation is used in reinforcement learning. Ideally, the chosen representation should allow the value function to be approximated as closely as possible. However, the more expressive the representation, the more training data is needed, because the space of candidate hypotheses is larger. A less expressive representation has a smaller hypothesis space, so a good candidate can be found faster. The core idea of this chapter is the use of a mixed resolution function approximation, that is, the use of a less expressive function approximation to provide useful guidance during learning, and the use of a more expressive function approximation to obtain a final result of high quality. A major question is how to combine the two representations. Two approaches are proposed and evaluated empirically: the use of two resolutions in one function approximation, and a more sophisticated algorithm with the application of reward shaping.
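
A hedged sketch of the second approach mentioned here: treating the coarse (less expressive) approximator's value estimate as a potential function that shapes the reward seen by the fine-resolution learner. The function names and the default discount are assumptions, not the chapter's exact algorithm.

```python
def coarse_potential_shaping(coarse_value, gamma=0.99):
    """Turn a coarse-resolution value estimate into a shaping signal.

    F(s, s') = gamma * V_coarse(s') - V_coarse(s) guides the fine-resolution
    learner while its own (more expressive) approximation is still poor; being
    potential-based, it does not change the optimal policy.
    """
    def F(s, s_next):
        return gamma * coarse_value(s_next) - coarse_value(s)
    return F
```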


2016 ◽  
Vol 31 (1) ◽  
pp. 31-43 ◽  
Author(s):  
Kyriakos Efthymiadis ◽  
Sam Devlin ◽  
Daniel Kudenko

Abstract Reward shaping has been shown to significantly improve an agent's performance in reinforcement learning. Plan-based reward shaping is a successful approach in which a STRIPS plan is used in order to guide the agent to the optimal behaviour. However, if the provided knowledge is wrong, it has been shown that the agent will take longer to learn the optimal policy; previously, in some cases, it was better to ignore all prior knowledge despite it being only partially incorrect. This paper introduces a novel use of knowledge revision to overcome incorrect domain knowledge when provided to an agent receiving plan-based reward shaping. Empirical results show that an agent using this method can outperform the previous agent receiving plan-based reward shaping without knowledge revision.


2019 ◽  
Vol 34 ◽  
Author(s):  
Mao Li ◽  
Tim Brys ◽  
Daniel Kudenko

Abstract One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping can help to resolve the aforementioned issue of sparse reward by incorporating an expert’s domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstration to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that significantly further speeds up the learning. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Good quality decisions, according to a Monte Carlo estimation, will be kept in the queue, while poorer decisions will be rejected. The queue is then used as demonstration to speed up RL via reward shaping. A human expert’s demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art approaches in RLfD in both domains.
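
A simplified sketch of the introspection mechanism described above: decisions are scored by a Monte Carlo return estimate and only the best are retained in a bounded priority queue, later usable as demonstrations for reward shaping. The capacity, tie-breaking, and interface are illustrative assumptions.

```python
import heapq
import itertools

class IntrospectionQueue:
    """Keep the highest-return (state, action) decisions seen during learning."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._heap = []                      # min-heap of (return, tiebreak, state, action)
        self._tiebreak = itertools.count()   # avoids comparing states when returns tie

    def record(self, mc_return, state, action):
        """Insert a decision scored by its Monte Carlo return; drop the worst if full."""
        entry = (mc_return, next(self._tiebreak), state, action)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif mc_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)

    def demonstrations(self):
        """Retained (state, action) pairs, usable as demonstrations for reward shaping."""
        return [(s, a) for _, _, s, a in self._heap]
```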

