Overcoming incorrect knowledge in plan-based reward shaping

2016 ◽  
Vol 31 (1) ◽  
pp. 31-43 ◽  
Author(s):  
Kyriakos Efthymiadis ◽  
Sam Devlin ◽  
Daniel Kudenko

Abstract Reward shaping has been shown to significantly improve an agent’s performance in reinforcement learning. Plan-based reward shaping is a successful approach in which a STRIPS plan is used to guide the agent to the optimal behaviour. However, if the provided knowledge is wrong, it has been shown that the agent will take longer to learn the optimal policy. Previous work found that, in some cases, it was better to ignore all prior knowledge, even though it was only partially incorrect. This paper introduces a novel use of knowledge revision to overcome incorrect domain knowledge provided to an agent receiving plan-based reward shaping. Empirical results show that an agent using this method can outperform the previous agent receiving plan-based reward shaping without knowledge revision.
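
To make the mechanism concrete, below is a minimal sketch of plan-based potential shaping, where the potential grows with the agent's progress through the provided STRIPS plan. The constants and the step_reached() interface are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of plan-based potential shaping.
# GAMMA, OMEGA and step_reached() are illustrative assumptions.
GAMMA = 0.99   # discount factor
OMEGA = 10.0   # scale of the shaping potential

def potential(state, plan, step_reached):
    """Potential grows with the number of STRIPS plan steps already achieved."""
    return OMEGA * step_reached(state, plan)

def shaped_reward(env_reward, state, next_state, plan, step_reached):
    """Potential-based shaping F = gamma * Phi(s') - Phi(s), added to the
    environment reward; this form preserves the optimal policy."""
    return (env_reward
            + GAMMA * potential(next_state, plan, step_reached)
            - potential(state, plan, step_reached))
```

In the paper's setting, when experience contradicts the provided knowledge, knowledge revision would update the plan and the potential would then be recomputed over the corrected plan.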

2019 ◽  
Vol 34 ◽  
Author(s):  
Mao Li ◽  
Tim Brys ◽  
Daniel Kudenko

Abstract One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping can help to resolve the aforementioned issue of sparse reward by incorporating an expert’s domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstrations to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that speeds up learning significantly further. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Good-quality decisions, according to a Monte Carlo estimation, are kept in the queue, while poorer decisions are rejected. The queue is then used as a source of demonstrations to speed up RL via reward shaping. A human expert’s demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art RLfD approaches in both domains.
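
A minimal sketch of the introspection idea, assuming a bounded priority queue keyed on Monte Carlo returns; the class and method names are illustrative, not the paper's implementation.

```python
import heapq

class IntrospectionBuffer:
    """Bounded priority queue of (return, state, action) records; only the
    highest-return decisions are kept and later reused for reward shaping."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.heap = []       # min-heap keyed on Monte Carlo return
        self._counter = 0    # tie-breaker so heapq never compares states

    def add(self, mc_return, state, action):
        item = (mc_return, self._counter, state, action)
        self._counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif mc_return > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict the poorest decision

    def potential(self, state, action):
        """Shaping potential: highest recorded return for this (state, action);
        0 if the pair was never judged good enough to keep."""
        best = 0.0
        for g, _, s, a in self.heap:
            if s == state and a == action:
                best = max(best, g)
        return best
```

A human expert's demonstration can seed the buffer before learning by adding its state–action pairs with their observed returns; the potential can then be turned into a shaping reward in the usual potential-based way.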


Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 737 ◽
Author(s):  
Fengjie Sun ◽  
Xianchang Wang ◽  
Rui Zhang

An Unmanned Aerial Vehicle (UAV) can greatly reduce manpower in agricultural plant protection tasks such as watering, sowing, and pesticide spraying. It is essential to develop a Decision-making Support System (DSS) that helps a UAV choose the correct action in each state according to its policy. In an unknown environment, hand-crafting rules to guide the UAV’s action choices is not applicable, and obtaining the optimal policy through reinforcement learning is a feasible solution. However, experiments show that existing reinforcement learning algorithms cannot obtain the optimal policy for a UAV in the agricultural plant protection environment. In this work we propose an improved Q-learning algorithm based on similar state matching, and we prove theoretically that a UAV following the policy learned by the proposed algorithm has a greater probability of choosing the optimal action in the agricultural plant protection environment than one following classic Q-learning. The proposed algorithm is implemented and tested on evenly distributed datasets built from real UAV parameters and real farm information, and its performance evaluation is discussed in detail. Experimental results show that the proposed algorithm can efficiently learn the optimal policy for UAVs in the agricultural plant protection environment.
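
As a concrete illustration, here is a hedged sketch of tabular Q-learning in which an unvisited state borrows the greedy action of the most similar previously visited state. The env.reset()/env.actions()/env.step() interface and the similarity() function are assumptions for the sketch, not the paper's algorithm.

```python
import random
from collections import defaultdict

def similar_state_q_learning(env, similarity, episodes=500, alpha=0.1,
                             gamma=0.95, epsilon=0.1):
    """Q-learning where unseen states reuse the greedy action of the most
    similar visited state (similarity() returns a score, higher = more alike)."""
    Q = defaultdict(lambda: defaultdict(float))

    def greedy_action(state, actions):
        if state in Q and Q[state]:
            return max(Q[state], key=Q[state].get)
        if Q:  # fall back to the most similar previously visited state
            nearest = max(Q, key=lambda s: similarity(s, state))
            if Q[nearest]:
                return max(Q[nearest], key=Q[nearest].get)
        return random.choice(actions)

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = greedy_action(state, actions)
            next_state, reward, done = env.step(action)
            best_next = max(Q[next_state].values(), default=0.0)
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```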


2018 ◽  
Vol 71 (1) ◽  
pp. 93-102 ◽  
Author(s):  
Jennifer Wiley ◽  
Tim George ◽  
Keith Rayner

Two experiments investigated the effects of domain knowledge on the resolution of ambiguous words whose dominant meanings are related to baseball. When such words were placed in a sentence context that was strongly biased toward the non-baseball meaning (positive evidence), or that excluded the baseball meaning (negative evidence), baseball experts had more difficulty than non-experts resolving the ambiguity. Sentence contexts containing positive evidence supported earlier resolution than the negative-evidence condition for both experts and non-experts. These experiments extend prior findings and can be seen as support for the reordered access model of lexical access, in which both prior knowledge and discourse context influence the availability of word meanings.


Author(s):  
Ying Zheng ◽  
Haoyu Chen ◽  
Qingyang Duan ◽  
Lixiang Lin ◽  
Yiyang Shao ◽  
...  

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-25 ◽
Author(s):  
Zhenge Jia ◽  
Yiyu Shi ◽  
Samir Saba ◽  
Jingtong Hu

Atrial Fibrillation (AF), one of the most prevalent arrhythmias, is an irregular heart rhythm that can cause serious health problems such as stroke and heart failure. Deep-learning-based methods have been exploited to provide end-to-end AF detection by automatically extracting features from the Electrocardiogram (ECG) signal, and they achieve state-of-the-art results. However, pre-trained models cannot adapt to each patient’s rhythm due to the high variability of rhythm characteristics among patients. Furthermore, deep models are prone to overfitting when fine-tuned on the limited ECG of a specific patient for personalization. In this work, we propose a prior-knowledge-incorporated learning method to effectively personalize the model for patient-specific AF detection and alleviate the overfitting problem. More specifically, a prior-incorporated portion importance mechanism is proposed to make the network focus on the targeted portion of the ECG, following cardiologists’ domain knowledge in recognizing AF. A prior-incorporated regularization mechanism is further devised to alleviate model overfitting during personalization by regularizing the fine-tuning process with feature priors on typical AF rhythms of the general population. The proposed personalization method embeds this well-defined prior knowledge about diagnosing AF rhythm into the personalization procedure, which improves the personalized deep model and eliminates the workload of manually adjusting parameters in conventional AF detection methods. The prior-knowledge-incorporated personalization is conducted feasibly and semi-automatically on the edge device of the cardiac monitoring system. We report an average AF detection accuracy of 95.3% across three deep models over all patients, surpassing the pre-trained model by a large margin of 11.5% and the fine-tuning strategy by 8.6%.
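
The regularization idea can be sketched as a generic fine-tuning loss that penalizes deviation of the personalized model's features from population-level feature priors. This is a hedged PyTorch illustration; the names (features, prior_features, lam) are assumptions, not the paper's implementation.

```python
import torch

def prior_regularized_loss(logits, labels, features, prior_features, lam=0.1):
    """Classification loss plus a penalty keeping patient-specific features
    close to priors computed on typical AF rhythms of the general population."""
    ce = torch.nn.functional.cross_entropy(logits, labels)   # AF vs. non-AF loss
    reg = torch.mean((features - prior_features) ** 2)        # stay near the prior
    return ce + lam * reg
```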


2021 ◽  
Author(s):  
Yunfan Su

A vehicular ad hoc network (VANET) is a promising technology that improves traffic safety and transportation efficiency and provides a comfortable driving experience. However, due to the rapid growth of applications that demand channel resources, efficient channel allocation schemes are required to exploit the full performance of vehicular networks. In this thesis, two reinforcement learning (RL)-based channel allocation methods are proposed for a cognitive-enabled VANET environment to maximize a long-term average system reward. First, we present a model-based dynamic programming method, which requires calculating the transition probabilities and the time intervals between decision epochs. After obtaining the transition probabilities and time intervals, a relative value iteration (RVI) algorithm is used to find the asymptotically optimal policy. Then, we propose a model-free reinforcement learning method, in which an agent interacts with the environment iteratively and learns from the feedback to approximate the optimal policy. Simulation results show that our reinforcement learning method achieves performance similar to that of dynamic programming, while both outperform the greedy method.
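
For the model-based part, a hedged sketch of relative value iteration on an average-reward MDP is given below; the transition matrices P and rewards R are assumed inputs, and the semi-Markov time intervals between decision epochs considered in the thesis are omitted for brevity.

```python
import numpy as np

def relative_value_iteration(P, R, ref_state=0, tol=1e-6, max_iter=10000):
    """Relative value iteration sketch.
    P[a][s, s'] are transition probabilities, R[a][s] expected rewards."""
    n_states = P[0].shape[0]
    h = np.zeros(n_states)
    for _ in range(max_iter):
        q = np.array([R[a] + P[a] @ h for a in range(len(P))])  # |A| x |S|
        h_new = q.max(axis=0)
        h_new -= h_new[ref_state]   # subtract reference value to keep h bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    policy = q.argmax(axis=0)       # greedy action per state
    return policy, h
```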


Author(s):  
Yongbiao Gao ◽  
Yu Zhang ◽  
Xin Geng

Label distribution learning (LDL) is a novel machine learning paradigm that assigns to an instance a description degree for each label. However, most training datasets contain only simple logical labels rather than label distributions, because label distributions are difficult to obtain directly. We propose to use prior knowledge to recover the label distributions. The process of recovering label distributions from logical labels is called label enhancement. In this paper, we formulate label enhancement as a dynamic decision process: the label distribution is adjusted by a series of actions taken by a reinforcement learning agent according to sequential state representations, and the target state is defined by the prior knowledge. Experimental results show that the proposed approach outperforms state-of-the-art methods on both age estimation and image emotion recognition.
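
As a rough illustration of label enhancement as a sequential decision process, the sketch below starts from the logical labels and greedily moves small amounts of probability mass toward a target distribution defined by prior knowledge; a learned RL policy would replace this greedy rule, and all names and step sizes are illustrative assumptions.

```python
import numpy as np

def enhance_labels(logical_labels, prior_target, steps=50, delta=0.05):
    """Greedy stand-in for the RL agent: at each step, shift `delta` mass
    between two labels if doing so reduces the divergence to `prior_target`,
    the distribution defined by prior knowledge."""
    d = np.asarray(logical_labels, dtype=float)
    d = d / d.sum()  # initial distribution from the logical labels

    def divergence(p, q):
        p, q = np.clip(p, 1e-8, 1.0), np.clip(q, 1e-8, 1.0)
        return float(np.sum(q * np.log(q / p)))  # KL(target || current)

    for _ in range(steps):
        best_score, best_dist = divergence(d, prior_target), None
        for i in range(len(d)):
            for j in range(len(d)):
                if i == j or d[i] < delta:
                    continue
                cand = d.copy()
                cand[i] -= delta   # action: move mass from label i to label j
                cand[j] += delta
                score = divergence(cand, prior_target)
                if score < best_score:
                    best_score, best_dist = score, cand
        if best_dist is None:
            break  # no action improves the distribution further
        d = best_dist
    return d
```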

