Episodic Self-Imitation Learning with Hindsight

Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1742
Author(s):  
Tianhong Dai ◽  
Hengyan Liu ◽  
Anil Anthony Bharath

Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state–action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode used in the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in continuous control environments with sparse rewards. In the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving performance comparable to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. With the capability of solving sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems that have continuous action spaces, such as robot guidance and manipulation.
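As a rough illustration of the idea, the sketch below relabels a completed episode with the goal it actually achieved (hindsight) and then keeps only the state–action pairs whose empirical return exceeds the current value estimate, standing in for the trajectory selection module. The episode layout, reward definition, discount factor, and selection criterion are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def hindsight_relabel(episode, goal_tol=0.05):
    """Replace the original goal with the goal actually achieved at the episode's end."""
    achieved_goal = episode["achieved_goals"][-1]
    relabelled = dict(episode)
    relabelled["goals"] = [achieved_goal] * len(episode["states"])
    # Recompute the sparse reward against the substituted goal.
    relabelled["rewards"] = [
        0.0 if np.allclose(g, achieved_goal, atol=goal_tol) else -1.0
        for g in episode["achieved_goals"]
    ]
    return relabelled

def select_informative(episode, value_fn, gamma=0.98, margin=0.0):
    """Selection module: keep state-action pairs whose return beats the value estimate."""
    returns, g = [], 0.0
    for r in reversed(episode["rewards"]):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    keep = [i for i, ret in enumerate(returns)
            if ret > value_fn(episode["states"][i]) + margin]
    return [(episode["states"][i], episode["actions"][i]) for i in keep]
```

The retained pairs would then serve as imitation targets (e.g. via a behavior-cloning loss) alongside the on-policy update.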

2021 ◽  
Vol 11 (7) ◽  
pp. 3068
Author(s):  
Neda Navidi ◽  
Rene Landry

Reinforcement Learning (RL) provides effective results with an agent learning from a stand-alone reward function. However, it presents unique challenges with large state and action spaces, as well as in the determination of rewards. Imitation Learning (IL) offers a promising solution to those challenges by using a teacher. In IL, the learning process can take advantage of human-sourced assistance and/or control over the agent and environment. A human teacher and an agent learner are considered in this study. The teacher takes part in the agent's training towards dealing with the environment, tackling a specific objective, and achieving a predefined goal. This paper proposes a novel approach combining IL with different types of RL methods, namely state-action-reward-state-action (SARSA) and Asynchronous Advantage Actor-Critic (A3C), to overcome the problems of both stand-alone systems. It addresses how to effectively leverage the teacher's feedback, whether direct binary or indirect detailed feedback, so that the agent learner can learn sequential decision-making policies. The results of this study on various OpenAI Gym environments show that this algorithmic method can be incorporated in different combinations and significantly decreases both human effort and the tedious exploration process.
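One simple way to combine a teacher's direct binary feedback with an RL update, in the spirit of the approach described above, is to fold the feedback into the reward used by tabular SARSA. The feedback weight, its sign convention, and the tabular setting are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def sarsa_update_with_teacher(Q, s, a, r, s_next, a_next, feedback,
                              alpha=0.1, gamma=0.99, beta=0.5):
    """feedback: +1 (teacher approves the action), -1 (disapproves), 0 (no feedback)."""
    shaped_r = r + beta * feedback                    # teacher feedback as a shaping term
    td_target = shaped_r + gamma * Q[s_next, a_next]  # on-policy SARSA target
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

An analogous shaping term could be added to the advantage estimate in an actor-critic method such as A3C.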


2000 ◽  
Vol 14 (2) ◽  
pp. 243-258 ◽  
Author(s):  
V. S. Borkar

A simulation-based algorithm for learning good policies for a discrete-time stochastic control process with unknown transition law is analyzed when the state and action spaces are compact subsets of Euclidean spaces. This extends the Q-learning scheme of discrete state/action problems along the lines of Baker [4]. Almost sure convergence is proved under suitable conditions.
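For reference, the discrete-state/action Q-learning rule that the paper generalizes to compact spaces is the familiar one-step update sketched below (a textbook version, not the simulation-based algorithm analyzed in the paper).

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One step of tabular Q-learning on a finite state/action space."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q
```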


Author(s):  
Mohd Rashdan Abdul Kadir ◽  
Ali Selamat ◽  
Ondrej Krejcar

Normative multi-agent research is an alternative viewpoint in the design of adaptive autonomous agent architectures. Norms specify standards of behavior, such as which actions or states should be achieved or avoided. Norm synthesis is the process of generating useful normative rules. This study proposes a model for extracting normative rules from implicit learning, namely the Q-learning algorithm, into an explicit norm representation by implementing Dynamic Deontics and a Hierarchical Knowledge Base (HKB) to synthesize useful normative rules in the form of weighted state-action pairs with deontic modality. OpenAI Gym is used to simulate the agent environment. Our proposed model is able to generate both obligation and prohibition norms, as well as to deliberate on and execute those norms. Results show that the generated norms are best used as prior knowledge to guide agent behavior and perform poorly if not complemented by another agent coordination mechanism. Performance increases when using both obligation and prohibition norms, and, in general, norms speed up reaching the optimal policy.
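A minimal sketch of the norm-synthesis idea is shown below: after Q-learning, high-valued state-action pairs are promoted to obligation norms and low-valued ones to prohibition norms, each carried as a weighted state-action pair. The quantile thresholds and the rule tuple format are assumptions for illustration, not the paper's Dynamic Deontics or HKB implementation.

```python
import numpy as np

def synthesise_norms(Q, obligate_quantile=0.8, prohibit_quantile=0.2):
    """Turn a learned Q-table into weighted state-action norms with deontic modality."""
    lo, hi = np.quantile(Q, prohibit_quantile), np.quantile(Q, obligate_quantile)
    norms = []
    for s in range(Q.shape[0]):
        for a in range(Q.shape[1]):
            if Q[s, a] >= hi:
                norms.append(("OBLIGED", s, a, float(Q[s, a])))      # do a in s
            elif Q[s, a] <= lo:
                norms.append(("PROHIBITED", s, a, float(Q[s, a])))   # avoid a in s
    return norms
```

Such norms could then serve as prior knowledge for a fresh agent, consistent with the finding that they work best alongside another coordination mechanism.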


Author(s):  
Bo Yang ◽  
Min Liu

Effective collaboration among autonomous unmanned aerial vehicles (UAVs) relies on timely information sharing. However, the time-varying flight environment and intermittent link connectivity pose great challenges to message delivery. In this paper, we leverage deep reinforcement learning (DRL) to address the UAVs' optimal link discovery and selection problem in uncertain environments. As multi-agent learning efficiency is constrained by high-dimensional and continuous action spaces, we slice the whole action space into a number of tractable fractions to achieve efficient convergence to optimal policies in continuous domains. Moreover, for the nonstationarity issue that particularly challenges multi-agent DRL with local perceptions, we present a multi-agent mutual sampling method that jointly exploits intra-agent and inter-agent state-action information to stabilize and expedite the training procedure. We evaluate the proposed algorithm on the UAVs' continuous network connection task. Results show that the associated UAVs can quickly select the optimal connected links, which facilitates the UAVs' teamwork significantly.
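The action-space slicing can be pictured as a two-stage choice: a discrete head first picks one of several tractable sub-intervals of the continuous action range, and a continuous offset then places the action within that slice. The decomposition below is an illustration of this idea under assumed interfaces, not the authors' exact architecture.

```python
import numpy as np

def slice_action_space(low, high, n_slices):
    """Partition the continuous range [low, high] into n_slices equal sub-intervals."""
    edges = np.linspace(low, high, n_slices + 1)
    return list(zip(edges[:-1], edges[1:]))

def select_action(slice_preferences, offset, slices):
    """Pick the preferred slice, then place a continuous action inside it."""
    k = int(np.argmax(slice_preferences))
    lo, hi = slices[k]
    return lo + np.clip(offset, 0.0, 1.0) * (hi - lo)

slices = slice_action_space(-1.0, 1.0, n_slices=8)       # e.g. a normalized link-selection action
action = select_action(np.random.randn(8), offset=0.3, slices=slices)
```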


2021 ◽  
Vol 54 (3-4) ◽  
pp. 417-428
Author(s):  
Yanyan Dai ◽  
KiDong Lee ◽  
SukGyu Lee

For real applications, rotary inverted pendulum systems are well known as a basic model in nonlinear control. Without a deep understanding of control, it is difficult to control a rotary inverted pendulum platform using classic control engineering models, as shown in Section 2.1. Therefore, without classic control theory, this paper controls the platform by training and testing a reinforcement learning algorithm. Many recent achievements in reinforcement learning (RL) have become possible, but there is a lack of research on quickly testing high-frequency RL algorithms in a real hardware environment. In this paper, we propose a real-time hardware-in-the-loop (HIL) control system to train and test the deep reinforcement learning algorithm from simulation to real hardware implementation. The agent is implemented with the Double Deep Q-Network (DDQN) with prioritized experience replay, without requiring a deep understanding of classical control engineering. For the real experiment, we define 21 actions to swing up the rotary inverted pendulum and balance it smoothly. Compared with the Deep Q-Network (DQN), DDQN with prioritized experience replay reduces the overestimation of Q-values and decreases the training time. Finally, this paper presents experimental results comparing classic control theory and different reinforcement learning algorithms.
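The DDQN target mentioned above decouples action selection from action evaluation, which is what reduces the overestimation of Q-values relative to vanilla DQN. The sketch below shows only the target computation, with assumed array-valued network outputs; the prioritized replay weighting is omitted.

```python
import numpy as np

def dqn_target(r, q_target_next, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return r + gamma * np.max(q_target_next)

def ddqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    a_star = int(np.argmax(q_online_next))
    return r + gamma * q_target_next[a_star]
```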


2014 ◽  
Vol 665 ◽  
pp. 643-646
Author(s):  
Ying Liu ◽  
Yan Ye ◽  
Chun Guang Li

A metalearning algorithm learns to tune the base learning algorithm, with the aim of improving the performance of the learning system. The incremental delta-bar-delta (IDBD) algorithm is such a metalearning algorithm. On the other hand, sparse algorithms are gaining popularity due to their good performance and wide applications. In this paper, we propose a sparse IDBD algorithm by taking the sparsity of the systems into account. A sparsity-inducing ℓ1-norm penalty is included in the cost function of the standard IDBD, which is equivalent to adding a zero attractor to the iterations and can thus speed up convergence if the system of interest is indeed sparse. Simulations demonstrate that the proposed algorithm is superior to competing algorithms in sparse system identification.
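A minimal sketch of an IDBD-style update with a zero attractor is given below, assuming the standard IDBD recursions for a linear predictor (per-weight log step-sizes adapted by a meta step-size); the attractor strength rho and its placement illustrate the sparsity penalty rather than reproduce the paper's exact derivation.

```python
import numpy as np

def sparse_idbd_step(w, beta, h, x, target, theta=0.01, rho=1e-4):
    """One sparse-IDBD update for a linear model y = w.x (all arrays share one shape)."""
    err = target - w @ x                        # prediction error
    beta += theta * err * x * h                 # meta-learn per-weight log step-sizes
    alpha = np.exp(beta)                        # per-weight step-sizes
    w += alpha * err * x - rho * np.sign(w)     # LMS step plus zero attractor toward sparsity
    h = h * np.maximum(0.0, 1.0 - alpha * x ** 2) + alpha * err * x
    return w, beta, h
```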


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Liding Yao ◽  
Xiaojun Guan ◽  
Xiaowei Song ◽  
Yanbin Tan ◽  
Chun Wang ◽  
...  

Rib fracture detection is time-consuming and demanding work for radiologists. This study aimed to introduce a novel rib fracture detection system based on deep learning which can help radiologists diagnose rib fractures in chest computed tomography (CT) images conveniently and accurately. A total of 1707 patients were included in this study from a single center. We developed a novel rib fracture detection system on chest CT using a three-step algorithm. According to the examination time, 1507, 100 and 100 patients were allocated to the training set, the validation set and the testing set, respectively. Free-response ROC analysis was performed to evaluate the sensitivity and false positive rate of the deep learning algorithm. Precision, recall, F1-score, negative predictive value (NPV) and detection and diagnosis time were selected as evaluation metrics to compare the diagnostic efficiency of this system with radiologists. The radiologist-only study was used as a benchmark and the radiologist-model collaboration study was evaluated to assess the model's clinical applicability. A total of 50,170,399 blocks (fracture blocks, 91,574; normal blocks, 50,078,825) were labelled for training. The F1-score of the Rib Fracture Detection System was 0.890 and the precision, recall and NPV values were 0.869, 0.913 and 0.969, respectively. By interacting with this detection system, the F1-scores of the junior and the experienced radiologists improved from 0.796 to 0.925 and 0.889 to 0.970, respectively; the recall scores increased from 0.693 to 0.920 and 0.853 to 0.972, respectively. On average, the diagnosis time of radiologists assisted by this detection system was reduced by 65.3 s. The constructed Rib Fracture Detection System performs comparably to the experienced radiologist and is readily available to automatically detect rib fractures in the clinical setting with high efficacy, which could reduce diagnosis time and radiologists' workload in clinical practice.
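As a quick consistency check, the reported F1-score follows from the reported precision and recall as their harmonic mean:

```python
precision, recall = 0.869, 0.913
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))   # ~0.890, matching the reported F1-score
```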


Author(s):  
Buvanesh Pandian V

Reinforcement learning is a mathematical framework for agents to interact intelligently with their environment. Unlike supervised learning, where a system learns with the help of labeled data, reinforcement learning agents learn how to act by trial and error, receiving only a reward signal from their environments. A field where reinforcement learning has been prominently successful is robotics [3]. However, real-world control problems are also particularly challenging because of the noise and high dimensionality of input data (e.g., visual input). In recent years, in the field of supervised learning, deep neural networks have been successfully used to extract meaning from this kind of data. Building on these advances, deep reinforcement learning was used to solve complex problems like Atari games and Go. Mnih et al. [1] built a system with fixed hyperparameters able to learn to play 49 different Atari games from raw pixel inputs only. However, in order to apply the same methods to real-world control problems, deep reinforcement learning has to be able to deal with continuous action spaces. Discretizing continuous action spaces would scale poorly, since the number of discrete actions grows exponentially with the dimensionality of the action. Furthermore, having a parametrized policy can be advantageous because it can generalize in the action space. Therefore, in this thesis we study the state-of-the-art deep reinforcement learning algorithm Deep Deterministic Policy Gradients (DDPG). We provide a theoretical comparison to other popular methods, an evaluation of its performance, identify its limitations and investigate future directions of research. The remainder of the thesis is organized as follows. We start by introducing the field of interest, machine learning, focusing our attention on deep learning and reinforcement learning. We continue by describing in detail the two main algorithms at the core of this study, namely Deep Q-Network (DQN) and Deep Deterministic Policy Gradients (DDPG). We then provide implementation details of DDPG and our test environment, followed by a description of benchmark test cases. Finally, we discuss the results of our evaluation, identifying limitations of the current approach and proposing future avenues of research.
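The scaling argument against discretization can be made concrete: with k bins per action dimension, the number of joint discrete actions is k raised to the action dimensionality, as the short check below illustrates (bin count and dimensionalities chosen arbitrarily).

```python
k = 10                      # bins per action dimension
for d in (1, 3, 7):         # e.g. up to a 7-DoF manipulator
    print(d, k ** d)        # 10, 1000, 10000000 joint actions
```

This exponential growth is what motivates a parametrized continuous policy such as DDPG's deterministic actor.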


2017 ◽  
Vol 7 (1.5) ◽  
pp. 274
Author(s):  
D. Ganesha ◽  
Vijayakumar Maragal Venkatamuni

This research work presents an analysis of a modified SARSA learning algorithm. State-Action-Reward-State-Action (SARSA) is a technique for learning a Markov decision process (MDP) policy, used in reinforcement learning in the fields of artificial intelligence (AI) and machine learning (ML). The modified SARSA algorithm selects better actions to obtain better rewards. Experiments are conducted to evaluate the performance of each agent individually, and the same statistics were collected to compare results among different agents. This work considers varied kinds of agents at different levels of the architecture for the experimental analysis. The Fungus world testbed, implemented in SWI-Prolog 5.4.6, is used for the experiments; fixed obstacles are placed to create locations specific to the Fungus world environment, and various parameters are introduced into the environment to test an agent's performance. This modified SARSA learning algorithm can be more suitable for the EMCAP architecture. The experiments show that the modified SARSA learning system obtains more reward than the existing SARSA algorithm.
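For context, a reference sketch of the standard on-policy SARSA loop that the modified algorithm builds on is given below, assuming a Gym-style environment interface with integer states and actions; the paper's specific modification for choosing better actions is not reproduced here.

```python
import numpy as np

def eps_greedy(Q, s, eps):
    """Epsilon-greedy action selection over a tabular Q function."""
    return np.random.randint(Q.shape[1]) if np.random.rand() < eps else int(np.argmax(Q[s]))

def run_sarsa_episode(env, Q, alpha=0.1, gamma=0.9, eps=0.1):
    s, _ = env.reset()
    a = eps_greedy(Q, s, eps)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = eps_greedy(Q, s_next, eps)
        # On-policy target: uses the action actually taken next, unlike Q-learning's max.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
        s, a = s_next, a_next
    return Q
```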


1994 ◽  
Vol 05 (01) ◽  
pp. 67-75 ◽  
Author(s):  
BYOUNG-TAK ZHANG

Much previous work on training multilayer neural networks has attempted to speed up the backpropagation algorithm using more sophisticated weight modification rules, whereby all the given training examples are used in a random or predetermined sequence. In this paper we investigate an alternative approach in which the learning proceeds on an increasing number of selected training examples, starting with a small training set. We derive a measure of criticality of examples and present an incremental learning algorithm that uses this measure to select a critical subset of given examples for solving the particular task. Our experimental results suggest that the method can significantly improve training speed and generalization performance in many real applications of neural networks. This method can be used in conjunction with other variations of gradient descent algorithms.
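A hedged sketch of the incremental selection idea follows: training starts from a small subset and, at each round, the examples the current network gets most wrong are added. Using the prediction error as the criticality measure is an assumption standing in for the measure derived in the paper.

```python
import numpy as np

def grow_training_set(model_predict, X, y, selected_idx, batch=16):
    """Add the currently most 'critical' (worst-predicted) examples to the training subset."""
    remaining = np.setdiff1d(np.arange(len(X)), selected_idx)
    errors = np.abs(model_predict(X[remaining]) - y[remaining])   # 1-D targets assumed
    newly_critical = remaining[np.argsort(errors)[-batch:]]
    return np.concatenate([selected_idx, newly_critical])
```

After each growth step the network would be retrained (or fine-tuned) on the enlarged subset before the next selection round.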

