Deep Reinforcement Learning with Adaptive Update Target Combination
Abstract
Simple and efficient exploration remains a core challenge in deep reinforcement learning. While many exploration methods can be applied to high-dimensional tasks, they typically require manual tuning of exploration parameters based on domain knowledge. This paper proposes a novel method that automatically balances exploration and exploitation by combining on-policy and off-policy update targets through a dynamic weighting scheme based on the value difference. Rather than directly altering the probability of selecting an action, the proposed method uses the value difference produced during learning to adjust the update target and thereby guide the direction of the agent's learning. We demonstrate the performance of the proposed method on the CartPole-v1, MountainCar-v0, and LunarLander-v2 classic control tasks from the OpenAI Gym. Empirical results show that by dynamically integrating on-policy and off-policy update targets, the method achieves better performance and stability than the exclusive use of either update target.
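To make the idea of a dynamically weighted update target concrete, the sketch below blends a Q-learning (off-policy) target with a Sarsa (on-policy) target for a tabular value function. The abstract does not specify the exact form of the value-difference weighting, so the sigmoid weight `beta` over the gap between the greedy and on-policy next-state values is a hypothetical illustration, not the paper's actual scheme.

```python
import numpy as np

def combined_target(Q, s_next, a_next, reward, gamma=0.99):
    """Blend off-policy (Q-learning) and on-policy (Sarsa) update targets.

    The weighting below is an assumed illustration: the value difference
    between the greedy action value and the actually selected action's
    value drives a sigmoid weight toward the off-policy target.
    """
    off_policy = reward + gamma * np.max(Q[s_next])     # Q-learning target
    on_policy = reward + gamma * Q[s_next, a_next]      # Sarsa target
    value_diff = np.max(Q[s_next]) - Q[s_next, a_next]  # always >= 0
    beta = 1.0 / (1.0 + np.exp(-value_diff))            # dynamic weight
    return beta * off_policy + (1.0 - beta) * on_policy

# Example: a 2-state, 2-action value table
Q = np.array([[1.0, 2.0],
              [0.5, 3.0]])
target = combined_target(Q, s_next=1, a_next=0, reward=1.0, gamma=0.9)
```

When the agent's selected action matches the greedy action, the value difference is zero and the two targets coincide; a larger gap pushes the combined target toward the off-policy (greedy) estimate.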