Fast and slow curiosity for high-level exploration in reinforcement learning

Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Abstract Deep reinforcement learning (DRL) algorithms rely on carefully designed environment rewards that are extrinsic to the agent. However, in many real-world scenarios rewards are sparse or delayed, motivating the need for discovering efficient exploration strategies. While intrinsically motivated agents hold the promise of better local exploration, solving problems that require coordinated decisions over long time horizons remains an open problem. We postulate that to discover such strategies, a DRL agent should be able to combine local and high-level exploration behaviors. To this end, we introduce the concept of fast and slow curiosity, which aims to incentivize exploration over long time horizons. Our method decomposes the curiosity bonus into a fast reward that deals with local exploration and a slow reward that encourages global exploration. We formulate this bonus as the error in an agent’s ability to reconstruct the observations given their contexts. We further propose to dynamically weight local and high-level strategies by measuring state diversity. We evaluate our method on a variety of benchmark environments, including Minigrid, Super Mario Bros, and Atari games. Experimental results show that our agent outperforms prior approaches in most tasks in terms of exploration efficiency and mean scores.

Author(s):  
Nicolas Bougie ◽  
Ryutaro Ichise

Deep reinforcement learning (DRL) methods traditionally struggle with tasks where environment rewards are sparse or delayed, so exploration remains one of the key challenges of DRL. Instead of relying solely on extrinsic rewards, many state-of-the-art methods use intrinsic curiosity as an exploration signal. While such methods hold the promise of better local exploration, discovering global exploration strategies is beyond the reach of current methods. We propose a novel end-to-end intrinsic reward formulation that introduces high-level exploration in reinforcement learning. Our curiosity signal is driven by a fast reward that deals with local exploration and a slow reward that incentivizes exploration strategies over long time horizons. We formulate curiosity as the error in an agent’s ability to reconstruct the observations given their contexts. Experimental results show that this high-level exploration enables our agents to outperform prior work in several Atari games.
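
The two entries above describe the same fast-and-slow curiosity idea. As a purely illustrative sketch (not the authors' implementation), the Python snippet below shows one way a fast (local) and a slow (global) intrinsic reward could be mixed; fast_error and slow_error stand in for the reconstruction errors of an observation given its short-term and long-term contexts, and the diversity-based weighting rule is an assumption.

import numpy as np

def curiosity_bonus(fast_error, slow_error, recent_states, w_min=0.1, w_max=0.9):
    # Illustrative diversity measure: mean pairwise distance of recent state embeddings.
    if len(recent_states) > 1:
        dists = [np.linalg.norm(a - b)
                 for i, a in enumerate(recent_states)
                 for b in recent_states[i + 1:]]
        diversity = float(np.mean(dists))
    else:
        diversity = 0.0
    # The more diverse the recent states, the more weight on the slow (global) reward.
    w_slow = float(np.clip(diversity / (1.0 + diversity), w_min, w_max))
    return (1.0 - w_slow) * fast_error + w_slow * slow_error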


Entropy ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. 830
Author(s):  
Yuan Feng ◽  
Baoan Ren ◽  
Chengyi Zeng ◽  
Yuyuan Yang ◽  
Hongfu Liu

Network disintegration has long been an important research topic in complex networks. From the perspective of node attacks, researchers have devoted considerable effort to this field and carried out numerous studies. In contrast, research on edge attack strategies is insufficient. This paper comprehensively evaluates the disintegration effect of each structural similarity index when applied to the weighted-edge attack model. Experimental results show that an edge attack strategy based on a single similarity index exhibits limited stability and adaptability. Thus, motivated by obtaining a stable disintegration effect, this paper designs an edge attack strategy based on the ordered weighted averaging (OWA) operator. The final experimental results show that the proposed edge attack strategy not only achieves a more stable disintegration effect across eight real-world networks, but also significantly improves the disintegration effect on a single network compared with the original similarity indices.
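
As a rough illustration of the aggregation idea (not the paper's exact procedure), the sketch below scores each edge with a few structural similarity indices, combines them with an OWA operator, removes the highest-scoring edges, and reports the relative size of the largest remaining component; the choice of indices and OWA weights is an assumption.

import networkx as nx
import numpy as np

def owa(scores, weights):
    # Ordered weighted averaging: sort scores in descending order, dot with the weights.
    return float(np.dot(sorted(scores, reverse=True), weights))

def edge_attack_owa(G, n_remove, owa_weights=(0.5, 0.3, 0.2)):
    G = G.copy()
    scores = {}
    for u, v in G.edges():
        nu, nv = set(G[u]) - {v}, set(G[v]) - {u}
        common = len(nu & nv)
        union = len(nu | nv)
        jaccard = common / union if union else 0.0
        salton = common / np.sqrt(len(nu) * len(nv)) if nu and nv else 0.0
        cn = common / (common + 1.0)      # bounded common-neighbour score
        scores[(u, v)] = owa([jaccard, salton, cn], owa_weights)
    # Remove the edges with the highest aggregated similarity first.
    for edge in sorted(scores, key=scores.get, reverse=True)[:n_remove]:
        G.remove_edge(*edge)
    # Disintegration effect: relative size of the largest connected component.
    return len(max(nx.connected_components(G), key=len)) / G.number_of_nodes()

# Example: print(edge_attack_owa(nx.karate_club_graph(), 20))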


Author(s):  
Jun Xu ◽  
Zeyang Lei ◽  
Haifeng Wang ◽  
Zheng-Yu Niu ◽  
Hua Wu ◽  
...  

Generating informative, coherent and sustainable open-domain conversations is a non-trivial task. Previous work on knowledge-grounded conversation generation focuses on improving dialog informativeness, with little attention to dialog coherence. In this paper, to enhance multi-turn dialog coherence, we propose leveraging event chains to help determine the sketch of a multi-turn dialog. We first extract event chains from narrative texts and connect them into a graph. We then present a novel event-graph-grounded Reinforcement Learning (RL) framework. It performs high-level planning of the response content (simply an event) by learning to walk over the graph, and then produces a response conditioned on the planned content. In particular, we devise a novel multi-policy decision-making mechanism to foster a coherent dialog with both appropriate content ordering and high contextual relevance. Experimental results indicate the effectiveness of this framework in terms of dialog coherence and informativeness.
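
As a toy illustration of event-graph-grounded planning (the graph, scoring function and generator below are placeholders, not the framework described above), a high-level step picks the next event by walking the graph, and a low-level step produces a response conditioned on that event.

import random

# Hypothetical event graph: each event maps to its possible successor events.
event_graph = {
    "meet_friend": ["have_coffee", "go_shopping"],
    "have_coffee": ["chat_about_work", "order_dessert"],
    "go_shopping": ["buy_gift"],
}

def plan_next_event(current_event, context, score_fn):
    # High-level planning: pick the successor event the (hypothetical) policy scores highest.
    candidates = event_graph.get(current_event, [])
    if not candidates:
        return None
    return max(candidates, key=lambda e: score_fn(e, context))

def generate_response(event, context):
    # Low-level step: a real system would use a neural generator conditioned on the event.
    return "(response grounded in event: {})".format(event)

# Toy usage with a random scoring policy.
next_event = plan_next_event("meet_friend", ["Hi!"], lambda e, c: random.random())
print(generate_response(next_event, ["Hi!"]))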


Author(s):  
Hui Xu ◽  
Chong Zhang ◽  
Jiaxing Wang ◽  
Deqiang Ouyang ◽  
Yu Zheng ◽  
...  

Efficient exploration is a major challenge in Reinforcement Learning (RL) and has been studied extensively. However, for a new task, existing methods explore either by taking actions that maximize task-agnostic objectives (such as information gain) or by applying a simple dithering strategy (such as noise injection), which might not be effective enough. In this paper, we investigate whether previous learning experiences can be leveraged to guide exploration in a new task. To this end, we propose a novel Exploration with Structured Noise in Parameter Space (ESNPS) approach. ESNPS utilizes meta-learning and directly uses meta-policy parameters, which contain prior knowledge, as structured noise to perturb the base model for effective exploration in new tasks. Experimental results on four groups of tasks (cheetah velocity, cheetah direction, ant velocity and ant direction) demonstrate the superiority of ESNPS over a number of competitive baselines.
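
A minimal sketch of the structured-noise idea, under the assumption that exploration perturbs the base policy's parameters in the direction of meta-policy parameters; the scaling rule and parameter layout below are illustrative, not the exact ESNPS procedure.

import numpy as np

def perturb_with_meta_params(base_params, meta_params, sigma=0.1):
    # Structured noise in parameter space: the perturbation direction comes from
    # the meta-policy parameters (prior knowledge), scaled by a random magnitude.
    perturbed = {}
    for name, w in base_params.items():
        direction = meta_params[name]
        perturbed[name] = w + sigma * np.random.randn() * direction
    return perturbed

# Toy usage with two parameter tensors.
base = {"layer1": np.zeros((4, 4)), "layer2": np.zeros(4)}
meta = {"layer1": np.ones((4, 4)), "layer2": np.ones(4)}
exploration_params = perturb_with_meta_params(base, meta)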


2020 ◽  
Vol 34 (05) ◽  
pp. 7927-7934
Author(s):  
Zhengqiu He ◽  
Wenliang Chen ◽  
Yuyi Wang ◽  
Wei Zhang ◽  
Guanchun Wang ◽  
...  

We present a novel approach to improve the performance of distant supervision relation extraction with Positive and Unlabeled (PU) Learning. This approach first applies reinforcement learning to decide whether a sentence is a positive instance of a given relation, and then constructs positive and unlabeled bags. In contrast to most previous studies, which mainly use only the selected positive instances, we make full use of the unlabeled instances and propose two new representations for positive and unlabeled bags. These two representations are then combined in an appropriate way to make bag-level predictions. Experimental results on a widely used real-world dataset demonstrate that this new approach achieves significant and consistent improvements over several competitive baselines.
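
A simplified sketch of the bag construction and bag-level prediction step, assuming mean-pooled bag representations and a linear scoring rule; both are stand-ins for the learned components described above, and the selector probabilities are hypothetical.

import numpy as np

def split_bags(sentence_vecs, select_prob):
    # Selector stub: sentences a (hypothetical) RL policy judges positive for the
    # relation form the positive bag; the rest stay in the unlabeled bag.
    pos = [v for v, p in zip(sentence_vecs, select_prob) if p >= 0.5]
    unl = [v for v, p in zip(sentence_vecs, select_prob) if p < 0.5]
    return pos or sentence_vecs, unl or sentence_vecs   # avoid empty bags

def bag_level_score(pos_bag, unl_bag, weights, bias=0.0):
    # Combine mean representations of both bags into one bag-level relation score.
    rep = np.concatenate([np.mean(pos_bag, axis=0), np.mean(unl_bag, axis=0)])
    return 1.0 / (1.0 + np.exp(-(weights @ rep + bias)))

# Toy usage: five sentence encodings of dimension 4.
sents = [np.random.randn(4) for _ in range(5)]
probs = np.random.rand(5)
pos, unl = split_bags(sents, probs)
score = bag_level_score(pos, unl, weights=np.random.randn(8))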


2008 ◽  
Vol 41 (7) ◽  
pp. 971-1000 ◽  
Author(s):  
Joseph Wright

In this article, the author argues that the time horizon a dictator faces affects his incentives over the use of aid in three ways. First, dictators have a greater incentive to invest in public goods when they have a long time horizon. Second, dictators with short time horizons often face the threat of challengers to the regime; this leads them to forgo investment and instead consume state resources in two forms that harm growth: repression and private pay-offs to political opponents. Third, dictators with short time horizons have a strong incentive to secure personal wealth as a form of insurance in case the regime falls. Using panel data on dictatorships in 71 developing countries from 1961 to 2001, the author finds that time horizons have a positive impact on aid effectiveness: Foreign aid is associated with positive growth when dictators face long time horizons and negative growth when time horizons are short.


Author(s):  
Rundong Wang ◽  
Runsheng Yu ◽  
Bo An ◽  
Zinovi Rabinovich

Hierarchical reinforcement learning (HRL) is a promising approach to solving tasks with long time horizons and sparse rewards. It is often implemented as a high-level policy assigning subgoals to a low-level policy. However, it suffers from the high-level non-stationarity problem since the low-level policy is constantly changing. The non-stationarity also leads to a data efficiency problem: policies need more data at non-stationary states to stabilize training. To address these issues, we propose a novel HRL method: Interactive Influence-based Hierarchical Reinforcement Learning (I^2HRL). First, inspired by agent modeling, we enable interaction between the low-level and high-level policies to stabilize high-level policy training. The high-level policy makes decisions conditioned on the received low-level policy representation as well as the state of the environment. Second, we further stabilize the high-level policy via an information-theoretic regularization with minimal dependence on the changing low-level policy. Third, we propose influence-based exploration to visit more frequently the non-stationary states where more transition data is needed. We experimentally validate the effectiveness of the proposed solution in several MuJoCo tasks by demonstrating that our approach significantly boosts learning performance and accelerates learning compared with state-of-the-art HRL methods.
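
One piece of the interaction described above, sketched under the assumption that the low-level policy is summarized by a fixed random projection of its parameters; a real system would learn this encoder, so the code is only illustrative.

import numpy as np

def low_level_policy_representation(low_level_params, dim=8):
    # Compress the low-level policy parameters into a fixed-size vector.
    flat = np.concatenate([p.ravel() for p in low_level_params.values()])
    rng = np.random.default_rng(0)   # fixed random projection as an encoder stand-in
    proj = rng.standard_normal((dim, flat.size)) / np.sqrt(flat.size)
    return proj @ flat

def high_level_input(state, low_level_params):
    # The high-level policy conditions on both the environment state and a
    # representation of the current low-level policy, mitigating non-stationarity.
    return np.concatenate([state, low_level_policy_representation(low_level_params)])

# Toy usage: a 6-dimensional state and a small two-layer low-level policy.
state = np.zeros(6)
low_params = {"w1": np.random.randn(6, 16), "w2": np.random.randn(16, 3)}
obs_for_high_level = high_level_input(state, low_params)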


GeroPsych ◽  
2018 ◽  
Vol 31 (3) ◽  
pp. 151-162 ◽  
Author(s):  
Qiao Chu ◽  
Daniel Grühn ◽  
Ashley M. Holland

Abstract. We investigated the effects of time horizon and age on the socioemotional motives underlying individuals’ bucket-list goals. Participants were randomly assigned to one of three time-horizon conditions to make a bucket list: (1) an open-ended time horizon (Study 1 & 2), (2) a 6-month horizon (i.e., “Imagine you have 6 months to live”; Study 1 & 2), and (3) a 1-week horizon (Study 2). Goal motives were coded based on socioemotional selectivity theory and psychosocial development theory. Results indicated that time horizon and age produced unique effects on bucket-list goal motives. Extending past findings on people’s motives when considering the end of life, the findings suggest that different time horizons and life stages trigger different motives.


1973 ◽  
Vol 12 (1) ◽  
pp. 1-30
Author(s):  
Syed Nawab Haider Naqvi

The recent uncertainties about aid flows have underscored the need for achieving early independence from foreign aid. The Perspective Plan (1965-85) had envisaged the termination of Pakistan's dependence on foreign aid by 1985. However, in the context of West Pakistan alone the time horizon can now be advanced by several years with considerable confidence in its economy's ability to do so. The difficulty of achieving independence from foreign aid can be seen from the fact that aid flows make it possible for the policy-maker to pursue such ostensibly incompatible objectives as a balance in international payments (i.e., foreign aid finances the balance of payments), higher rates of economic growth (i.e., it pulls up domestic saving and investment levels), a high level of employment (i.e., it keeps industries working at a fuller capacity than would otherwise be the case), and a reasonably stable price level (i.e., it allows a higher level of imports than would otherwise be possible). Without aid, the simultaneous attainment of all these objectives at the former higher levels, together with balance in foreign payments, may become well-nigh impossible. Choices are, therefore, inevitable, not for definite places in the hierarchy of values, but rather for occasional "trade-offs". That is to say, we will have to choose how much to sacrifice in the attainment of one goal for the sake of somewhat better realization of another.

