A Model of External Memory for Navigation in Partially Observable Visual Reinforcement Learning Tasks

AbstractComplex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.

Download Full-text

Partially observable environment estimation with uplift inference for reinforcement learning based recommendation

Machine Learning ◽

10.1007/s10994-021-05969-w ◽

2021 ◽

Author(s):

Wenjie Shang ◽

Qingyang Li ◽

Zhiwei Qin ◽

Yang Yu ◽

Yiping Meng ◽

...

Keyword(s):

Reinforcement Learning ◽

Partially Observable

Download Full-text

On Thompson Sampling and Asymptotic Optimality

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/688 ◽

2017 ◽

Cited By ~ 3

Author(s):

Jan Leike ◽

Tor Lattimore ◽

Laurent Orseau ◽

Marcus Hutter

Keyword(s):

Reinforcement Learning ◽

Asymptotic Optimality ◽

Thompson Sampling ◽

Stochastic Environments ◽

Optimal Value ◽

Partially Observable ◽

General Stochastic

We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value and (2) given a recoverability assumption regret is sublinear. We conclude with a discussion about optimality in reinforcement learning.

Download Full-text

Playing a FPS Doom Video Game with Deep Visual Reinforcement Learning

Automatic Control and Computer Sciences ◽

10.3103/s0146411619030052 ◽

2019 ◽

Vol 53 (3) ◽

pp. 214-222

Author(s):

Adil Khan ◽

Feng Jiang ◽

Shaohui Liu ◽

Ibrahim Omara

Keyword(s):

Reinforcement Learning ◽

Video Game ◽

Visual Reinforcement

Download Full-text

Pseudo Random Number Generation through Reinforcement Learning and Recurrent Neural Networks

Algorithms ◽

10.3390/a13110307 ◽

2020 ◽

Vol 13 (11) ◽

pp. 307

Author(s):

Luca Pasqualini ◽

Maurizio Parton

Keyword(s):

Reinforcement Learning ◽

Random Number ◽

Short Term Memory ◽

Random Number Generator ◽

Random Number Generation ◽

Time Step ◽

Software Applications ◽

Pseudo Random Number ◽

Markov Decision ◽

Partially Observable

A Pseudo-Random Number Generator (PRNG) is any algorithm generating a sequence of numbers approximating properties of random numbers. These numbers are widely employed in mid-level cryptography and in software applications. Test suites are used to evaluate the quality of PRNGs by checking statistical properties of the generated sequences. These sequences are commonly represented bit by bit. This paper proposes a Reinforcement Learning (RL) approach to the task of generating PRNGs from scratch by learning a policy to solve a partially observable Markov Decision Process (MDP), where the full state is the period of the generated sequence, and the observation at each time-step is the last sequence of bits appended to such states. We use Long-Short Term Memory (LSTM) architecture to model the temporal relationship between observations at different time-steps by tasking the LSTM memory with the extraction of significant features of the hidden portion of the MDP’s states. We show that modeling a PRNG with a partially observable MDP and an LSTM architecture largely improves the results of the fully observable feedforward RL approach introduced in previous work.

Download Full-text

Adaptive Quantitative Trading: An Imitative Deep Reinforcement Learning Approach

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i02.5587 ◽

2020 ◽

Vol 34 (02) ◽

pp. 2128-2135

Author(s):

Yang Liu ◽

Qi Liu ◽

Hongke Zhao ◽

Zhen Pan ◽

Chuanren Liu

Keyword(s):

Reinforcement Learning ◽

Trading Strategies ◽

Financial Data ◽

Imitation Learning ◽

Market Condition ◽

Exploration And Exploitation ◽

Markov Decision ◽

Trading Model ◽

Trading Agent ◽

Partially Observable

In recent years, considerable efforts have been devoted to developing AI techniques for finance research and applications. For instance, AI techniques (e.g., machine learning) can help traders in quantitative trading (QT) by automating two tasks: market condition recognition and trading strategies execution. However, existing methods in QT face challenges such as representing noisy high-frequent financial data and finding the balance between exploration and exploitation of the trading agent with AI techniques. To address the challenges, we propose an adaptive trading model, namely iRDPG, to automatically develop QT strategies by an intelligent trading agent. Our model is enhanced by deep reinforcement learning (DRL) and imitation learning techniques. Specifically, considering the noisy financial data, we formulate the QT process as a Partially Observable Markov Decision Process (POMDP). Also, we introduce imitation learning to leverage classical trading strategies useful to balance between exploration and exploitation. For better simulation, we train our trading agent in the real financial market using minute-frequent data. Experimental results demonstrate that our model can extract robust market features and be adaptive in different markets.

Download Full-text

Differential contributions of worry, anxiety, and obsessive compulsive symptoms to ERN amplitudes in response monitoring and reinforcement learning tasks

Neuropsychologia ◽

10.1016/j.neuropsychologia.2014.06.023 ◽

2014 ◽

Vol 61 ◽

pp. 197-209 ◽

Cited By ~ 26

Author(s):

Laura Zambrano-Vazquez ◽

John J.B. Allen

Keyword(s):

Reinforcement Learning ◽

Response Monitoring ◽

Obsessive Compulsive ◽

Obsessive Compulsive Symptoms ◽

Learning Tasks

Download Full-text

A Policy Search Method For Temporal Logic Specified Reinforcement Learning Tasks

2018 Annual American Control Conference (ACC) ◽

10.23919/acc.2018.8431181 ◽

2018 ◽

Cited By ~ 9

Author(s):

Xiao Li ◽

Yao Ma ◽

Calin Belta

Keyword(s):

Reinforcement Learning ◽

Temporal Logic ◽

Search Method ◽

Policy Search ◽

Learning Tasks

Download Full-text

Abstraction in Model Based Partially Observable Reinforcement Learning Using Extended Sequence Trees

2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology ◽

10.1109/wi-iat.2012.161 ◽

2012 ◽

Cited By ~ 1

Author(s):

Erkin Cilden ◽

Faruk Polat

Keyword(s):

Reinforcement Learning ◽

Model Based ◽

Extended Sequence ◽

Partially Observable

Download Full-text