Automatic landmark discovery for learning agents under partial observability

2019 ◽  
Vol 34 ◽  
Author(s):  
Alper Demir ◽  
Erkin Çilden ◽  
Faruk Polat

Abstract In the reinforcement learning context, a landmark is a compact piece of information that uniquely identifies a state in problems with hidden states. Landmarks have been shown to support finding good memoryless policies for Partially Observable Markov Decision Processes (POMDPs) that contain at least one landmark. SarsaLandmark, an adaptation of Sarsa(λ), promises better learning performance under the assumption that all landmarks of the problem are known in advance. In this paper, we propose a framework built upon SarsaLandmark that automatically identifies landmarks during learning, without sacrificing solution quality and without requiring any prior information about the problem structure. For this purpose, the framework fuses SarsaLandmark with a well-known multiple-instance learning algorithm, Diverse Density (DD). Through further experimentation, we also provide deeper insight into our concept filtering heuristic for accelerating DD, abbreviated DDCF (Diverse Density with Concept Filtering), which proves to be well suited to POMDPs with landmarks. DDCF outperforms its antecedent in both computation speed and solution quality without loss of generality. The methods are empirically shown to be effective through extensive experiments on a number of known and newly introduced problems with hidden state, and the results are discussed.
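
As an illustration of the multiple-instance learning ingredient mentioned above, the following is a minimal Python sketch of the Diverse Density score. Here successful trajectories are treated as positive bags of observation vectors and failed ones as negative bags; this bag construction and the feature encoding are illustrative assumptions, not the authors' exact setup.

import numpy as np

def instance_prob(concept, instance):
    # Pr(concept | instance): Gaussian-like kernel on the squared distance.
    diff = np.asarray(instance, dtype=float) - np.asarray(concept, dtype=float)
    return np.exp(-np.sum(diff ** 2))

def positive_bag_prob(concept, bag):
    # Noisy-OR: a positive bag supports the concept if at least one instance is close to it.
    return 1.0 - np.prod([1.0 - instance_prob(concept, x) for x in bag])

def negative_bag_prob(concept, bag):
    # A negative bag supports the concept only if none of its instances is close to it.
    return np.prod([1.0 - instance_prob(concept, x) for x in bag])

def diverse_density(concept, positive_bags, negative_bags):
    # DD(t) = prod_i Pr(t | B_i+) * prod_j Pr(t | B_j-)
    pos = np.prod([positive_bag_prob(concept, b) for b in positive_bags])
    neg = np.prod([negative_bag_prob(concept, b) for b in negative_bags])
    return pos * neg

# Toy example: [1, 0] occurs in every successful trajectory and in no failed
# one, so it gets a high DD score and becomes a landmark candidate.
successful = [[[1, 0], [0, 3]], [[1, 0], [5, 5]]]
failed = [[[4, 4], [0, 3]]]
print(diverse_density([1, 0], successful, failed))   # close to 1
print(diverse_density([0, 3], successful, failed))   # much lower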

2020 ◽  
Vol 13 (4) ◽  
pp. 78
Author(s):  
Nico Zengeler ◽  
Uwe Handmann

We present a deep reinforcement learning framework for the automated trading of contracts for difference (CfDs) on indices at high frequency. Our contribution shows that reinforcement learning agents with recurrent long short-term memory (LSTM) networks can learn from recent market history and outperform the market. Such approaches usually depend on low latency; in a real-world example, we show that an increased model size can compensate for higher latency. Because the noisy nature of economic trends complicates prediction, especially for speculative assets, our approach does not forecast prices but instead uses a reinforcement learning agent to learn an overall profitable trading policy. To this end, we simulate a virtual market environment based on historical trading data. The environment presents a partially observable Markov decision process (POMDP) to reinforcement learners and allows various strategies to be trained.
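
For concreteness, below is a minimal PyTorch sketch of the kind of recurrent agent described above: an LSTM consumes a window of recent market observations and outputs action values for a small trading action set. The feature dimension, layer sizes, and the flat/long/short action set are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class RecurrentTradingAgent(nn.Module):
    def __init__(self, n_features=8, hidden_size=64, n_actions=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_actions)   # one value per action

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, n_features), a window of recent market features.
        out, hidden = self.lstm(obs_seq, hidden)
        q_values = self.head(out[:, -1])                 # act on the latest step
        return q_values, hidden

agent = RecurrentTradingAgent()
window = torch.randn(1, 32, 8)       # the 32 most recent observations
q_values, hidden = agent(window)
action = q_values.argmax(dim=-1)     # 0 = flat, 1 = long, 2 = short (illustrative)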


2015 ◽  
Vol 25 (3) ◽  
pp. 597-615 ◽  
Author(s):  
Hideaki Itoh ◽  
Hisao Fukumoto ◽  
Hiroshi Wakuya ◽  
Tatsuya Furukawa

Abstract The theory of partially observable Markov decision processes (POMDPs) is a useful tool for developing various intelligent agents, and learning hierarchical POMDP models is one of the key approaches for building such agents when their environments are unknown and large. To learn hierarchical models, bottom-up learning methods, in which learning takes place in a layer-by-layer manner from the lowest to the highest layer, are already widely used in research fields such as hidden Markov models and neural networks. However, little attention has been paid to bottom-up approaches for learning POMDP models. In this paper, we present a novel bottom-up learning algorithm for hierarchical POMDP models and prove that, by using this algorithm, a perfect model (i.e., a model that can perfectly predict future observations) can be learned, at least in a class of deterministic POMDP environments.
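
The layer-by-layer idea can be summarised with a short control-flow sketch. The functions fit_layer_model and abstract_sequence below are hypothetical placeholders standing in for whatever per-layer learner and abstraction step a concrete algorithm would use; they are not the procedure proved correct in the paper.

def learn_hierarchy_bottom_up(observations, n_layers, fit_layer_model, abstract_sequence):
    # Learn layer 0 from raw observations, then feed each layer's abstracted
    # output sequence upward as the training data of the next layer.
    layers = []
    sequence = observations
    for level in range(n_layers):
        model = fit_layer_model(sequence, level)          # train this layer only
        layers.append(model)
        sequence = abstract_sequence(model, sequence)     # input for the layer above
    return layers

# Toy usage with trivial placeholders: each "layer" records the symbols it
# saw, and the abstraction simply keeps every other element.
layers = learn_hierarchy_bottom_up(
    list("abbaab"),
    n_layers=2,
    fit_layer_model=lambda seq, lvl: {"level": lvl, "alphabet": sorted(set(seq))},
    abstract_sequence=lambda model, seq: seq[::2],
)
print(layers)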


2009 ◽  
Vol 18 (08) ◽  
pp. 1517-1531 ◽  
Author(s):  
Takashi Kuremoto ◽  
Yuki Yamano ◽  
Masanao Obayashi ◽  
Kunikazu Kobayashi

To form a swarm and acquire swarm behaviors adaptive to the environment, we recently proposed a neuro-fuzzy learning system as a common internal model for each individual. The proposed swarm behavior learning system performed efficiently in simulation experiments on goal-exploration problems. However, the input information observed from the environment in our conventional methods was given in coordinate spaces (discrete or continuous), which are difficult for individuals to obtain in the real world. This paper improves our previous neuro-fuzzy learning system to deal with locally limited observation, i.e., typically a Partially Observable Markov Decision Process (POMDP), by adding eligibility traces and a balanced trade-off between exploration and exploitation to the conventional learning algorithm. Simulations of goal-oriented swarm learning problems were executed, and the results showed the effectiveness of the improved learning system.
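
The two additions mentioned above, eligibility traces and an explicit exploration-exploitation balance, are standard ingredients of Sarsa(λ); the following tabular sketch shows them in isolation, without the paper's neuro-fuzzy function approximation. The env object with reset() and step() methods is an assumed minimal environment interface.

import random
from collections import defaultdict

def epsilon_greedy(Q, obs, actions, epsilon):
    if random.random() < epsilon:                      # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(obs, a)])     # exploit

def sarsa_lambda_episode(env, Q, actions, alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
    E = defaultdict(float)                             # eligibility traces
    obs = env.reset()
    action = epsilon_greedy(Q, obs, actions, epsilon)
    done = False
    while not done:
        next_obs, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_obs, actions, epsilon)
        # TD error with the next on-policy action (Sarsa target).
        delta = reward + gamma * Q[(next_obs, next_action)] * (not done) - Q[(obs, action)]
        E[(obs, action)] += 1.0                        # accumulating trace
        for key in list(E):
            Q[key] += alpha * delta * E[key]           # credit recently visited pairs
            E[key] *= gamma * lam                      # decay all traces
        obs, action = next_obs, next_action
    return Q

# Q can be initialised as defaultdict(float) and refined over many episodes.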


2021 ◽  
Vol 12 (1) ◽  
pp. 71-86
Author(s):  
Marcus Hutter

Abstract The Feature Markov Decision Processes (ΦMDPs) model developed in Part I (Hutter, 2009b) is well-suited for learning agents in general environments. Nevertheless, unstructured (Φ)MDPs are limited to relatively simple environments. Structured MDPs like Dynamic Bayesian Networks (DBNs) are used for large-scale real-world problems. In this article I extend ΦMDP to ΦDBN. The primary contribution is to derive a cost criterion that allows the most relevant features to be extracted automatically from the environment, leading to the “best” DBN representation. I discuss all building blocks required for a complete general learning algorithm, and compare the novel ΦDBN model to the prevalent POMDP approach.
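
To make the idea of a cost criterion concrete, here is a generic, MDL-flavoured selection loop: every candidate feature map Φ turns raw observations into states, and the map with the lowest total cost (reward-coding cost plus a model-size penalty) is kept. The cost function below is an illustrative placeholder in that spirit, not the criterion derived in the article.

import math
from collections import Counter, defaultdict

def empirical_code_length(symbols):
    # Shannon code length (bits) of a sequence under its empirical distribution.
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def conditional_code_length(states, rewards):
    # Code the rewards with a separate empirical distribution per state.
    groups = defaultdict(list)
    for s, r in zip(states, rewards):
        groups[s].append(r)
    return sum(empirical_code_length(g) for g in groups.values())

def cost(phi, observations, rewards):
    states = [phi(o) for o in observations]
    complexity = len(set(states)) * math.log2(len(observations))  # crude size penalty
    return conditional_code_length(states, rewards) + complexity

def best_feature_map(candidates, observations, rewards):
    return min(candidates, key=lambda phi: cost(phi, observations, rewards))

# Toy usage: observations are (noise, signal) pairs and the reward depends
# only on the second component, so the map keeping that component wins.
observations = [(i % 7, i % 2) for i in range(200)]
rewards = [o[1] for o in observations]
phi_noise, phi_signal = (lambda o: o[0]), (lambda o: o[1])
print(best_feature_map([phi_noise, phi_signal], observations, rewards) is phi_signal)  # True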


Author(s):  
Peixi Peng ◽  
Junliang Xing ◽  
Lili Cao ◽  
Lisen Mu ◽  
Chang Huang

The task in a real-time combat game is to coordinate multiple units to defeat enemies controlled by a given opponent in a real-time combat scenario. It is difficult to design a high-level Artificial Intelligence (AI) program for such a task because of its extremely large state-action space and real-time requirements. This paper formulates the task as a collective decentralized partially observable Markov decision process and designs a Deep Decentralized Policy Network (DDPN) to model the policies. To train DDPN effectively, a novel two-stage learning algorithm is proposed that combines imitation learning from the opponent with reinforcement learning by no-regret dynamics. Extensive experimental results on various combat scenarios indicate that the proposed method can defeat different opponent models and significantly outperforms many state-of-the-art approaches.
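
A control-flow sketch of such a two-stage scheme is given below in PyTorch: the policy network is first trained by imitation on opponent demonstrations and then fine-tuned by reinforcement learning. The network shape is illustrative, and a plain REINFORCE-style update stands in for the paper's no-regret dynamics.

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))  # obs -> action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def imitation_step(obs_batch, opponent_actions):
    # Stage 1: supervised cross-entropy towards the opponent's recorded actions.
    loss = nn.functional.cross_entropy(policy(obs_batch), opponent_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def reinforce_step(obs_batch, taken_actions, returns):
    # Stage 2: raise the log-probability of actions in proportion to their return.
    log_probs = torch.log_softmax(policy(obs_batch), dim=-1)
    chosen = log_probs.gather(1, taken_actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for real game data.
obs = torch.randn(32, 16)
imitation_step(obs, torch.randint(0, 4, (32,)))
reinforce_step(obs, torch.randint(0, 4, (32,)), torch.randn(32))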


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Koosha Khalvati ◽  
Roozbeh Kiani ◽  
Rajesh P. N. Rao

Abstract In perceptual decisions, subjects infer hidden states of the environment based on noisy sensory information. Here we show that both choice and its associated confidence are explained by a Bayesian framework based on partially observable Markov decision processes (POMDPs). We test our model on monkeys performing a direction-discrimination task with post-decision wagering, demonstrating that the model explains objective accuracy and predicts subjective confidence. Further, we show that the model replicates well-known discrepancies of confidence and accuracy, including the hard-easy effect, opposing effects of stimulus variability on confidence and accuracy, dependence of confidence ratings on simultaneous or sequential reports of choice and confidence, apparent difference between choice and confidence sensitivity, and seemingly disproportionate influence of choice-congruent evidence on confidence. These effects may not be signatures of sub-optimal inference or discrepant computational processes for choice and confidence. Rather, they arise in Bayesian inference with incomplete knowledge of the environment.
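
The Bayesian ingredient of such POMDP models can be sketched in a few lines: a belief over the hidden state (e.g., motion direction) is updated with each noisy observation, the choice is the state with the highest posterior, and confidence is read out as that posterior probability. The Gaussian evidence model and its parameters below are illustrative assumptions, not the fitted model from the paper.

import numpy as np

def update_belief(belief, observation, means, sigma=1.0):
    # Bayes rule: posterior is proportional to likelihood(observation | state) * prior.
    likelihood = np.exp(-(observation - means) ** 2 / (2 * sigma ** 2))
    posterior = likelihood * belief
    return posterior / posterior.sum()

means = np.array([-0.5, 0.5])        # mean momentary evidence for "left" vs "right"
belief = np.array([0.5, 0.5])        # uniform prior over the two directions
rng = np.random.default_rng(0)
for _ in range(20):                  # a stream of noisy momentary evidence
    sample = rng.normal(means[1], 1.0)            # the true state is "right"
    belief = update_belief(belief, sample, means)

choice = ["left", "right"][int(np.argmax(belief))]
confidence = float(belief.max())     # confidence = posterior of the chosen option
print(choice, round(confidence, 3))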


2020 ◽  
Author(s):  
Koosha Khalvati ◽  
Roozbeh Kiani ◽  
Rajesh P. N. Rao

Abstract In perceptual decisions, subjects infer hidden states of the environment based on noisy sensory information. Here we show that both choice and its associated confidence are explained by a Bayesian framework based on partially observable Markov decision processes (POMDPs). We test our model on monkeys performing a direction-discrimination task with post-decision wagering, demonstrating that the model explains objective accuracy and predicts subjective confidence. Further, we show that the model replicates well-known discrepancies of confidence and accuracy, including the hard-easy effect, opposing effects of stimulus volatility on confidence and accuracy, dependence of confidence ratings on simultaneous or sequential reports of choice and confidence, apparent difference between choice and confidence sensitivity, and seemingly disproportionate influence of choice-congruent evidence on confidence. These effects may not be signatures of sub-optimal inference or discrepant computational processes for choice and confidence. Rather, they arise in Bayesian inference with incomplete knowledge of the environment.

