Adaptive Thompson Sampling Stacks for Memory Bounded Open-Loop Planning

Author(s):  
Thomy Phan ◽  
Thomas Gabor ◽  
Robert Müller ◽  
Christoph Roch ◽  
Claudia Linnhoff-Popien

We propose Stable Yet Memory Bounded Open-Loop (SYMBOL) planning, a general memory-bounded approach to partially observable open-loop planning. SYMBOL maintains an adaptive stack of Thompson Sampling bandits, whose size is bounded by the planning horizon and is automatically adapted to the underlying domain without any prior domain knowledge beyond a generative model. We empirically test SYMBOL in four large POMDP benchmark problems to demonstrate its effectiveness and robustness w.r.t. the choice of hyperparameters, and we evaluate its adaptive memory consumption. We also compare its performance with other open-loop planning algorithms and POMCP.
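To make the stacked-bandit idea concrete, the following is a minimal Python sketch of open-loop planning with one Thompson Sampling bandit per depth. It illustrates the general mechanism only, not the authors' implementation: the Gaussian posterior model, the lazy stack-growth rule, the generative-model interface, and all names are assumptions.

import random
import math

class GaussianTSBandit:
    """One bandit per planning depth; tracks a Gaussian reward posterior per action."""
    def __init__(self, n_actions, prior_mean=0.0, prior_var=1.0):
        self.n = [0] * n_actions
        self.mean = [prior_mean] * n_actions
        self.var = [prior_var] * n_actions

    def sample_action(self):
        # Thompson Sampling: draw one value from each action's posterior,
        # then act greedily with respect to the draws.
        draws = [random.gauss(m, math.sqrt(v)) for m, v in zip(self.mean, self.var)]
        return max(range(len(draws)), key=lambda a: draws[a])

    def update(self, action, ret):
        # Incremental mean update with the observed return; the shrinking
        # posterior variance below is a simplifying assumption.
        self.n[action] += 1
        n = self.n[action]
        self.mean[action] += (ret - self.mean[action]) / n
        self.var[action] = 1.0 / n

def open_loop_plan(generative_model, initial_state, n_actions, horizon, n_sims):
    # generative_model(state, action) is assumed to return (next_state, reward, done).
    stack = []                                    # grows lazily, never beyond `horizon`
    for _ in range(n_sims):
        state, depth = initial_state, 0
        actions, rewards = [], []
        while depth < horizon:
            if depth == len(stack):               # adapt the stack size on demand
                stack.append(GaussianTSBandit(n_actions))
            a = stack[depth].sample_action()
            state, reward, done = generative_model(state, a)
            actions.append(a)
            rewards.append(reward)
            depth += 1
            if done:
                break
        # Update each depth's bandit with the return accumulated from that depth on.
        for d, a in enumerate(actions):
            stack[d].update(a, sum(rewards[d:]))
    # Recommend the greedy action of the root bandit.
    return max(range(n_actions), key=lambda a: stack[0].mean[a])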

Author(s):  
Thomy Phan ◽  
Lenz Belzner ◽  
Marie Kiermeier ◽  
Markus Friedrich ◽  
Kyrill Schmid ◽  
...  

State-of-the-art approaches to partially observable planning like POMCP are based on stochastic tree search. While these approaches are computationally efficient, they may still construct search trees of considerable size, which could limit the performance due to restricted memory resources. In this paper, we propose Partially Observable Stacked Thompson Sampling (POSTS), a memory-bounded approach to open-loop planning in large POMDPs, which optimizes a fixed-size stack of Thompson Sampling bandits. We empirically evaluate POSTS in four large benchmark problems and compare its performance with different tree-based approaches. We show that POSTS achieves competitive performance compared to tree-based open-loop planning and offers a performance-memory tradeoff, making it suitable for partially observable planning with highly restricted computational and memory resources.
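The performance-memory tradeoff can be illustrated with a back-of-the-envelope comparison, under the simplifying assumption that a tree-based planner may add one node per simulated step while a stacked-bandit planner only stores a fixed set of per-depth, per-action statistics; the numbers below are illustrative, not taken from the paper.

def tree_nodes_upper_bound(n_simulations, horizon):
    # Worst case for a search tree: one new node per simulated step.
    return n_simulations * horizon

def stacked_bandit_statistics(horizon, n_actions, stats_per_arm=3):
    # Fixed-size stack: per-depth, per-action statistics, independent of simulation count.
    return horizon * n_actions * stats_per_arm

print(tree_nodes_upper_bound(100_000, 50))    # 5,000,000 nodes
print(stacked_bandit_statistics(50, 4))       # 600 numbers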


Author(s):  
Jan Leike ◽  
Tor Lattimore ◽  
Laurent Orseau ◽  
Marcus Hutter

We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges in mean to the optimal value and (2) given a recoverability assumption, regret is sublinear. We conclude with a discussion about optimality in reinforcement learning.
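A toy sketch of the underlying algorithmic idea, posterior (Thompson) sampling over a countable class of candidate environments: sample an environment from the posterior, act optimally for the sample, and reweight the posterior with Bayes' rule. The two-armed Bernoulli environments below are purely illustrative and far simpler than the general environments analyzed in the paper.

import random

class BernoulliEnv:
    """Two-armed environment; the arm reward probabilities identify the environment."""
    def __init__(self, p):
        self.p = p                          # p[a] = P(reward = 1 | action a)
    def step(self, a):
        return 1 if random.random() < self.p[a] else 0
    def likelihood(self, a, r):
        return self.p[a] if r == 1 else 1.0 - self.p[a]
    def optimal_action(self):
        return max(range(len(self.p)), key=lambda a: self.p[a])

def thompson_sampling(true_env, env_class, steps):
    posterior = [1.0 / len(env_class)] * len(env_class)   # uniform prior
    total = 0
    for _ in range(steps):
        # Sample an environment index from the posterior, act optimally for the sample.
        idx = random.choices(range(len(env_class)), weights=posterior)[0]
        a = env_class[idx].optimal_action()
        r = true_env.step(a)
        total += r
        # Bayesian update of every candidate environment's weight with the observation.
        posterior = [w * env.likelihood(a, r) for w, env in zip(posterior, env_class)]
        z = sum(posterior)
        posterior = [w / z for w in posterior]
    return total

env_class = [BernoulliEnv([0.2, 0.8]), BernoulliEnv([0.8, 0.2])]
print(thompson_sampling(true_env=env_class[0], env_class=env_class, steps=1000))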


Author(s):  
Gayathri Rajendran ◽  
Uma Vijayasundaram

Robotics has become a rapidly emerging branch of science, addressing the needs of humankind by way of advanced techniques such as artificial intelligence (AI). This chapter gives a detailed explanation of the background knowledge required to implement software robots and an in-depth explanation of the different types of software robots across different applications, and it highlights some of the important contributions made in this field. Path planning algorithms are required to perform robot navigation efficiently. This chapter discusses several robot path planning algorithms which help in utilizing domain knowledge, avoiding possible obstacles, and successfully accomplishing tasks in less computational time. It also provides a case study on robot navigation data, explains the significance of machine learning algorithms in decision making, and discusses some of the potential simulators used in implementing software robots.
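As a generic illustration of the kind of path planning algorithm discussed (a textbook sketch, not tied to the chapter's case study), the snippet below runs A* on a small occupancy grid with obstacles; the grid layout, unit step costs, and Manhattan heuristic are illustrative assumptions.

import heapq

def astar(grid, start, goal):
    # grid: 0 = free cell, 1 = obstacle; 4-connected moves with unit cost.
    rows, cols = len(grid), len(grid[0])
    def h(cell):                                   # Manhattan-distance heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]     # (f, g, cell, path)
    best_g = {start: 0}
    while frontier:
        f, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = cell[0] + dr, cell[1] + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0:
                ng = g + 1
                if ng < best_g.get((r, c), float("inf")):
                    best_g[(r, c)] = ng
                    heapq.heappush(frontier, (ng + h((r, c)), ng, (r, c), path + [(r, c)]))
    return None                                    # no obstacle-free path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))   # path around the obstacles via the top-right corner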


Author(s):  
Erwan Lecarpentier ◽  
Guillaume Infantes ◽  
Charles Lesire ◽  
Emmanuel Rachelson

In the context of tree-search stochastic planning algorithms where a generative model is available, we consider on-line planning algorithms that build trees in order to recommend an action. We investigate the question of avoiding re-planning in subsequent decision steps by directly using sub-trees as action recommenders. First, we propose a method for open-loop control via a new algorithm that decides at each time step whether or not to re-plan, based on an analysis of the statistics of the sub-tree. Second, we show that the probability of selecting a suboptimal action at any depth of the tree can be upper bounded and converges towards zero; moreover, this upper bound decays logarithmically between subsequent depths. This leads to a distinction between node-wise optimality and state-wise optimality. Finally, we empirically demonstrate that our method achieves a compromise between loss of performance and computational gain.
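The re-planning decision can be sketched as follows. Note that the concrete test used here, visit-count and value-gap thresholds on the reused sub-tree, is an illustrative assumption and not the criterion proposed in the paper.

from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    mean_value: float = 0.0
    children: dict = field(default_factory=dict)    # action -> child Node

def should_replan(subtree, min_visits=200, min_gap=0.05):
    """Re-plan when the reused sub-tree's statistics look unreliable."""
    if subtree is None or subtree.visits < min_visits or len(subtree.children) < 2:
        return True
    values = sorted((c.mean_value for c in subtree.children.values()), reverse=True)
    return (values[0] - values[1]) < min_gap        # best and runner-up too close to call

def recommend(subtree):
    """Greedy recommendation from the sub-tree, plus the sub-tree to descend into."""
    action = max(subtree.children, key=lambda a: subtree.children[a].mean_value)
    return action, subtree.children[action]

# Usage: keep descending into sub-trees and only rebuild the tree when needed.
root = Node(visits=500, children={0: Node(200, 1.2), 1: Node(300, 0.4)})
if not should_replan(root):
    action, next_subtree = recommend(root)
    print(action)                                   # -> 0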


2018 ◽  
Vol 66 (6) ◽  
pp. 1586-1602 ◽  
Author(s):  
Kris Johnson Ferreira ◽  
David Simchi-Levi ◽  
He Wang

Thompson sampling is a randomized Bayesian machine learning method, whose original motivation was to sequentially evaluate treatments in clinical trials. In recent years, this method has drawn wide attention, as Internet companies have successfully implemented it for online ad display. In "Online network revenue management using Thompson sampling," K. Ferreira, D. Simchi-Levi, and H. Wang propose using Thompson sampling for a revenue management problem where the demand function is unknown. A main challenge in adopting Thompson sampling for revenue management is that the original method does not incorporate inventory constraints. However, the authors show that Thompson sampling can be naturally combined with a linear program formulation to include inventory constraints. The result is a dynamic pricing algorithm that incorporates domain knowledge and has strong theoretical performance guarantees as well as promising numerical performance results. Interestingly, the authors demonstrate that Thompson sampling achieves poor performance when it does not take into account domain knowledge. Finally, the proposed dynamic pricing algorithm is highly flexible and is applicable in a range of industries, from airlines and internet advertising all the way to online retailing.
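A minimal single-product sketch of this combination, with illustrative assumptions throughout (the price grid, Beta demand posteriors, simulated demand, and the exact LP are placeholders, not the authors' formulation): sample purchase probabilities from the posterior, solve a small linear program that respects the remaining inventory, then price according to its solution.

import numpy as np
from scipy.optimize import linprog

prices = np.array([29.0, 34.0, 39.0])     # candidate price points (assumption)
alpha = np.ones(len(prices))              # Beta(1, 1) posterior over the purchase
beta = np.ones(len(prices))               # probability at each price
inventory, periods = 100, 500

for t in range(periods):
    if inventory == 0:
        break
    # 1) Thompson step: sample one purchase probability per price from the posterior.
    demand = np.random.beta(alpha, beta)
    # 2) LP step: x_k = fraction of time to offer price k, maximizing sampled revenue
    #    under the average-inventory constraint (linprog minimizes, so negate).
    res = linprog(
        c=-(prices * demand),
        A_ub=np.vstack([demand, np.ones(len(prices))]),
        b_ub=[inventory / (periods - t), 1.0],
        bounds=[(0, 1)] * len(prices),
    )
    x = res.x
    # 3) Offer price k with probability x_k (offer nothing with the leftover mass).
    probs = np.clip(np.append(x, 1.0 - x.sum()), 0.0, None)
    k = np.random.choice(len(prices) + 1, p=probs / probs.sum())
    if k == len(prices):
        continue                                   # no offer this period
    # 4) Observe a (simulated) sale and update the chosen price's Beta posterior.
    sale = np.random.rand() < [0.5, 0.3, 0.15][k]  # stand-in for the unknown demand curve
    alpha[k] += sale
    beta[k] += 1 - sale
    inventory -= int(sale)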


Author(s):  
Majid Khonji ◽  
Ashkan Jasour ◽  
Brian Williams

Partially Observable Markov Decision Process (POMDP) is a fundamental framework for planning and decision making under uncertainty. POMDP is known to be intractable to solve or even approximate when the planning horizon is long (i.e., within a polynomial number of time steps). Constrained POMDP (C-POMDP) allows constraints to be specified on some aspects of the policy in addition to the objective function. When the constraints involve bounding the probability of failure, the problem is called Chance-Constrained POMDP (CC-POMDP). Our first contribution is a reduction from CC-POMDP to C-POMDP and a novel Integer Linear Programming (ILP) formulation. Thus, any algorithm for the latter problem can be utilized to solve any instance of the former. Second, we show that unlike POMDP, when the length of the planning horizon is constant, (C)C-POMDP is NP-Hard. Third, we present the first Fully Polynomial Time Approximation Scheme (FPTAS) that computes (near) optimal deterministic policies for constant-horizon (C)C-POMDP in polynomial time.
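For reference, a generic chance-constrained objective has the following shape, where the policy maximizes expected cumulative reward while keeping the probability of ever reaching a failure set \(\mathcal{F}\) below a risk bound \(\Delta\); the notation is illustrative and not the paper's exact formulation.

\max_{\pi} \; \mathbb{E}\!\left[\sum_{t=0}^{T-1} R(s_t, a_t) \,\middle|\, \pi\right]
\quad \text{subject to} \quad
\Pr\!\left(\exists\, t < T:\; s_t \in \mathcal{F} \,\middle|\, \pi\right) \le \Delta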

