value iteration
Recently Published Documents

2022 · Vol 121 · pp. 105042
Yi Jiang, Weinan Gao, Jing Na, Di Zhang, Timo T. Hämäläinen

Arunselvan Ramaswamy, Shalabh Bhatnagar

In this paper, we consider the stochastic iterative counterpart of the value iteration scheme in which only noisy and possibly biased approximations of the Bellman operator are available. We call this counterpart the approximate value iteration (AVI) scheme. Neural networks are often used as function approximators to counter Bellman's curse of dimensionality; here, they are used to approximate the Bellman operator. Because neural networks are typically trained on sample data, errors and biases may be introduced. The design of AVI accounts for implementations with biased approximations of the Bellman operator and for sampling errors. We present verifiable sufficient conditions under which AVI is stable (almost surely bounded) and converges to a fixed point of the approximate Bellman operator. To ensure the stability of AVI, we present three different yet related sets of sufficient conditions, each based on the existence of an appropriate Lyapunov function. These Lyapunov function–based conditions are easily verifiable and new to the literature. Verifiability is further enhanced because we also provide a recipe for constructing the necessary Lyapunov function. We also show that the stability analysis of AVI extends readily to the general case of set-valued stochastic approximations. Finally, we show that AVI can be used in more general circumstances, namely for finding fixed points of contractive set-valued maps.
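
To make the AVI idea concrete, here is a minimal sketch that replaces the neural-network approximation with a small tabular MDP. Each step applies the exact Bellman optimality operator, adds Gaussian noise and a constant bias (synthetic stand-ins for sampling error and approximation bias), and averages the result into the iterate with a diminishing step size. The MDP, the step-size schedule `1/(n+1)**0.6`, the noise model, and all function names are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def bellman(v, P, R, gamma):
    """Exact Bellman optimality operator for a finite MDP.

    P has shape (A, S, S) with P[a, s, s'] the transition probability;
    R has shape (A, S) with R[a, s] the expected one-step reward.
    """
    return np.max(R + gamma * (P @ v), axis=0)

def value_iteration(P, R, gamma, tol=1e-10):
    """Classical value iteration, used here only as a reference solution."""
    v = np.zeros(P.shape[1])
    while True:
        v_new = bellman(v, P, R, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def approximate_value_iteration(P, R, gamma, n_iter=20000, noise_std=0.1, bias=0.0, seed=0):
    """Stochastic-approximation form of value iteration with a noisy, biased operator.

    The additive noise and constant bias model the errors of a learned
    Bellman operator (a hypothetical error model chosen for this sketch).
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(P.shape[1])
    for n in range(n_iter):
        # Diminishing steps: their sum diverges while the sum of squares converges.
        step = 1.0 / (n + 1) ** 0.6
        noisy_t = bellman(v, P, R, gamma) + bias + rng.normal(0.0, noise_std, size=v.shape)
        v = v + step * (noisy_t - v)
    return v

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.8

v_star = value_iteration(P, R, gamma)
v_avi = approximate_value_iteration(P, R, gamma)
v_biased = approximate_value_iteration(P, R, gamma, bias=0.05)
print(v_star, v_avi, v_biased)
```

With `bias=0.0` the iterates settle near the exact fixed point. With a constant bias `b` they settle near the fixed point of the biased operator, which for this additive bias is the exact fixed point shifted by `b / (1 - gamma)`; in a toy setting, this mirrors convergence to a fixed point of the approximate Bellman operator rather than of the true one.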

Sensors · 2021 · Vol 21 (24) · pp. 8418
Xiang Jin, Wei Lan, Tianlin Wang, Pengyao Yu

Path planning technology is important for planetary rovers that perform exploration missions in unfamiliar environments. In this work, we propose a novel global path planning algorithm based on the value iteration network (VIN). The VIN embeds a differentiable planning module built on the value iteration (VI) algorithm and has emerged as an effective method for learning to plan. Despite its ability to learn environment dynamics and perform long-range reasoning, the VIN suffers from several limitations, including sensitivity to initialization and poor performance in large-scale domains. We introduce the double value iteration network (dVIN), which decouples action selection from value estimation in the VI module by using the weighted double estimator method to approximate the maximum expected value, instead of maximizing over the estimated action values. We have also devised a simple yet effective two-stage training strategy for VI-based models to address their high computational cost and poor performance in large domains. We evaluate the dVIN on planning problems in grid-world domains and on realistic datasets generated from terrain images of a moon landscape. We show that our dVIN empirically outperforms the baseline methods and generalizes better to large-scale environments.
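
The difference between maximizing over one set of estimates and a weighted double estimator can be sketched in a few lines. This sketch assumes two independent sets of action-value estimates, `q_u` and `q_v`, and borrows the weighting rule from weighted double Q-learning: the action `a*` is selected with `q_u`, and the weight is `beta = |q_v[a*] - q_v[a_low]| / (c + |q_v[a*] - q_v[a_low]|)`, where `a_low` minimizes `q_u`. Whether dVIN uses exactly this rule and constant `c` inside its VI module is an assumption, and the function names are hypothetical.

```python
import numpy as np

def single_estimator(q_u):
    """Maximize directly over one set of estimates; biased upward under noise."""
    return float(np.max(q_u))

def double_estimator(q_u, q_v):
    """Select the action with q_u but evaluate it with the independent q_v."""
    a_star = int(np.argmax(q_u))
    return float(q_v[a_star])

def weighted_double_estimator(q_u, q_v, c=1.0):
    """Blend of the single and double estimators.

    The weighting rule is assumed from weighted double Q-learning;
    c is a hypothetical non-negative tuning constant.
    """
    a_star = int(np.argmax(q_u))  # action selection uses q_u
    a_low = int(np.argmin(q_u))
    gap = abs(float(q_v[a_star]) - float(q_v[a_low]))
    # Guard the degenerate case c == 0 and gap == 0 by falling back to q_u.
    beta = gap / (c + gap) if c + gap > 0 else 1.0
    # Value estimation mixes q_u and q_v at the selected action.
    return float(beta * q_u[a_star] + (1.0 - beta) * q_v[a_star])

q_u = np.array([1.0, 3.0, 2.0])
q_v = np.array([1.5, 2.0, 2.5])
print(single_estimator(q_u))                            # 3.0
print(double_estimator(q_u, q_v))                       # 2.0
print(round(weighted_double_estimator(q_u, q_v), 3))    # 2.333
```

Setting `c = 0` recovers the plain maximum whenever the gap is nonzero, while a very large `c` approaches the double estimator, so `c` trades the upward bias of the single estimator against the downward bias of the double estimator.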

Yang Liu, Zhanpeng Jiang, Lichao Hao, Zuoxia Xing, Mingyang Chen

Nicole Bäuerle, Alexander Glauner

We consider robust Markov decision processes with Borel state and action spaces, unbounded cost, and a finite time horizon. Our formulation leads to a Stackelberg game against nature. Under integrability, continuity, and compactness assumptions, we derive a robust cost iteration for a fixed policy of the decision maker and a value iteration for the robust optimization problem. Moreover, we show the existence of deterministic optimal policies for both players, in contrast to classical zero-sum games. When the state space is the real line, we show under some convexity assumptions that the supremum and infimum can be interchanged with the help of Sion's minimax theorem. Further, we consider the problem with special ambiguity sets. In particular, we derive some cases in which the robust optimization problem coincides with the minimization of a coherent risk measure. In the final section, we discuss two applications: a robust linear-quadratic problem and a robust problem for managing regenerative energy.
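
A finite-horizon robust value iteration can be sketched for finite state and action spaces, a much simpler setting than the Borel spaces above. This sketch assumes the ambiguity set is a finite list of candidate transition kernels and that nature may choose the worst candidate separately for each state-action pair (a rectangularity assumption). The decision maker moves first by choosing a cost-minimizing action, and nature responds by maximizing the expected cost-to-go. All names and numbers are hypothetical.

```python
import numpy as np

def robust_value_iteration(costs, kernels, terminal_cost, horizon):
    """Finite-horizon robust value iteration (cost minimization against nature).

    costs:         array (A, S), one-stage cost stored as costs[a, s]
    kernels:       list of arrays (A, S, S), a finite ambiguity set (hypothetical)
    terminal_cost: array (S,)
    Returns value functions [V_0, ..., V_N] and policies [pi_0, ..., pi_{N-1}].
    """
    v = terminal_cost.astype(float)
    values, policies = [v], []
    for _ in range(horizon):
        # Expected cost-to-go under each candidate kernel: shape (K, A, S).
        cont = np.stack([P @ v for P in kernels])
        # Nature responds to the chosen action with the worst kernel row per (a, s).
        worst = cont.max(axis=0)
        q = costs + worst
        # The decision maker moves first and picks a cost-minimizing action
        # (a deterministic policy).
        policies.append(q.argmin(axis=0))
        v = q.min(axis=0)
        values.append(v)
    values.reverse()
    policies.reverse()
    return values, policies

# Hypothetical 2-state, 2-action example with two candidate kernels.
costs = np.array([[1.0, 2.0],
                  [1.5, 0.5]])
P_nom = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.5, 0.5], [0.5, 0.5]]])
P_alt = np.array([[[0.6, 0.4], [0.4, 0.6]],
                  [[0.3, 0.7], [0.7, 0.3]]])
terminal = np.array([0.0, 3.0])

nominal_values, _ = robust_value_iteration(costs, [P_nom], terminal, horizon=3)
robust_values, robust_policies = robust_value_iteration(costs, [P_nom, P_alt], terminal, horizon=3)
print(nominal_values[0], robust_values[0], robust_policies[0])
```

With a single kernel the recursion reduces to ordinary finite-horizon value iteration, and enlarging the ambiguity set can only raise the worst-case cost, so `robust_values[0]` is never below `nominal_values[0]`.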
