Statistical inference of the value function for reinforcement learning in infinite-horizon settings

Author(s):  
Chengchun Shi ◽  
Sheng Zhang ◽  
Wenbin Lu ◽  
Rui Song
2005 ◽  
Vol 24 ◽  
pp. 81-108 ◽  
Author(s):  
P. Geibel ◽  
F. Wysotzki

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that also performs well with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The strength of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
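
A minimal sketch of how such a weighted, two-criteria update might look in a tabular setting. The environment API (`env.actions`, `env.step` returning an `is_error` flag), the additive combination `Q_value - xi * Q_risk`, and the simple threshold test for adapting the weight `xi` are illustrative assumptions paraphrasing the abstract, not the authors' exact algorithm.

```python
import random
from collections import defaultdict

def risk_weighted_q_learning(env, episodes=500, alpha=0.1, gamma=0.95,
                             omega=0.05, xi=1.0, xi_step=0.1, eps=0.1):
    """Heuristic sketch: learn one Q-table for return and one for risk
    (probability of entering an error state), and act greedily on their
    weighted combination Q_value - xi * Q_risk."""
    q_value = defaultdict(float)   # expected return (original criterion)
    q_risk = defaultdict(float)    # expected probability of reaching an error state

    def greedy(state):
        return max(env.actions(state),
                   key=lambda a: q_value[(state, a)] - xi * q_risk[(state, a)])

    for _ in range(episodes):
        state = start = env.reset()
        done = False
        while not done:
            action = (random.choice(env.actions(state))
                      if random.random() < eps else greedy(state))
            next_state, reward, done, is_error = env.step(action)  # hypothetical API
            if done:
                v_target, r_target = reward, float(is_error)
            else:
                a_next = greedy(next_state)
                v_target = reward + gamma * q_value[(next_state, a_next)]
                r_target = float(is_error) + q_risk[(next_state, a_next)]
            q_value[(state, action)] += alpha * (v_target - q_value[(state, action)])
            q_risk[(state, action)] += alpha * (r_target - q_risk[(state, action)])
            state = next_state
        # adapt the weight: raise it when the estimated start-state risk exceeds omega
        if q_risk[(start, greedy(start))] > omega:
            xi += xi_step
        else:
            xi = max(0.0, xi - xi_step)
    return q_value, q_risk, xi
```

The abstract's weight-adaptation rule is paraphrased here by a simple threshold test on the estimated start-state risk; the point is only the shape of the two-criteria update.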


Author(s):  
Zhuo Wang ◽  
Shiwei Zhang ◽  
Xiaoning Feng ◽  
Yancheng Sui

Environmental adaptability is a persistent problem for the path planning of autonomous underwater vehicles. Although reinforcement learning can improve environmental adaptability, multi-behavior coupling slows its convergence, which makes it difficult for an autonomous underwater vehicle to avoid moving obstacles. This article proposes a multi-behavior critic reinforcement learning algorithm for autonomous underwater vehicle path planning that overcomes the oscillating amplitudes and low learning efficiency in the early stages of training common to traditional actor–critic algorithms. The multi-behavior critic assesses the actions of the actor from perspectives such as energy saving and security, and combines these aspects into an overall evaluation of the actor. The policy gradient method is used for the actor and the value function method for the critic; both are approximated by backpropagation neural networks whose parameters are updated by gradient descent. Simulation results show that the method optimizes learning in the environment and improves learning efficiency, meeting the real-time and adaptability requirements of dynamic obstacle avoidance for autonomous underwater vehicles.
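
A compact sketch of the actor–critic structure described above, assuming a hypothetical environment that returns separate energy and safety signals which are folded into one critic evaluation via weights `w_energy` and `w_safety`. The network sizes, the weights, and the transition layout are illustrative assumptions, not the article's exact design.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps the AUV state to a distribution over discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    """Value network: estimates the combined multi-behavior return of a state."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))
    def forward(self, state):
        return self.net(state).squeeze(-1)

def train_step(actor, critic, opt_a, opt_c, transition,
               gamma=0.99, w_energy=0.3, w_safety=0.7):
    """One actor-critic update from a single transition
    (state, action, r_energy, r_safety, next_state, done) given as tensors."""
    state, action, r_energy, r_safety, next_state, done = transition
    # the critic evaluates the actor from several behaviors at once by
    # folding energy saving and security into one scalar reward
    reward = w_energy * r_energy + w_safety * r_safety
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * critic(next_state)
    value = critic(state)
    td_error = target - value

    critic_loss = td_error.pow(2)                  # value-function (critic) update
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    log_prob = actor(state).log_prob(action)
    actor_loss = -log_prob * td_error.detach()     # policy-gradient (actor) update
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```

The two optimizers (`opt_a`, `opt_c`) would typically be plain SGD instances, matching the gradient-descent updates described in the abstract.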


Author(s):  
Nicholay Topin ◽  
Manuela Veloso

Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, O(|F|² · |tr samples|). By applying our method to a family of domains, we show that our method scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains.
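
A rough sketch of how abstract states and their Markov-chain edges could be assembled from a learned value function and a batch of observed transitions. The quantile binning on state values used here is only a stand-in for the paper's feature-based abstraction, and all names are illustrative.

```python
from collections import defaultdict
import numpy as np

def build_policy_graph(transitions, value_fn, n_bins=10):
    """transitions: list of (state, next_state) pairs observed under the policy.
    Abstract states are formed by quantile-binning states on their value;
    edges carry empirical transition probabilities between abstract states."""
    states = [s for s, _ in transitions]
    values = np.array([value_fn(s) for s in states])
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

    def abstract(state):
        # index of the value bin the state falls into
        return int(np.clip(np.searchsorted(edges, value_fn(state)) - 1,
                           0, n_bins - 1))

    counts = defaultdict(lambda: defaultdict(int))
    for s, s_next in transitions:
        counts[abstract(s)][abstract(s_next)] += 1

    # normalize counts into a Markov chain over abstract states
    return {a: {b: c / sum(nbrs.values()) for b, c in nbrs.items()}
            for a, nbrs in counts.items()}
```

Each key of the returned dictionary is an abstract state and each inner dictionary gives its outgoing transition probabilities, which is the Markov-chain structure the abstract describes.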


Author(s):  
Atanu R Sinha ◽  
Deepali Jain ◽  
Nikhil Sheoran ◽  
Sopan Khosla ◽  
Reshmi Sasidharan

The ‘old world’ instrument, the survey, remains a tool of choice for firms to obtain ratings of the satisfaction and experience that customers realize while interacting online with firms. While avenues for surveys have evolved from emails and links to pop-ups while browsing, the deficiencies persist. These include reliance on the ratings of very few respondents to draw inferences about all customers’ online interactions; failure to capture a customer’s interactions over time, since the rating is a one-time snapshot; and the inability to tie customers’ ratings back to specific interactions, because the ratings provided relate to all interactions. To overcome these deficiencies, we extract proxy ratings from clickstream data, typically collected for every customer’s online interactions, by developing an approach based on Reinforcement Learning (RL). We introduce a new way to interpret the values generated by the value function of RL as proxy ratings. Our approach does not need any survey data for training. Yet, on validation against actual survey data, proxy ratings yield reasonable performance results. Additionally, we offer a new way to draw insights from the values of the value function, which allows associating specific interactions with their proxy ratings. We introduce two new metrics to represent ratings: one at the customer level and the other at the aggregate level for click actions across customers. Both are defined as the proportion of all pairwise, successive actions that show an increase in proxy rating. The intuitive customer-level metric enables gauging the dynamics of ratings over time and is a better predictor of purchase than customer ratings from surveys. The aggregate-level metric allows pinpointing actions that help or hurt the experience. In sum, proxy ratings computed unobtrusively from the clickstream, for every action, for each customer, and for every session, can offer an interpretable and more insightful alternative to surveys.
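
For the customer-level metric, a minimal sketch of the "proportion of increasing successive pairs" computation, assuming `proxy_ratings` holds the value-function outputs for one customer's click actions in chronological order (the function name and example values are illustrative).

```python
def increase_proportion(proxy_ratings):
    """Fraction of successive action pairs whose proxy rating increases.

    proxy_ratings: chronologically ordered values of the RL value function
    for one customer's click actions within a session.
    """
    pairs = list(zip(proxy_ratings, proxy_ratings[1:]))
    if not pairs:
        return 0.0
    return sum(later > earlier for earlier, later in pairs) / len(pairs)

# example: a session whose proxy ratings mostly improve over time
print(increase_proportion([0.2, 0.35, 0.3, 0.5, 0.6]))  # 0.75
```

The aggregate-level metric described in the abstract would apply the same pairwise comparison to a given click action pooled across customers rather than to one customer's session.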


2018 ◽  
Vol 24 (2) ◽  
pp. 873-899 ◽  
Author(s):  
Mingshang Hu ◽  
Falei Wang

The present paper considers a stochastic optimal control problem in which the cost function is defined through a backward stochastic differential equation with infinite horizon driven by G-Brownian motion. We then study the regularity of the value function and establish the dynamic programming principle. Moreover, we prove that the value function is the unique viscosity solution of the related Hamilton-Jacobi-Bellman-Isaacs (HJBI) equation.
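
For orientation only, a generic stationary HJBI equation of the kind referred to above can be written as follows; the concrete generator, the control and uncertainty sets, and the G-expectation structure of the paper are not reproduced here, so this is an illustrative template rather than the paper's equation.

```latex
% Illustrative stationary HJBI template for a value function V(x);
% b, sigma, f and the sets U, Theta stand in for the paper's data.
\begin{equation*}
  \sup_{u \in U} \inf_{v \in \Theta}
  \Big\{ b(x,u,v) \cdot \nabla V(x)
  + \tfrac{1}{2}\operatorname{tr}\!\big(\sigma\sigma^{\top}(x,u,v)\,\nabla^{2} V(x)\big)
  + f\big(x,u,v,V(x),\sigma^{\top}(x,u,v)\nabla V(x)\big) \Big\} = 0 .
\end{equation*}
```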


Author(s):  
Ling Pan ◽  
Qingpeng Cai ◽  
Qi Meng ◽  
Wei Chen ◽  
Longbo Huang

Value function estimation, i.e., prediction, is an important task in reinforcement learning. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, so its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with the dynamic Boltzmann softmax (DBS) operator, which has good convergence properties in the settings of planning and learning. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function, rectifying the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying the DBS operator, which outperforms DQN substantially in 40 out of 49 Atari games.
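
A small sketch of tabular value iteration with a dynamic Boltzmann softmax backup. The increasing schedule `beta = t**2` and the tabular `P`/`R` layout are illustrative assumptions; the point is that the backup operator approaches the max operator as the inverse temperature grows.

```python
import numpy as np

def boltzmann_softmax(q_values, beta):
    """Boltzmann softmax operator: a weighted average of Q-values with
    weights softmax(beta * Q); it approaches max(Q) as beta grows."""
    w = np.exp(beta * (q_values - q_values.max()))  # subtract max for numerical stability
    w /= w.sum()
    return float(np.dot(w, q_values))

def dbs_value_iteration(P, R, gamma=0.99, iters=200):
    """P[a][s][s'] are transition probabilities, R[s][a] rewards (tabular MDP).
    Uses an increasing inverse temperature so the Boltzmann backup gradually
    behaves like the usual max backup of value iteration."""
    n_actions, n_states = len(P), len(P[0])
    V = np.zeros(n_states)
    for t in range(1, iters + 1):
        beta = float(t) ** 2        # dynamic (increasing) temperature schedule
        Q = np.array([[R[s][a] + gamma * np.dot(P[a][s], V)
                       for a in range(n_actions)] for s in range(n_states)])
        V = np.array([boltzmann_softmax(Q[s], beta) for s in range(n_states)])
    return V
```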

