Risk-Sensitive Reinforcement Learning Applied to Control under Constraints

Journal of Artificial Intelligence Research ◽

10.1613/jair.1666 ◽

2005 ◽

Vol 24 ◽

pp. 81-108 ◽

Cited By ~ 65

Author(s):

P. Geibel ◽

F. Wysotzki

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Optimal Solution ◽

Feed Tank ◽

Model Free ◽

Constrained Problem ◽

Risk Sensitive ◽

Markov Decision ◽

The Value Function

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.

Download Full-text

Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

Journal of Artificial Intelligence Research ◽

10.1613/jair.639 ◽

2000 ◽

Vol 13 ◽

pp. 227-303 ◽

Cited By ~ 389

Author(s):

T. G. Dietterich

Keyword(s):

Reinforcement Learning ◽

Optimal Policy ◽

Value Function ◽

Learning Algorithm ◽

Value Functions ◽

Procedural Semantics ◽

Hierarchical Reinforcement Learning ◽

Model Free ◽

Function Decomposition ◽

The Value Function

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics---as a subroutine hierarchy---and a declarative semantics---as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.

Download Full-text

Solving flow-shop scheduling problem with a reinforcement learning algorithm that generalizes the value function with neural network

Alexandria Engineering Journal ◽

10.1016/j.aej.2021.01.030 ◽

2021 ◽

Vol 60 (3) ◽

pp. 2787-2800

Author(s):

Jianfeng Ren ◽

Chunming Ye ◽

Feng Yang

Keyword(s):

Neural Network ◽

Reinforcement Learning ◽

Value Function ◽

Flow Shop ◽

Learning Algorithm ◽

Flow Shop Scheduling ◽

Scheduling Problem ◽

Shop Scheduling ◽

The Value Function ◽

Reinforcement Learning Algorithm

Download Full-text

A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms

Neural Computation ◽

10.1162/089976699300016070 ◽

1999 ◽

Vol 11 (8) ◽

pp. 2017-2060 ◽

Cited By ~ 70

Author(s):

Csaba Szepesvári ◽

Michael L. Littman

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Learning Algorithms ◽

Sequential Decision ◽

Q Learning ◽

Markov Games ◽

Optimal Behavior ◽

Risk Sensitive ◽

Optimal Value

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

Download Full-text

Reinforcement Learning for Optimizing Driving Policies on Cruising Taxis Services

Sustainability ◽

10.3390/su12218883 ◽

2020 ◽

Vol 12 (21) ◽

pp. 8883

Author(s):

Kun Jin ◽

Wei Wang ◽

Xuedong Hua ◽

Wei Zhou

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

State Action ◽

Future Reward ◽

Long Run ◽

Markov Decision ◽

Action Value ◽

Data Expansion ◽

Taking Action ◽

The Value Function

As the key element of urban transportation, taxis services significantly provide convenience and comfort for residents’ travel. However, the reality has not shown much efficiency. Previous researchers mainly aimed to optimize policies by order dispatch on ride-hailing services, which cannot be applied in cruising taxis services. This paper developed the reinforcement learning (RL) framework to optimize driving policies on cruising taxis services. Firstly, we formulated the drivers’ behaviours as the Markov decision process (MDP) progress, considering the influences after taking action in the long run. The RL framework using dynamic programming and data expansion was employed to calculate the state-action value function. Following the value function, drivers can determine the best choice and then quantify the expected future reward at a particular state. By utilizing historic orders data in Chengdu, we analysed the function value’s spatial distribution and demonstrated how the model could optimize the driving policies. Finally, the realistic simulation of the on-demand platform was built. Compared with other benchmark methods, the results verified that the new model performs better in increasing total revenue, answer rate and decreasing waiting time, with the relative percentages of 4.8%, 6.2% and −27.27% at most.

Download Full-text

Robust Reinforcement Learning

Neural Computation ◽

10.1162/0899766053011528 ◽

2005 ◽

Vol 17 (2) ◽

pp. 335-359 ◽

Cited By ~ 45

Author(s):

Jun Morimoto ◽

Kenji Doya

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Action Planning ◽

Control Agent ◽

Control Input ◽

Environmental Models ◽

Online Learning Algorithms ◽

The Difference ◽

The Value Function

This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both off-line learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a “disturbing” agent tries to make the worst possible disturbance while a “control” agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.

Download Full-text

Risk-Sensitive Reinforcement Learning

Neural Computation ◽

10.1162/neco_a_00600 ◽

2014 ◽

Vol 26 (7) ◽

pp. 1298-1328 ◽

Cited By ~ 19

Author(s):

Yun Shen ◽

Michael J. Tobia ◽

Tobias Sommer ◽

Klaus Obermayer

Keyword(s):

Reinforcement Learning ◽

Prospect Theory ◽

Human Behavior ◽

Transition Probabilities ◽

Learning Algorithm ◽

Sequential Decision ◽

Uncertain Environments ◽

Risk Sensitive ◽

Markov Decision ◽

Q Values

We derive a family of risk-sensitive reinforcement learning methods for agents, who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference (TD) error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents’ behaviors express key features of human behavior as predicted by prospect theory (Kahneman & Tversky, 1979 ), for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.

Download Full-text

Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33017949 ◽

2019 ◽

Vol 33 ◽

pp. 7949-7956

Author(s):

Silviu Pitis

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Value Function ◽

Discount Factor ◽

Theoretic Approach ◽

Decision Theoretic Approach ◽

Markov Decision ◽

The Value Function

Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor γ

Download Full-text

A reinforcement learning algorithm with fuzzy approximation for semi Markov decision problems

Journal of Intelligent & Fuzzy Systems ◽

10.3233/ifs-141460 ◽

2015 ◽

Vol 28 (4) ◽

pp. 1733-1744 ◽

Cited By ~ 1

Author(s):

Ufuk Kula ◽

Beyazıt Ocaktan

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Decision Problems ◽

Fuzzy Approximation ◽

Markov Decision Problems ◽

Markov Decision ◽

Reinforcement Learning Algorithm

Download Full-text

A Multi-Step Reinforcement Learning Algorithm

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.44-47.3611 ◽

2010 ◽

Vol 44-47 ◽

pp. 3611-3615 ◽

Cited By ~ 1

Author(s):

Zhi Cong Zhang ◽

Kai Shun Hu ◽

Hui Yu Huang ◽

Shuai Li ◽

Shao Yong Zhao

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Large Scale ◽

Learning Algorithm ◽

Machine Learning Method ◽

Learning Method ◽

K Value ◽

Markov Decision ◽

Action Value

Reinforcement learning (RL) is a state or action value based machine learning method which approximately solves large-scale Markov Decision Process (MDP) or Semi-Markov Decision Process (SMDP). A multi-step RL algorithm called Sarsa(,k) is proposed, which is a compromised variation of Sarsa and Sarsa(). It is equivalent to Sarsa if k is 1 and is equivalent to Sarsa() if k is infinite. Sarsa(,k) adjust its performance by setting k value. Two forms of Sarsa(,k), forward view Sarsa(,k) and backward view Sarsa(,k), are constructed and proved equivalent in off-line updating.

Download Full-text