Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes

Author(s):  
Shun Zhang ◽  
Edmund H. Durfee ◽  
Satinder Singh

As it achieves a goal on behalf of its human user, an autonomous agent's actions may have side effects that change features of its environment in ways that negatively surprise its user. An agent that can be trusted to operate safely should thus only change features the user has explicitly permitted. We formalize this problem, and develop a planning algorithm that avoids potentially negative side effects given what the agent knows about (un)changeable features. Further, we formulate a provably minimax-regret querying strategy for the agent to selectively ask the user about features that it hasn't explicitly been told about. We empirically show how much faster it is than a more exhaustive approach and how much better its queries are than those found by the best known heuristic.
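As an illustration only (the notation here is assumed, not drawn from the abstract), the minimax-regret criterion for choosing a query q from a candidate set Q can be written as

q^* \in \arg\min_{q \in Q} \max_{r} \big[ V(\pi^*_r) - V(\pi_{q,r}) \big],

where r ranges over the user's possible responses, \pi^*_r is the best safe policy an agent with full knowledge of r could follow, and \pi_{q,r} is the best safe policy the agent can commit to after asking q and observing r.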

Author(s):  
Sandhya Saisubramanian ◽  
Ece Kamar ◽  
Shlomo Zilberstein

Agents operating in unstructured environments often create negative side effects (NSE) that may not be easy to identify at design time. We examine how various forms of human feedback or autonomous exploration can be used to learn a penalty function associated with NSE during system deployment. We formulate the problem of mitigating the impact of NSE as a multi-objective Markov decision process with lexicographic reward preferences and slack. The slack denotes the maximum deviation from an optimal policy with respect to the agent's primary objective allowed in order to mitigate NSE as a secondary objective. Empirical evaluation of our approach shows that the proposed framework can successfully mitigate NSE and that different feedback mechanisms introduce different biases, which influence the identification of NSE.
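In outline, and with notation assumed here for illustration, the lexicographic objective with slack \delta keeps only policies that are near-optimal on the primary objective and, among them, minimizes the expected NSE penalty:

\pi^* \in \arg\min_{\pi \in \Pi_\delta} V_{\mathrm{NSE}}(\pi), \qquad \Pi_\delta = \{\pi : V_1(\pi) \ge \max_{\pi'} V_1(\pi') - \delta\},

so the slack \delta directly bounds how much primary-objective value may be sacrificed in order to mitigate side effects.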


2017 ◽  
Vol 59 ◽  
pp. 229-264 ◽  
Author(s):  
Asrar Ahmed ◽  
Pradeep Varakantham ◽  
Meghna Lowalekar ◽  
Yossiri Adulyasak ◽  
Patrick Jaillet

Markov Decision Processes (MDPs) are an effective model to represent decision processes in the presence of transitional uncertainty and reward tradeoffs. However, due to the difficulty in exactly specifying the transition and reward functions in MDPs, researchers have proposed uncertain MDP models and robustness objectives for solving those models. Most approaches for computing robust policies have focused on the computation of maximin policies, which maximize the value in the worst case amongst all realisations of uncertainty. Given the overly conservative nature of maximin policies, recent work has proposed minimax regret as an ideal alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only, and they are also limited in their scalability. Therefore, we provide a general model of uncertain MDPs that considers uncertainty over both transition and reward functions. Furthermore, we consider dependence of the uncertainty across different states and decision epochs. We also provide a mixed integer linear program formulation for minimizing regret given a set of samples of the transition and reward functions in the uncertain MDP. In addition, we provide two myopic variants of regret, namely Cumulative Expected Myopic Regret (CEMR) and One Step Regret (OSR), which can be optimized in a scalable manner. Specifically, we provide dynamic programming and policy iteration based algorithms to optimize CEMR and OSR, respectively. Finally, to demonstrate the effectiveness of our approaches, we provide comparisons on two benchmark problems from the literature. We observe that optimizing the myopic variants of regret, OSR and CEMR, is better than directly optimizing the regret.
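For a set \Xi of sampled (transition, reward) realisations, the regret objective being minimized can be stated as follows (illustrative notation, not taken verbatim from the paper):

\mathrm{reg}(\pi) = \max_{\xi \in \Xi} \big[ V^*_\xi - V_\xi(\pi) \big], \qquad \pi^{\mathrm{mmr}} \in \arg\min_{\pi} \mathrm{reg}(\pi),

where V^*_\xi is the optimal value under sample \xi and V_\xi(\pi) is the value of policy \pi under that sample; the mixed integer linear program encodes this min-max over the sampled models.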


2020 ◽  
Vol 34 (03) ◽  
pp. 2552-2559
Author(s):  
Shun Zhang ◽  
Edmund Durfee ◽  
Satinder Singh

An autonomous agent acting on behalf of a human user has the potential to cause side effects that surprise the user in unsafe ways. When the agent cannot formulate a policy with only side effects it knows are safe, it needs to selectively query the user about whether other useful side effects are safe. Our goal is an algorithm that queries about as few potential side effects as possible to find a safe policy, or to prove that none exists. We extend prior work on irreducible infeasible sets to also handle our problem's complication that a constraint to avoid a side effect cannot be relaxed without user permission. By proving that our objectives are also adaptive submodular, we devise a querying algorithm that we empirically show finds nearly-optimal queries with much less computation than a guaranteed-optimal approach, and outperforms competing approximate approaches.
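A minimal sketch of the greedy rule that adaptive submodularity justifies is given below, assuming hypothetical expected_gain and ask_user callbacks; it is not the authors' implementation.

def greedy_adaptive_queries(candidates, expected_gain, ask_user, budget):
    # Greedily ask about the feature with the largest expected marginal gain
    # given the answers observed so far (the adaptive-submodular greedy rule).
    answers = {}
    for _ in range(budget):
        remaining = [f for f in candidates if f not in answers]
        if not remaining:
            break
        best = max(remaining, key=lambda f: expected_gain(f, answers))
        answers[best] = ask_user(best)  # e.g., True if the side effect is permitted
    return answers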


2020 ◽  
Vol 34 (06) ◽  
pp. 9794-9801
Author(s):  
Tomáš Brázdil ◽  
Krishnendu Chatterjee ◽  
Petr Novotný ◽  
Jiří Vahala

Markov decision processes (MDPs) are the de facto framework for sequential decision making in the presence of stochastic uncertainty. A classical optimization criterion for MDPs is to maximize the expected discounted-sum payoff, which ignores low-probability catastrophic events with highly negative impact on the system. On the other hand, risk-averse policies require the probability of undesirable events to be below a given threshold, but they do not account for optimization of the expected payoff. We consider MDPs with discounted-sum payoff and failure states which represent catastrophic outcomes. The objective of risk-constrained planning is to maximize the expected discounted-sum payoff among risk-averse policies that ensure the probability of encountering a failure state is below a desired threshold. Our main contribution is an efficient risk-constrained planning algorithm that combines UCT-like search with a predictor learned through interaction with the MDP (in the style of AlphaZero) and with risk-constrained action selection via linear programming. We demonstrate the effectiveness of our approach with experiments on classical MDPs from the literature, including benchmarks with on the order of 10^6 states.
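As a rough sketch of the linear-programming step only (names and inputs are assumed here; in the algorithm the per-action value and failure-probability estimates would come from the search and the learned predictor), one can select an action distribution that maximizes expected value subject to a risk bound:

import numpy as np
from scipy.optimize import linprog

def risk_constrained_distribution(values, risks, threshold):
    # Maximize sum_a p(a) * values[a] subject to sum_a p(a) * risks[a] <= threshold,
    # with p a probability vector over the available actions.
    values = np.asarray(values, dtype=float)
    risks = np.asarray(risks, dtype=float)
    n = len(values)
    result = linprog(
        c=-values,                    # linprog minimizes, so negate the values
        A_ub=risks.reshape(1, n),     # expected failure probability <= threshold
        b_ub=[threshold],
        A_eq=np.ones((1, n)),         # probabilities sum to one
        b_eq=[1.0],
        bounds=[(0.0, 1.0)] * n,
    )
    return result.x if result.success else None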


1983 ◽  
Vol 20 (04) ◽  
pp. 835-842
Author(s):  
David Assaf

The paper presents sufficient conditions for certain functions to be convex. Functions of this type often appear in Markov decision processes, where their maximum is the solution of the problem. Since a convex function takes its maximum at an extreme point, the conditions may greatly simplify a problem. In some cases a full solution may be obtained after the reduction is made. Some illustrative examples are discussed.
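The underlying fact is that a convex function on a compact convex set attains its maximum at an extreme point: for a polytope P with vertex set \mathrm{vert}(P),

\max_{x \in P} f(x) = \max_{v \in \mathrm{vert}(P)} f(v) \quad \text{for convex } f,

so once convexity is established the search can be restricted to the finitely many extreme points.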

