Online Pandora’s Boxes and Bandits

Author(s):  
Hossein Esfandiari ◽  
MohammadTaghi HajiAghayi ◽  
Brendan Lucier ◽  
Michael Mitzenmacher

We consider online variations of the Pandora’s box problem (Weitzman 1979), a standard model for understanding issues related to the cost of acquiring information for decision-making. Our problem generalizes both the classic Pandora’s box problem and the prophet inequality framework. Boxes are presented online, each with a random value and cost drawn jointly from some known distribution. Pandora chooses online whether to open each box given its cost, and then chooses irrevocably whether to keep the revealed prize or pass on it. We aim for approximation algorithms against adversaries that can choose the largest prize over any opened box, and use optimal offline policies to decide which boxes to open (without knowledge of the value inside). We consider variations where Pandora can collect multiple prizes subject to feasibility constraints, such as cardinality, matroid, or knapsack constraints. We also consider variations related to classic multi-armed bandit problems from reinforcement learning. Our results use a reduction-based framework where we separate the issues of the cost of acquiring information from the online decision process of which prizes to keep. Our work shows that in many scenarios, Pandora can achieve a good approximation to the best possible performance.
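
The benchmark here is the classic offline setting of Weitzman (1979), in which each box has a reservation value and boxes are opened in decreasing order of that index. Purely for orientation, the following Python sketch implements that offline index policy for discrete prize distributions; the function names, the bisection solver, and the example boxes are illustrative assumptions, not material from the paper.

```python
import random

def reservation_value(values, probs, cost, hi=1e6, tol=1e-6):
    """Solve E[(V - sigma)^+] = cost for sigma by bisection, for a discrete
    prize distribution given as parallel lists of values and probabilities.
    Assumes cost <= E[V], so the reservation value is nonnegative."""
    def expected_surplus(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_surplus(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def run_weitzman(boxes, rng=random):
    """boxes: list of (values, probs, cost) triples. Open boxes in decreasing
    order of reservation value; stop once the best prize seen so far exceeds
    the reservation value of every unopened box (Weitzman's index rule)."""
    indexed = sorted(((reservation_value(*b), b) for b in boxes),
                     key=lambda t: t[0], reverse=True)
    best, total_cost = 0.0, 0.0
    for sigma, (values, probs, cost) in indexed:
        if best >= sigma:
            break  # no unopened box is worth its opening cost
        total_cost += cost
        best = max(best, rng.choices(values, weights=probs)[0])
    return best - total_cost

# Example: two boxes with simple two-point prize distributions.
boxes = [([0.0, 10.0], [0.5, 0.5], 1.0), ([0.0, 6.0], [0.2, 0.8], 0.5)]
print(run_weitzman(boxes))
```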

2018 ◽  
Author(s):  
Nura Sidarus ◽  
Stefano Palminteri ◽  
Valérian Chambon

Value-based decision-making involves trading off the cost associated with an action against its expected reward. Research has shown that both physical and mental effort constitute such subjective costs, biasing choices away from effortful actions, and discounting the value of obtained rewards. Facing conflicts between competing action alternatives is considered aversive, as recruiting cognitive control to overcome conflict is effortful. Yet, it remains unclear whether conflict is also perceived as a cost in value-based decisions. The present study investigated this question by embedding irrelevant distractors (flanker arrows) within a reversal-learning task, with intermixed free and instructed trials. Results showed that participants learned to adapt their choices to maximize rewards, but were nevertheless biased to follow the suggestions of irrelevant distractors. Thus, the perceived cost of being in conflict with an external suggestion could sometimes trump internal value representations. By adapting computational models of reinforcement learning, we assessed the influence of conflict at both the decision and learning stages. Modelling the decision showed that conflict was avoided when evidence for either action alternative was weak, demonstrating that the cost of conflict was traded off against expected rewards. During the learning phase, we found that learning rates were reduced in instructed, relative to free, choices. Learning rates were further reduced by conflict between an instruction and subjective action values, whereas learning was not robustly influenced by conflict between one’s actions and external distractors. Our results show that the subjective cost of conflict factors into value-based decision-making, and highlight that different types of conflict may have different effects on learning about action outcomes.
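
The model family described, with a conflict cost at the decision stage and condition-dependent learning rates at the learning stage, can be sketched in a few lines. The Python below illustrates that general structure and is not the authors' fitted model; the parameter names (kappa, beta, alpha_free, alpha_instr, alpha_instr_conflict) are assumptions.

```python
import numpy as np

def softmax(q, beta):
    """Softmax choice probabilities with inverse temperature beta."""
    z = beta * (q - q.max())
    p = np.exp(z)
    return p / p.sum()

def choose(Q, distractor, params, rng):
    """Decision stage: actions that disagree with the irrelevant distractor pay
    a conflict cost kappa, biasing choice toward the external suggestion."""
    biased = Q.copy()
    for a in range(len(Q)):
        if a != distractor:
            biased[a] -= params["kappa"]
    return int(rng.choice(len(Q), p=softmax(biased, params["beta"])))

def learn(Q, choice, reward, instructed, conflict, params):
    """Learning stage: the learning rate is lower on instructed than on free
    trials, and lower still when the instruction conflicts with the agent's
    own action values."""
    if instructed:
        alpha = params["alpha_instr_conflict"] if conflict else params["alpha_instr"]
    else:
        alpha = params["alpha_free"]
    Q = Q.copy()
    Q[choice] += alpha * (reward - Q[choice])
    return Q
```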


Author(s):  
Jaskanwal P. S. Chhabra ◽  
Gordon P. Warn

Engineers often employ, formally or informally, multi-fidelity computational models to aid design decision making. For example, the recent idea of viewing design as a Sequential Decision Process (SDP) provides a formal framework for sequencing multi-fidelity models to realize computational gains in the design process. Efficiency is achieved in the SDP because dominated designs are removed using less expensive (low-fidelity) models before higher-fidelity models are used, with the guarantee that the antecedent model only removes design solutions that remain dominated when analyzed using more detailed, higher-fidelity models. The set of multi-fidelity models and discrete decision states results in a combinatorial number of modeling sequences, some of which require significantly fewer model evaluations than others. It is desirable to sequence models optimally; however, the optimal modeling policy cannot be determined at the onset of the SDP because the computational cost and discriminatory power of executing all models on all designs is unknown. In this study, the model selection problem is formulated as a Markov Decision Process, and a classical reinforcement learning algorithm, namely Q-learning, is investigated to obtain and follow an approximately optimal modeling policy. The outcome is a methodology able to learn efficient sequencing of models by estimating their computational cost and discriminatory power while analyzing designs in the tradespace throughout the design process. Through application to a design example, the methodology is shown to: 1) effectively identify the approximately optimal modeling policy, and 2) efficiently converge upon a choice set.
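
As a point of reference, the learning component described is standard tabular Q-learning. A minimal sketch follows, in which states are assumed to encode the stage of the design process, actions pick which fidelity model to run next, and the reward is the negative computational cost of that run. This is a textbook update rule, not the authors' implementation.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated cost-to-go (negated)

def q_learning_step(Q, state, action, run_cost, next_state, next_actions,
                    alpha=0.1, gamma=1.0):
    """Tabular Q-learning update: the reward is the negative computational
    cost of running the chosen fidelity model in the current decision state."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    target = -run_cost + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """Pick the next model to run, mostly exploiting the learned cost-to-go
    estimates, occasionally exploring an alternative sequence."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```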


2016 ◽  
Vol 113 (45) ◽  
pp. 12868-12873 ◽  
Author(s):  
Mehdi Keramati ◽  
Peter Smittenaar ◽  
Raymond J. Dolan ◽  
Peter Dayan

Behavioral and neural evidence reveal a prospective goal-directed decision process that relies on mental simulation of the environment, and a retrospective habitual process that caches returns previously garnered from available choices. Artificial systems combine the two by simulating the environment up to some depth and then exploiting habitual values as proxies for consequences that may arise in the more distant future. Using a three-step task, we provide evidence that human subjects use such a normative plan-until-habit strategy, implying a spectrum of approaches that interpolates between habitual and goal-directed responding. We found that increasing time pressure led to shallower goal-directed planning, suggesting that a speed-accuracy tradeoff controls the depth of planning, with deeper search leading to more accurate evaluation at the cost of slower decision-making. We conclude that subjects integrate habit-based cached values directly into goal-directed evaluations in a normative manner.
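
A plan-until-habit evaluation can be sketched as depth-limited lookahead that bootstraps with cached values at the planning horizon. The code below is a schematic illustration of that idea, assuming for simplicity a deterministic transition model; it is not the paper's task model or fitted algorithm.

```python
def plan_until_habit(state, depth, model, cached_q, actions, gamma=0.95):
    """Evaluate each action by explicit lookahead ('planning') to a fixed
    depth, then substitute cached, habit-like values for everything beyond
    the horizon. Assumes a deterministic model(state, action) that returns
    (reward, next_state)."""
    def state_value(s, d):
        if d <= 0:
            return max(cached_q[(s, a)] for a in actions)  # habit takes over
        return max(action_value(s, a, d) for a in actions)
    def action_value(s, a, d):
        reward, s_next = model(s, a)
        return reward + gamma * state_value(s_next, d - 1)
    return {a: action_value(state, a, depth) for a in actions}
```

Shallower planning under time pressure then corresponds simply to calling the same routine with a smaller depth.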


2015 ◽  
Vol 54 ◽  
pp. 233-275 ◽  
Author(s):  
Meir Kalech ◽  
Shulamit Reches

When to make a decision is a key question in decision-making problems characterized by uncertainty. In this paper we deal with decision making in environments where information arrives dynamically. We address the tradeoff between waiting and stopping strategies. On the one hand, waiting to obtain more information reduces uncertainty, but it comes with a cost. On the other hand, stopping and making a decision based on an expected utility reduces the cost of waiting, but the decision is based on uncertain information. We propose an optimal algorithm and two approximation algorithms. We prove that one approximation is optimistic, waiting at least as long as the optimal algorithm, while the other is pessimistic, stopping no later than the optimal algorithm. We evaluate our algorithms theoretically and empirically and show that both approximations reach near-optimal decision quality while running much faster than the optimal algorithm. The experiments also indicate that the cost function is a key factor in choosing the most effective algorithm.
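
The waiting-versus-stopping tradeoff can be illustrated with a myopic (one-step-lookahead) rule: keep waiting only while the expected improvement from one more piece of information exceeds the cost of waiting. This sketch is for intuition only; it is neither the paper's optimal algorithm nor either of its approximations.

```python
def myopic_stop(current_best_eu, candidate_eus_if_wait, probs, wait_cost):
    """Return True if the agent should stop and decide now.
    current_best_eu: expected utility of the best decision given current info.
    candidate_eus_if_wait / probs: possible best expected utilities after one
    more piece of information arrives, with their probabilities."""
    expected_after_wait = sum(p * max(eu, current_best_eu)
                              for eu, p in zip(candidate_eus_if_wait, probs))
    value_of_waiting = expected_after_wait - current_best_eu
    return value_of_waiting <= wait_cost  # waiting no longer pays for itself
```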


Author(s):  
Clement Leung ◽  
Nikki Lijing Kuang ◽  
Vienne W. K. Sung

Organizations need to constantly learn, develop, and evaluate new strategies and policies for their effective operation. Unsupervised reinforcement learning is becoming a highly useful tool, since rewards and punishments in different forms are pervasive and present in a wide variety of decision-making scenarios. By observing the outcome of a sufficient number of repeated trials, one would gradually learn the value and usefulness of a particular policy or strategy. However, in a given environment, the outcomes resulting from different trials are subject to external chance influences and variations. Learning about the usefulness of a given policy involves significant costs in systematically undertaking the sequential trials; therefore, in most learning episodes, one would wish to keep the cost within bounds by adopting efficient stopping rules for learning. In this chapter, we explain the deployment of different learning strategies in given environments for reinforcement learning policy evaluation and review, and we present suggestions for their practical use and applications.
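
One simple way to bound the cost of sequential trials is to stop once the uncertainty about a policy's mean reward falls below a tolerance. The sketch below uses a normal-approximation confidence interval as the stopping rule; it is an illustrative example, not one of the chapter's specific strategies.

```python
import math

def evaluate_policy_with_stopping(run_trial, max_trials=1000,
                                  half_width=0.05, z=1.96, min_trials=30):
    """Sequential policy evaluation with a cost-bounding stopping rule:
    run trials until the confidence interval on the mean reward is narrower
    than `half_width`, or the trial budget is exhausted. `run_trial` is any
    callable that executes the policy once and returns its reward."""
    rewards = []
    while len(rewards) < max_trials:
        rewards.append(run_trial())
        n = len(rewards)
        if n >= min_trials:
            mean = sum(rewards) / n
            var = sum((r - mean) ** 2 for r in rewards) / (n - 1)
            if z * math.sqrt(var / n) <= half_width:
                break  # estimate is precise enough; stop paying for trials
    return sum(rewards) / len(rewards), len(rewards)
```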


2021 ◽  
Author(s):  
Maximilian Puelma Touzel ◽  
Paul Cisek ◽  
Guillaume Lajoie

The value we place on our time impacts what we decide to do with it. Value it too little, and we obsess over all details. Value it too much, and we rush carelessly to move on. How to strike this often context-specific balance is a challenging decision-making problem. The average reward, putatively encoded by tonic dopamine, serves in existing reinforcement learning theory as the stationary opportunity cost of time. However, environmental context and the cost of deliberation therein often vary in time and are hard to infer and predict. Here, we define a non-stationary opportunity cost of deliberation arising from performance variation on multiple timescales. Estimated from reward history, this cost readily adapts to reward-relevant changes in context and suggests a generalization of average-reward reinforcement learning (AR-RL) to account for non-stationary contextual factors. We use this deliberation cost in a simple decision-making heuristic called Performance-Gated Deliberation, which approximates AR-RL and is consistent with empirical results in both cognitive and systems decision-making neuroscience. We propose that the deliberation cost is implemented directly as urgency, a previously characterized neural signal that effectively controls the speed of the decision-making process. We use behaviour and neural recordings from non-human primates in a non-stationary random walk prediction task to support our results. We make readily testable predictions for both neural activity and behaviour and discuss how this proposal can facilitate future work in the cognitive and systems neuroscience of reward-driven behaviour.
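
One way to read "estimated from reward history on multiple timescales" is as a combination of fast and slow running averages of the reward rate. The sketch below is offered for concreteness only; the two-timescale form, the time constants, and the weighting are assumptions, not the paper's estimator.

```python
def update_deliberation_cost(fast_avg, slow_avg, reward, dt,
                             tau_fast=10.0, tau_slow=1000.0, weight=1.0):
    """Track the reward rate on two timescales with exponential moving
    averages and combine them into a non-stationary per-unit-time cost of
    deliberation. `reward` is the reward collected over the last interval
    of duration `dt`."""
    a_fast = dt / tau_fast
    a_slow = dt / tau_slow
    fast_avg += a_fast * (reward / dt - fast_avg)
    slow_avg += a_slow * (reward / dt - slow_avg)
    # Cost of spending more time deliberating: the long-run reward rate plus
    # a term reflecting how current performance deviates from that baseline.
    cost_per_unit_time = slow_avg + weight * (fast_avg - slow_avg)
    return fast_avg, slow_avg, cost_per_unit_time
```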


Author(s):  
Syed Ihtesham Hussain Shah ◽  
Giuseppe De Pietro

In decision-making problems, the reward function plays an important role in finding the best policy. Reinforcement Learning (RL) provides a solution for decision-making problems under uncertainty in an Intelligent Environment (IE). However, it is difficult to specify the reward function for RL agents in large and complex problems. To counter this problem, an extension of RL named Inverse Reinforcement Learning (IRL) was introduced, in which the reward function is learned from expert demonstrations. IRL is appealing for its potential use in building autonomous agents capable of modeling others without compromising performance on the task. This approach of learning from demonstrations relies on the framework of the Markov Decision Process (MDP). This article elaborates on the original IRL algorithms along with close variants that mitigate their challenges. The purpose of this paper is to provide an overview and theoretical background of IRL in the fields of Machine Learning (ML) and Artificial Intelligence (AI). We also present a brief comparison between different variants of IRL.
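
A concrete instance of learning a reward from demonstrations is the classic linear-reward, feature-matching formulation: estimate the expert's discounted feature expectations and adjust reward weights so the expert looks at least as good as the current policy. The sketch below illustrates only that one formulation, not the full range of IRL variants the article compares; the function names are illustrative.

```python
import numpy as np

def feature_expectations(trajectories, featurize, gamma=0.9):
    """Average discounted feature counts over trajectories (lists of states).
    `featurize(state)` is assumed to return a NumPy feature vector."""
    mu = None
    for traj in trajectories:
        phi = sum((gamma ** t) * featurize(s) for t, s in enumerate(traj))
        mu = phi if mu is None else mu + phi
    return mu / len(trajectories)

def irl_weight_step(w, mu_expert, mu_policy, lr=0.1):
    """One gradient-style step on linear reward weights: push the reward in
    the direction that makes the expert's feature expectations score higher
    than the current policy's."""
    w = w + lr * (mu_expert - mu_policy)
    return w / (np.linalg.norm(w) + 1e-12)  # keep the weights bounded
```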


Author(s):  
Robin Markwica

In coercive diplomacy, states threaten military action to persuade opponents to change their behavior. The goal is to achieve a target’s compliance without incurring the cost in blood and treasure of military intervention. Coercers typically employ this strategy toward weaker actors, but targets often refuse to submit and the parties enter into war. To explain these puzzling failures of coercive diplomacy, existing accounts generally refer to coercers’ perceived lack of resolve or targets’ social norms and identities. What these approaches either neglect or do not examine systematically is the role that emotions play in these encounters. The present book contends that target leaders’ affective experience can shape their decision-making in significant ways. Drawing on research in psychology and sociology, the study introduces an additional, emotion-based action model besides the traditional logics of consequences and appropriateness. This logic of affect, or emotional choice theory, posits that target leaders’ choice behavior is influenced by the dynamic interplay between their norms, identities, and five key emotions, namely fear, anger, hope, pride, and humiliation. The core of the action model consists of a series of propositions that specify the emotional conditions under which target leaders are likely to accept or reject a coercer’s demands. The book applies the logic of affect to Nikita Khrushchev’s decision-making during the Cuban missile crisis in 1962 and Saddam Hussein’s choice behavior in the Gulf conflict in 1990–91, offering a novel explanation for why coercive diplomacy succeeded in one case but not in the other.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Batel Yifrah ◽  
Ayelet Ramaty ◽  
Genela Morris ◽  
Avi Mendelsohn

Decision making can be shaped both by trial-and-error experiences and by memory of unique contextual information. Moreover, these types of information can be acquired either by means of active experience or by observing others behave in similar situations. The interactions between reinforcement learning parameters that inform decision updating and memory formation of declarative information in experienced and observational learning settings are, however, unknown. In the current study, participants took part in a probabilistic decision-making task involving situations that either yielded similar outcomes to those of an observed player or opposed them. By fitting alternative reinforcement learning models to each subject, we distinguished participants who learned similarly from experience and observation from those who assigned different weights to learning signals from these two sources. Participants who assigned different weights to their own experience versus that of others displayed enhanced memory performance, as well as subjective memory strength, for episodes involving significant reward prospects. Conversely, the memory performance of participants who did not prioritize their own experience over that of others did not seem to be influenced by reinforcement learning parameters. These findings demonstrate that interactions between implicit and explicit learning systems depend on the means by which individuals weigh relevant information conveyed via experience and observation.
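
The model comparison described amounts to asking whether outcomes experienced first-hand and outcomes observed from the other player update values with the same weight. A minimal sketch of such a dual-source update follows; the parameter names (alpha_self, alpha_observed) are assumptions, not the authors' exact models.

```python
def update_values(Q, action, reward, source, params):
    """Q-value update in which first-hand and observed outcomes can carry
    different learning rates. `source` is "self" for experienced outcomes and
    "other" for observed ones; Q maps actions to value estimates."""
    alpha = params["alpha_self"] if source == "self" else params["alpha_observed"]
    Q = dict(Q)
    Q[action] = Q[action] + alpha * (reward - Q[action])
    return Q
```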

