Enhancing Greedy Policy Techniques for Complex Cost-Sensitive Problems

10.5772/6357 ◽  
2008 ◽  
Author(s):  
Camelia Vidrighin ◽  
Rodica Potolea
2020 ◽  
Author(s):  
Alberto Vera ◽  
Siddhartha Banerjee

We develop a new framework for designing online policies given access to an oracle providing statistical information about an off-line benchmark. Having access to such prediction oracles enables simple and natural Bayesian selection policies and raises the question as to how these policies perform in different settings. Our work makes two important contributions toward this question: First, we develop a general technique we call compensated coupling, which can be used to derive bounds on the expected regret (i.e., additive loss with respect to a benchmark) for any online policy and off-line benchmark. Second, using this technique, we show that a natural greedy policy, which we call the Bayes selector, has constant expected regret (i.e., independent of the number of arrivals and resource levels) for a large class of problems we refer to as “online allocation with finite types,” which includes widely studied online packing and online matching problems. Our results generalize and simplify several existing results for online packing and online matching and suggest a promising pathway for obtaining oracle-driven policies for other online decision-making settings. This paper was accepted by George Shanthikumar, big data analytics.
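As a concrete illustration of the kind of policy the abstract describes, the following is a minimal sketch of a Bayes-selector-style greedy rule for single-resource online packing with finite types. It is not the paper's implementation; the unit-size types, the 0.5 rounding threshold, and the names fluid_accept_fraction and bayes_selector_decision are assumptions made for illustration.

```python
# A minimal sketch (not the paper's implementation) of a Bayes-selector-style
# greedy policy for single-resource online packing with finite types.
# Assumptions: each type j has reward r[j] and unit size; an "oracle" supplies
# the expected number of future arrivals of each type. The fluid relaxation of
# a unit-size knapsack is solved greedily by reward, and the current arrival is
# accepted iff the fluid solution would accept most of its type.

def fluid_accept_fraction(rewards, expected_arrivals, budget):
    """Solve the fluid (LP) relaxation greedily: fill the budget with the
    highest-reward types first; return the accepted fraction per type."""
    frac = {j: 0.0 for j in rewards}
    remaining = budget
    for j in sorted(rewards, key=rewards.get, reverse=True):
        take = min(expected_arrivals[j], remaining)
        frac[j] = take / expected_arrivals[j] if expected_arrivals[j] > 0 else 0.0
        remaining -= take
        if remaining <= 0:
            break
    return frac

def bayes_selector_decision(arrival_type, rewards, expected_arrivals, budget):
    """Accept the arrival iff the re-solved fluid relaxation accepts at least
    half of its type (one common rounding of the Bayes-selector decision)."""
    if budget <= 0:
        return False
    frac = fluid_accept_fraction(rewards, expected_arrivals, budget)
    return frac[arrival_type] >= 0.5

# Example: two types, limited budget, oracle forecasts of remaining arrivals.
rewards = {"A": 5.0, "B": 1.0}
forecast = {"A": 3.0, "B": 10.0}
print(bayes_selector_decision("B", rewards, forecast, budget=4))  # False: save room for type A
```

The point reflected here is that each decision is made by re-solving an offline (fluid) relaxation fed by the oracle's forecast, rather than by learning a value function.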


Author(s):  
Maury Bramson ◽  
Bernardo D’Auria ◽  
Neil Walton

Consider a switched queueing network with general routing among its queues. The MaxWeight policy assigns available service by maximizing the objective function Σ_i q_i σ_i among the different feasible service options, where q_i denotes the size of queue i and σ_i denotes the amount of service to be executed at queue i. MaxWeight is a greedy policy that does not depend on knowledge of arrival rates and is straightforward to implement. These properties and its simple formulation suggest MaxWeight as a serious candidate for implementation in the setting of switched queueing networks; MaxWeight has been extensively studied in the context of communication networks. However, a fluid model variant of MaxWeight was previously shown not to be maximally stable. Here, we prove that MaxWeight itself is not in general maximally stable. We also prove MaxWeight is maximally stable in a much more restrictive setting, and that a weighted version of MaxWeight, where the weighting depends on the traffic intensity, is always stable.
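The MaxWeight rule itself is simple to state in code. Below is a minimal sketch, not taken from the paper, that selects the feasible service vector maximizing Σ_i q_i·s_i; the particular schedule set in the example is a made-up illustration (think of matchings in an input-queued switch).

```python
# A minimal sketch (not from the paper) of MaxWeight scheduling for a switched
# queueing network: among the feasible service options, pick the one maximizing
# sum_i q_i * s_i, where q_i is the current length of queue i and s_i is the
# service it would receive under that option.

def maxweight_schedule(queue_lengths, feasible_schedules):
    """Return the feasible service vector with the largest weight sum_i q_i * s_i."""
    def weight(schedule):
        return sum(q * s for q, s in zip(queue_lengths, schedule))
    return max(feasible_schedules, key=weight)

# Example: three queues; each schedule says how much service each queue gets.
queues = [7, 2, 5]
schedules = [(1, 0, 0), (0, 1, 1), (1, 0, 1)]
print(maxweight_schedule(queues, schedules))  # (1, 0, 1): weight 12
```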


Symmetry ◽  
2019 ◽  
Vol 11 (11) ◽  
pp. 1352 ◽  
Author(s):  
Kim ◽  
Park

In deep reinforcement learning (RL), exploration is crucial for achieving good generalization. In benchmark studies, ε-greedy random actions have been used to encourage exploration and prevent over-fitting, thereby improving generalization. Deep RL with random ε-greedy policies, such as deep Q-networks (DQNs), can exhibit efficient exploration behavior. A random ε-greedy policy can exploit additional replay buffers in environments with sparse, binary rewards, such as real-time online network-security detection that verifies whether the network is “normal or anomalous.” Prior studies have shown that prioritized replay memory based on the temporal-difference error provides superior theoretical results. However, other implementations have shown that in certain environments prioritized replay memory is not superior to the randomly selected buffers of a random ε-greedy policy. Moreover, a key challenge of hindsight experience replay, which uses additional buffers corresponding to different goals, motivates our objective. We therefore exploit multiple random ε-greedy buffers to enhance exploration toward better generalization with a single original goal in off-policy RL. We demonstrate the benefit of off-policy learning with our method through an experimental comparison of DQN and deep deterministic policy gradient on discrete-action as well as continuous-control tasks in fully symmetric environments.
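For readers unfamiliar with the mechanics, here is a minimal, hedged sketch of ε-greedy action selection combined with multiple replay buffers. It is illustrative only, not the authors' DQN/DDPG code; the names q_values, buffers, and store_transition are assumptions.

```python
import random

# A minimal sketch (not the paper's code) of epsilon-greedy action selection with
# multiple replay buffers: the agent explores with probability epsilon and stores
# each transition in a randomly chosen buffer, from which minibatches would later
# be drawn for off-policy updates.

def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def store_transition(buffers, transition):
    """Spread experience across several replay buffers chosen uniformly at random."""
    random.choice(buffers).append(transition)

# Example usage with three buffers and a toy Q-value table for one state.
buffers = [[], [], []]
action = epsilon_greedy_action(q_values=[0.1, 0.7, 0.2], epsilon=0.1)
store_transition(buffers, ("state", action, 0.0, "next_state", False))
```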


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Xinglin Yu ◽  
Yuhu Wu ◽  
Xi-Ming Sun ◽  
Wenya Zhou

Abstract Balancing exploration and exploitation in reinforcement learning is a common dilemma and can be time-consuming to resolve. In this paper, a novel exploration policy for Q-learning, called the memory-greedy policy, is proposed to accelerate learning. Through memory storage and playback, the probability of selecting random actions is effectively reduced, which speeds up learning. The principle of this policy is analyzed in a maze scenario, and its theoretical convergence is established via dynamic programming.
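The abstract describes the memory-greedy policy only at a high level, so the sketch below is one plausible reading rather than the authors' algorithm: exploration draws on a memory of previously successful actions instead of acting uniformly at random. The names memory_greedy_action and remember and the positive-reward storage rule are assumptions.

```python
import random

# A heavily hedged sketch of a "memory-greedy" exploration rule as suggested by
# the abstract (an interpretation, not the authors' code): successful actions are
# stored per state, and exploration replays from that memory when available
# instead of acting uniformly at random, reducing wasted random moves.

def memory_greedy_action(state, q_table, memory, epsilon, n_actions):
    """Exploit the greedy action; when exploring, replay a remembered action
    for this state if one exists, otherwise explore uniformly at random."""
    if random.random() >= epsilon:                      # exploit
        return max(range(n_actions), key=lambda a: q_table[state][a])
    if memory.get(state):                               # guided exploration
        return random.choice(memory[state])
    return random.randrange(n_actions)                  # uniform exploration

def remember(memory, state, action, reward):
    """Store actions that produced positive reward for later playback."""
    if reward > 0:
        memory.setdefault(state, []).append(action)

# Example: one state "s0", two actions, a remembered success for action 1.
q_table, memory = {"s0": [0.0, 0.0]}, {}
remember(memory, "s0", action=1, reward=1.0)
print(memory_greedy_action("s0", q_table, memory, epsilon=0.5, n_actions=2))
```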


Author(s):  
Xinyi Li ◽  
Hui He ◽  
Pengfei Chen ◽  
Xiaohui Zhang ◽  
Li Su ◽  
...  

1994 ◽  
Vol 26 (04) ◽  
pp. 1095-1116 ◽  
Author(s):  
Eitan Altman ◽  
Hanoch Levy

We consider a problem in which a single server must serve a stream of customers whose arrivals are distributed over a finite-size convex space. Under the assumption that the server has full information on the customer locations, obvious service policies are the FCFS and the greedy (serve-the-closest-customer) approaches. These algorithms are, however, either inefficient (FCFS) or 'unfair' (greedy). We propose and study two alternative algorithms, the gated-greedy policy and the gated-scan policy, which are more 'fair' than the pure greedy method. We show that the stability condition of the gated-greedy policy is ρ < 1 (where ρ is the expected rate at which work arrives at the system), implying that the method is at least as efficient (in terms of system stability) as any other discipline, in particular the greedy one. For the gated-scan policy we show that for any ρ < 1 one can design a stable gated-scan policy; however, for any fixed gated-scan policy there exists ρ < 1 for which the policy is unstable. We evaluate the performance of the gated-scan policy and present bounds for the performance of the gated-greedy policy. These results are derived for systems in which arrivals occur on a two-dimensional space (a square), but they are not limited to this configuration; rather, they hold for more complex N-dimensional spaces, in particular for serving customers in a (three-dimensional) convex space and serving customers on a line.
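To make the gated-greedy discipline concrete, here is a minimal sketch (an illustration, not the paper's model or analysis): the server gates the customers currently present, serves that batch closest-first, and only then admits the arrivals that accumulated in the meantime. The planar coordinates and function name are assumptions.

```python
import math

# A minimal sketch (illustration only, not the paper's analysis) of the
# gated-greedy discipline: the server "gates" the set of customers present,
# serves only those customers in greedy (closest-first) order, and then opens
# a new gate for the arrivals that accumulated meanwhile.

def gated_greedy_order(server_pos, gated_customers):
    """Serve the gated batch closest-first, returning the visiting order."""
    order, pos, remaining = [], server_pos, list(gated_customers)
    while remaining:
        nxt = min(remaining, key=lambda c: math.dist(pos, c))
        order.append(nxt)
        remaining.remove(nxt)
        pos = nxt
    return order

# Example: customers on the unit square; later arrivals wait for the next gate.
print(gated_greedy_order((0.0, 0.0), [(0.9, 0.1), (0.2, 0.3), (0.5, 0.8)]))
```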


2009 ◽  
Vol 54 (12) ◽  
pp. 2787-2802 ◽  
Author(s):  
A.J. Mersereau ◽  
P. Rusmevichientong ◽  
J.N. Tsitsiklis

2021 ◽  
Vol 15 (5) ◽  
pp. 1-23
Author(s):  
Jianxiong Guo ◽  
Weili Wu

The influence maximization problem seeks a small subset of seed nodes that maximizes the expected influence spread and has been studied intensively. Prior work assumed that every user in the selected seed set is activated successfully and then spreads influence. In real scenarios, however, not every user in the seed set is willing to be an influencer. We therefore associate each user with a probability of accepting activation as a seed, and we may attempt to activate a user multiple times. In this article, we study the adaptive influence maximization with multiple activations (Adaptive-IMMA) problem: in each iteration we select a node and observe whether she accepts to be a seed; if yes, we wait and observe the influence diffusion process; if no, we may attempt to activate her again at a higher cost or select another node as a seed. We model multiple activations mathematically on the domain of the integer lattice. We propose a new concept, adaptive dr-submodularity, and show that Adaptive-IMMA maximizes an adaptive monotone, dr-submodular function under an expected knapsack constraint. Adaptive dr-submodular maximization is not covered by any existing study, so we summarize its properties and study its approximability comprehensively, a non-trivial generalization of existing analyses of adaptive submodularity. Moreover, to overcome the difficulty of estimating the expected influence spread, we combine our adaptive greedy policy with sampling techniques, reducing the time complexity without losing the approximation ratio. Finally, we conduct experiments on several real datasets to evaluate the effectiveness and efficiency of our proposed policies.
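As a rough illustration of an adaptive greedy policy with multiple activation attempts, the following sketch selects, in each round, the node with the best estimated marginal spread per unit cost, attempts to activate it, and on refusal either retries at a higher cost or abandons it. It is not the paper's algorithm; marginal_spread, retry_factor, max_attempts, and the cost-escalation rule are hypothetical stand-ins (the paper combines its greedy policy with sampling-based spread estimation).

```python
import random

# A minimal sketch (illustration under simplifying assumptions, not the paper's
# algorithm) of adaptive greedy seed selection with multiple activation attempts.
# marginal_spread(node, seeds) is a hypothetical estimator (e.g., sampling-based).

def adaptive_greedy_imma(nodes, accept_prob, base_cost, budget, marginal_spread,
                         retry_factor=1.5, max_attempts=2):
    seeds, spent = [], 0.0
    attempts = {v: 0 for v in nodes}
    candidates = set(nodes)
    while candidates and spent < budget:
        # Greedy choice: largest estimated marginal gain per unit of current cost.
        cost = lambda v: base_cost[v] * (retry_factor ** attempts[v])
        v = max(candidates, key=lambda u: marginal_spread(u, seeds) / cost(u))
        if spent + cost(v) > budget:
            break
        spent += cost(v)
        attempts[v] += 1
        if random.random() < accept_prob[v]:         # node accepts: becomes a seed
            seeds.append(v)
            candidates.discard(v)
        elif attempts[v] >= max_attempts:             # give up on this node
            candidates.discard(v)
    return seeds, spent

# Toy usage: marginal spread taken (hypothetically) as a fixed per-node value.
nodes = ["u", "v", "w"]
toy_spread = {"u": 5.0, "v": 3.0, "w": 1.0}
seeds, cost = adaptive_greedy_imma(
    nodes,
    accept_prob={"u": 0.9, "v": 0.5, "w": 0.8},
    base_cost={"u": 1.0, "v": 1.0, "w": 1.0},
    budget=3.0,
    marginal_spread=lambda v, seeds: toy_spread[v] if v not in seeds else 0.0,
)
print(seeds, cost)
```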

