Reinforcement Learning-Based Multihop Relaying: A Decentralized Q-Learning Approach

Conventional optimization-based relay selection for multihop networks cannot resolve the conflict between performance and cost. The optimal selection policy is centralized and requires local channel state information (CSI) of all hops, leading to high computational complexity and signaling overhead. Other optimization-based decentralized policies cause non-negligible performance loss. In this paper, we exploit the benefits of reinforcement learning in relay selection for multihop clustered networks and aim to achieve high performance with limited costs. Multihop relay selection problem is modeled as Markov decision process (MDP) and solved by a decentralized Q-learning scheme with rectified update function. Simulation results show that this scheme achieves near-optimal average end-to-end (E2E) rate. Cost analysis reveals that it also reduces computation complexity and signaling overhead compared with the optimal scheme.

Download Full-text

Cloud Load Balancing and Reinforcement Learning

Advances in Business Information Systems and Analytics - Cloud Computing Technologies for Green Enterprises ◽

10.4018/978-1-5225-3038-1.ch011 ◽

2018 ◽

pp. 266-291

Author(s):

Abdelghafour Harraz ◽

Mostapha Zbakh

Keyword(s):

Artificial Intelligence ◽

Reinforcement Learning ◽

Load Balancing ◽

Decision Process ◽

Cloud System ◽

Human Intervention ◽

Q Learning ◽

State Action ◽

Learning Techniques ◽

Markov Decision

Artificial Intelligence allows to create engines that are able to explore, learn environments and therefore create policies that permit to control them in real time with no human intervention. It can be applied, through its Reinforcement Learning techniques component, using frameworks such as temporal differences, State-Action-Reward-State-Action (SARSA), Q Learning to name a few, to systems that are be perceived as a Markov Decision Process, this opens door in front of applying Reinforcement Learning to Cloud Load Balancing to be able to dispatch load dynamically to a given Cloud System. The authors will describe different techniques that can used to implement a Reinforcement Learning based engine in a cloud system.

Download Full-text

Classification with Costly Features Using Deep Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013959 ◽

2019 ◽

Vol 33 ◽

pp. 3959-3966 ◽

Cited By ~ 8

Author(s):

Jaromír Janisch ◽

Tomáš Pevný ◽

Viliam Lisý

Keyword(s):

Reinforcement Learning ◽

Linear Approximation ◽

High Performance ◽

Classification Problem ◽

Classification Error ◽

Sequential Decision ◽

Learning Enhancement ◽

Q Learning ◽

Prior Art ◽

Feature Values

We study a classification problem where each feature can be acquired for a cost and the goal is to optimize a trade-off between the expected classification error and the feature cost. We revisit a former approach that has framed the problem as a sequential decision-making problem and solved it by Q-learning with a linear approximation, where individual actions are either requests for feature values or terminate the episode by providing a classification decision. On a set of eight problems, we demonstrate that by replacing the linear approximation with neural networks the approach becomes comparable to the state-of-the-art algorithms developed specifically for this problem. The approach is flexible, as it can be improved with any new reinforcement learning enhancement, it allows inclusion of pre-trained high-performance classifier, and unlike prior art, its performance is robust across all evaluated datasets.

Download Full-text

Reinforcement Learning Applied to a Differential Game

Adaptive Behavior ◽

10.1177/105971239500400102 ◽

1995 ◽

Vol 4 (1) ◽

pp. 3-28 ◽

Cited By ~ 15

Author(s):

Mance E. Harmon ◽

Leemon C. Baird ◽

A. Harry Klopf

Keyword(s):

Reinforcement Learning ◽

Differential Game ◽

Learning Algorithm ◽

Learning System ◽

Test Bed ◽

Linear Quadratic ◽

Time Step ◽

Q Learning ◽

Step Duration ◽

Markov Decision

An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual-gradient form of advantage updating. The game is a Markov decision process with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. Although a missile and plane scenario was the chosen test bed, the reinforcement learning approach presented here is equally applicable to biologically based systems, such as a predator pursuing prey. The reinforcement learning algorithm for optimal control is modified for differential games to find the minimax point rather than the maximum. Simulation results are compared to the analytical solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual-gradient and non-residual-gradient forms of advantage updating and Q-learning are compared, demonstrating that advantage updating converges faster than Q-learning in all simulations. Advantage updating also is demonstrated to converge regardless of the time step duration; Q-learning is unable to converge as the time step duration grows small.

Download Full-text

Q-Learning Based Optimum Relay Selection for a SWIPT-Enabled Wireless System

10.1007/978-3-030-90196-7_10 ◽

2021 ◽

pp. 100-107

Author(s):

Haojie Wang ◽

Bo Li

Keyword(s):

Relay Selection ◽

Wireless System ◽

Q Learning ◽

Selection For

Download Full-text

Reinforcement Learning under Threats

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019939 ◽

2019 ◽

Vol 33 ◽

pp. 9939-9940 ◽

Cited By ~ 1

Author(s):

Victor Gallego ◽

Roi Naveiro ◽

David Rios Insua

Keyword(s):

Reinforcement Learning ◽

Single Agent ◽

Potential Threat ◽

Q Learning ◽

Learning Framework ◽

Opponent Modeling ◽

Theoretical Approaches ◽

New Learning ◽

Markov Decision ◽

Multi Agent

In several reinforcement learning (RL) scenarios, mainly in security settings, there may be adversaries trying to interfere with the reward generating process. However, when non-stationary environments as such are considered, Q-learning leads to suboptimal results (Busoniu, Babuska, and De Schutter 2010). Previous game-theoretical approaches to this problem have focused on modeling the whole multi-agent system as a game. Instead, we shall face the problem of prescribing decisions to a single agent (the supported decision maker, DM) against a potential threat model (the adversary). We augment the MDP to account for this threat, introducing Threatened Markov Decision Processes (TMDPs). Furthermore, we propose a level-k thinking scheme resulting in a new learning framework to deal with TMDPs. We empirically test our framework, showing the benefits of opponent modeling.

Download Full-text

Selective auditory attention detection using dynamic learning systems: The study of RNN and reinforcement learning

10.1101/2021.02.18.431748 ◽

2021 ◽

Author(s):

Masoud Geravanchizadeh ◽

Hossein Roushan

Keyword(s):

Reinforcement Learning ◽

Detection System ◽

Auditory Attention ◽

Final Decision ◽

Learning Approaches ◽

Cocktail Party ◽

Dynamic Learning ◽

Learning Stage ◽

Q Learning ◽

Markov Decision

AbstractThe cocktail party phenomenon describes the ability of the human brain to focus auditory attention on a particular stimulus while ignoring other acoustic events. Selective auditory attention detection (SAAD) is an important issue in the development of brain-computer interface systems and cocktail party processors. This paper proposes a new dynamic attention detection system to process the temporal evolution of the input signal. In the proposed dynamic system, after preprocessing of the input signals, the probabilistic state space of the system is formed. Then, in the learning stage, different dynamic learning methods, including recurrent neural network (RNN) and reinforcement learning (Markov decision process (MDP) and deep Q-learning) are applied to make the final decision as to the attended speech. Among different dynamic learning approaches, the evaluation results show that the deep Q-learning approach (MDP+RNN) provides the highest classification accuracy (94.2%) with the least detection delay. The proposed SAAD system is advantageous, in the sense that the detection of attention is performed dynamically for the sequential inputs. Also, the system has the potential to be used in scenarios, where the attention of the listener might be switched in time in the presence of various acoustic events.

Download Full-text

A Novel Deep Reinforcement Learning based Relay Selection for Broadcasting in Vehicular Ad hoc Networks

IEEE Access ◽

10.1109/access.2021.3138903 ◽

2021 ◽

pp. 1-1

Author(s):

Abir McHergui ◽

Tarek Moulahi

Keyword(s):

Reinforcement Learning ◽

Ad Hoc Networks ◽

Relay Selection ◽

Vehicular Ad Hoc Networks ◽

Ad Hoc ◽

Selection For ◽

Hoc Networks

Download Full-text

Q-Learning Based Predictive Relay Selection for Optimal Relay Beamforming

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp40776.2020.9054173 ◽

2020 ◽

Author(s):

Anastasios Dimas ◽

Konstantinos Diamantaras ◽

Athina P. Petropulu

Keyword(s):

Relay Selection ◽

Q Learning ◽

Selection For

Download Full-text

Development and validation of a reinforcement learning algorithm to dynamically optimize mechanical ventilation in critical care

npj Digital Medicine ◽

10.1038/s41746-021-00388-6 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Arne Peine ◽

Ahmed Hallawa ◽

Johannes Bickenbach ◽

Guido Dartmann ◽

Lejla Begic Fazlic ◽

...

Keyword(s):

Mechanical Ventilation ◽

Reinforcement Learning ◽

Critically Ill ◽

Critically Ill Patients ◽

High Performance ◽

Learning Algorithm ◽

Clinical Care ◽

Ideal Body Weight ◽

Markov Decision ◽

Reinforcement Learning Algorithm

AbstractThe aim of this work was to develop and evaluate the reinforcement learning algorithm VentAI, which is able to suggest a dynamically optimized mechanical ventilation regime for critically-ill patients. We built, validated and tested its performance on 11,943 events of volume-controlled mechanical ventilation derived from 61,532 distinct ICU admissions and tested it on an independent, secondary dataset (200,859 ICU stays; 25,086 mechanical ventilation events). A patient “data fingerprint” of 44 features was extracted as multidimensional time series in 4-hour time steps. We used a Markov decision process, including a reward system and a Q-learning approach, to find the optimized settings for positive end-expiratory pressure (PEEP), fraction of inspired oxygen (FiO2) and ideal body weight-adjusted tidal volume (Vt). The observed outcome was in-hospital or 90-day mortality. VentAI reached a significantly increased estimated performance return of 83.3 (primary dataset) and 84.1 (secondary dataset) compared to physicians’ standard clinical care (51.1). The number of recommended action changes per mechanically ventilated patient constantly exceeded those of the clinicians. VentAI chose 202.9% more frequently ventilation regimes with lower Vt (5–7.5 mL/kg), but 50.8% less for regimes with higher Vt (7.5–10 mL/kg). VentAI recommended 29.3% more frequently PEEP levels of 5–7 cm H2O and 53.6% more frequently PEEP levels of 7–9 cmH2O. VentAI avoided high (>55%) FiO2 values (59.8% decrease), while preferring the range of 50–55% (140.3% increase). In conclusion, VentAI provides reproducible high performance by dynamically choosing an optimized, individualized ventilation strategy and thus might be of benefit for critically ill patients.

Download Full-text

Reinforcement Learning with Non-Markovian Rewards

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5814 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3980-3987

Author(s):

Maor Gaon ◽

Ronen Brafman

Keyword(s):

Reinforcement Learning ◽

Real World ◽

Decision Process ◽

Policy Learning ◽

Learning From Experience ◽

World Model ◽

Basic Premise ◽

Q Learning ◽

Markov Decision ◽

Automata Learning

The standard RL world model is that of a Markov Decision Process (MDP). A basic premise of MDPs is that the rewards depend on the last state and action only. Yet, many real-world rewards are non-Markovian. For example, a reward for bringing coffee only if requested earlier and not yet served, is non-Markovian if the state only records current requests and deliveries. Past work considered the problem of modeling and solving MDPs with non-Markovian rewards (NMR), but we know of no principled approaches for RL with NMR. Here, we address the problem of policy learning from experience with such rewards. We describe and evaluate empirically four combinations of the classical RL algorithm Q-learning and R-max with automata learning algorithms to obtain new RL algorithms for domains with NMR. We also prove that some of these variants converge to an optimal policy in the limit.

Download Full-text