Selective network discovery via deep reinforcement learning on embedded spaces

AbstractComplex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.

Download Full-text

A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms

Neural Computation ◽

10.1162/089976699300016070 ◽

1999 ◽

Vol 11 (8) ◽

pp. 2017-2060 ◽

Cited By ~ 70

Author(s):

Csaba Szepesvári ◽

Michael L. Littman

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Learning Algorithms ◽

Sequential Decision ◽

Q Learning ◽

Markov Games ◽

Optimal Behavior ◽

Risk Sensitive ◽

Optimal Value

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

Download Full-text

Reinforcement learning using quantum Boltzmann machines

Quantum Information and Computation ◽

10.26421/qic18.1-2-3 ◽

2018 ◽

Vol 18 (1&2) ◽

pp. 51-74 ◽

Cited By ~ 2

Author(s):

Daniel Crawford ◽

Anna Levit ◽

Navid Ghadermarzy ◽

Jaspreet S. Oberoi ◽

Pooya Ronagh

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Transverse Field ◽

Boltzmann Machine ◽

Quantum Annealing ◽

Restricted Boltzmann Machines ◽

Boltzmann Machines ◽

Ising Spin ◽

Learning Tasks ◽

Deep Boltzmann Machine

We investigate whether quantum annealers with select chip layouts can outperform classical computers in reinforcement learning tasks. We associate a transverse field Ising spin Hamiltonian with a layout of qubits similar to that of a deep Boltzmann machine (DBM) and use simulated quantum annealing (SQA) to numerically simulate quantum sampling from this system. We design a reinforcement learning algorithm in which the set of visible nodes representing the states and actions of an optimal policy are the first and last layers of the deep network. In absence of a transverse field, our simulations show that DBMs are trained more effectively than restricted Boltzmann machines (RBM) with the same number of nodes. We then develop a framework for training the network as a quantum Boltzmann machine (QBM) in the presence of a significant transverse field for reinforcement learning. This method also outperforms the reinforcement learning method that uses RBMs.

Download Full-text

A Novel Adaptive Sampling Strategy for Deep Reinforcement Learning

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026821500115 ◽

2021 ◽

Vol 20 (02) ◽

pp. 2150011

Author(s):

Xingxing Liang ◽

Li Chen ◽

Yanghe Feng ◽

Zhong Liu ◽

Yang Ma ◽

...

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Adaptive Sampling ◽

Learning Algorithm ◽

Sampling Strategy ◽

Sequential Decision ◽

Fixed Temperature ◽

Sample Distribution ◽

Intelligent Decision Making ◽

Experience Replay

Reinforcement learning, as an effective method to solve complex sequential decision-making problems, plays an important role in areas such as intelligent decision-making and behavioral cognition. It is well known that the sample experience replay mechanism contributes to the development of current deep reinforcement learning by reusing past samples to improve the efficiency of samples. However, the existing priority experience replay mechanism changes the sample distribution in the sample set due to the higher sampling frequency assigned to a specific transition, and it cannot be applied to actor-critic and other on-policy reinforcement learning algorithm. To address this, we propose an adaptive factor based on TD-error, which further increases sample utilization by giving more attention weight to samples of larger TD-error, and embeds it flexibly into the original Deep Q Network and Advantage Actor-Critic algorithm to improve their performance. Then we carried out the performance evaluation for the proposed architecture in the context of CartPole-V1 and 6 environments of Atari game experiments, respectively, and the obtained results either on the conditions of fixed temperature or annealing temperature, when compared to those produced by the vanilla DQN and original A2C, highlight the advantages in cumulative rewards and climb speed of the improved algorithms.

Download Full-text

A robust policy bootstrapping algorithm for multi-objective reinforcement learning in non-stationary environments

Adaptive Behavior ◽

10.1177/1059712319869313 ◽

2019 ◽

Vol 28 (4) ◽

pp. 273-292 ◽

Cited By ~ 1

Author(s):

Sherif Abdelfattah ◽

Kathryn Kasmarik ◽

Jiankun Hu

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Markov Property ◽

Special Kind ◽

Optimization Techniques ◽

Optimization Approach ◽

Multi Objective Optimization ◽

Sequential Decision ◽

Major Drawback ◽

Multi Objective

Multi-objective Markov decision processes are a special kind of multi-objective optimization problem that involves sequential decision making while satisfying the Markov property of stochastic processes. Multi-objective reinforcement learning methods address this kind of problem by fusing the reinforcement learning paradigm with multi-objective optimization techniques. One major drawback of these methods is the lack of adaptability to non-stationary dynamics in the environment. This is because they adopt optimization procedures that assume stationarity in order to evolve a coverage set of policies that can solve the problem. This article introduces a developmental optimization approach that can evolve the policy coverage set while exploring the preference space over the defined objectives in an online manner. We propose a novel multi-objective reinforcement learning algorithm that can robustly evolve a convex coverage set of policies in an online manner in non-stationary environments. We compare the proposed algorithm with two state-of-the-art multi-objective reinforcement learning algorithms in stationary and non-stationary environments. Results showed that the proposed algorithm significantly outperforms the existing algorithms in non-stationary environments while achieving comparable results in stationary environments.

Download Full-text

Fog Fragment Cooperation on Bandwidth Management Based on Reinforcement Learning

Sensors ◽

10.3390/s20236942 ◽

2020 ◽

Vol 20 (23) ◽

pp. 6942

Author(s):

Motahareh Mobasheri ◽

Yangwoo Kim ◽

Woongsup Kim

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Fog Computing ◽

Learning Approach ◽

Data Generation ◽

Bandwidth Management ◽

Q Learning ◽

Decision Making Problem ◽

Iot Devices ◽

The Internet Of Things

The term big data has emerged in network concepts since the Internet of Things (IoT) made data generation faster through various smart environments. In contrast, bandwidth improvement has been slower; therefore, it has become a bottleneck, creating the need to solve bandwidth constraints. Over time, due to smart environment extensions and the increasing number of IoT devices, the number of fog nodes has increased. In this study, we introduce fog fragment computing in contrast to conventional fog computing. We address bandwidth management using fog nodes and their cooperation to overcome the extra required bandwidth for IoT devices with emergencies and bandwidth limitations. We formulate the decision-making problem of the fog nodes using a reinforcement learning approach and develop a Q-learning algorithm to achieve efficient decisions by forcing the fog nodes to help each other under special conditions. To the best of our knowledge, there has been no research with this objective thus far. Therefore, we compare this study with another scenario that considers a single fog node to show that our new extended method performs considerably better.

Download Full-text

Leveraging Human Guidance for Deep Reinforcement Learning Tasks

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/884 ◽

2019 ◽

Author(s):

Ruohan Zhang ◽

Faraz Torabi ◽

Lin Guan ◽

Dana H. Ballard ◽

Peter Stone

Keyword(s):

Reinforcement Learning ◽

Future Research ◽

Sequential Decision ◽

Research Directions ◽

Learning Agents ◽

Learning Tasks ◽

Future Research Directions ◽

Human Effort ◽

High Level ◽

Learning Frameworks

Reinforcement learning agents can learn to solve sequential decision tasks by interacting with the environment. Human knowledge of how to solve these tasks can be incorporated using imitation learning, where the agent learns to imitate human demonstrated decisions. However, human guidance is not limited to the demonstrations. Other types of guidance could be more suitable for certain tasks and require less human effort. This survey provides a high-level overview of five recent learning frameworks that primarily rely on human guidance other than conventional, step-by-step action demonstrations. We review the motivation, assumption, and implementation of each framework. We then discuss possible future research directions.

Download Full-text

Reward-Free Reinforcement Learning Algorithm Using Prediction Network

Fuzzy Systems and Data Mining VI - Frontiers in Artificial Intelligence and Applications ◽

10.3233/faia200744 ◽

2020 ◽

Author(s):

Zhen Yu ◽

Yimin Feng ◽

Lijun Liu

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Value Functions ◽

Learning Method ◽

Reward Function ◽

Network Training ◽

Learning Tasks ◽

Reward Value ◽

Policy Gradient ◽

Reward Functions

In general reinforcement learning tasks, the formulation of reward functions is a very important step in reinforcement learning. The reward function is not easy to formulate in a large number of systems. The network training effect is sensitive to the reward function, and different reward value functions will get different results. For a class of systems that meet specific conditions, the traditional reinforcement learning method is improved. A state quantity function is designed to replace the reward function, which is more efficient than the traditional reward function. At the same time, the predictive network link is designed so that the network can learn the value of the general state by using the special state. The overall structure of the network will be improved based on the Deep Deterministic Policy Gradient (DDPG) algorithm. Finally, the algorithm was successfully applied in the environment of FrozenLake, and achieved good performance. The experiment proves the effectiveness of the algorithm and realizes rewardless reinforcement learning in a class of systems.

Download Full-text

Deep Reinforcement Learning Versus Evolution Strategies: A Comparative Survey

10.36227/techrxiv.14679504 ◽

2021 ◽

Author(s):

Amjad Majid

Keyword(s):

Reinforcement Learning ◽

Evolution Strategies ◽

Sequential Decision ◽

Level Control ◽

Learning Tasks ◽

Agent Learning ◽

Multi Agent ◽

Comparative Survey ◽

Different Characteristics ◽

Specific Learning

<div>Deep Reinforcement Learning (DRL) has the potential to surpass human-level control in sequential decision-making problems. Evolution Strategies (ESs) have different characteristics than DRL, yet they are promoted as a scalable alternative. </div><div>To get insights into their strengths and weaknesses, in this paper, we put the two approaches side by side. After presenting the fundamental concepts and algorithms for each of the two approaches, they are compared from the perspectives of scalability, exploration, adaptation to dynamic environments, and multi-agent learning. Then, the paper discusses hybrid algorithms, combining aspects of both DRL and ESs, and how they attempt to capitalize on the benefits of both techniques. Lastly, both approaches are compared based on the set of applications they support, showing their potential for tackling real-world problems.</div><div>This paper aims to present an overview of how DRL and ESs can be used, either independently or in unison, to solve specific learning tasks. It is intended to guide researchers to select which method suits them best and provides a bird's eye view of the overall literature in the field. Further, we also provide application scenarios and open challenges. </div>

Download Full-text

Deep Reinforcement Learning for Energy Microgrids Management Considering Flexible Energy Sources

EPJ Web of Conferences ◽

10.1051/epjconf/201921701016 ◽

2019 ◽

Vol 217 ◽

pp. 01016 ◽

Cited By ~ 2

Author(s):

Nikita Tomin ◽

Alexey Zhukov ◽

Alexander Domyshev

Keyword(s):

Reinforcement Learning ◽

Electricity Consumption ◽

Energy Sources ◽

Sequential Decision ◽

Time Step ◽

Long Term Storage ◽

Decision Making Problem ◽

Short And Long Term ◽

Term Storage

The problem of optimally activating the flexible energy sources (short- and long-term storage capacities) of electricity microgrid is formulated as a sequential decision making problem under uncertainty where, at every time-step, the uncertainty comes from the lack of knowledge about future electricity consumption and weather dependent PV production. This paper proposes to address this problem using deep reinforcement learning. To this purpose, a specific deep learning architecture has been used in order to extract knowledge from past consumption and production time series as well as any available forecasts. The approach is empirically illustrated in the case of off-grid microgrids located in Belgium and Russia.

Download Full-text

The biobjective multiarmed bandit: learning approximate lexicographic optimal allocations

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES ◽

10.3906/elk-1806-221 ◽

2019 ◽

Author(s):

CEM TEKİN

Keyword(s):

Cognitive Radio Network ◽

Learning Algorithm ◽

Primary User ◽

Radio Network ◽

Sequential Decision ◽

Proposed Model ◽

Decision Making Problem ◽

Optimal Allocations ◽

User Throughput ◽

User Interference

We consider a biobjective sequential decision-making problem where an allocation (arm) is called ε lexi- cographic optimal if its expected reward in the first objective is at most ε smaller than the highest expected reward, and its expected reward in the second objective is at least the expected reward of a lexicographic optimal arm. The goal of the learner is to select arms that are ε lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. For this problem, we first show that the learner’s goal is equivalent to minimizing the ε lexicographic regret, and then, propose a learning algorithm whose ε lexicographic gap-dependent regret is bounded and gap-independent regret is sublinear in the number of rounds with high probability. Then, we apply the proposed model and algorithm for dynamic rate and channel selection in a cognitive radio network with imperfect channel sensing. Our results show that the proposed algorithm is able to learn the approximate lexicographic optimal rate–channel pair that simultaneously minimizes the primary user interference and maximizes the secondary user throughput.

Download Full-text