Multi-Agent Distributed Deep Deterministic Policy Gradient for Partially Observable Tracking

In many existing multi-agent reinforcement learning tasks, each agent observes all the other agents from its own perspective. In addition, the training process is centralized, namely the critic of each agent can access the policies of all the agents. This scheme has certain limitations since every single agent can only obtain the information of its neighbor agents due to the communication range in practical applications. Therefore, in this paper, a multi-agent distributed deep deterministic policy gradient (MAD3PG) approach is presented with decentralized actors and distributed critics to realize multi-agent distributed tracking. The distinguishing feature of the proposed framework is that we adopted the multi-agent distributed training with decentralized execution, where each critic only takes the agent’s and the neighbor agents’ policies into account. Experiments were conducted in the distributed tracking tasks based on multi-agent particle environments where N(N=3,N=5) agents track a target agent with partial observation. The results showed that the proposed method achieves a higher reward with a shorter training time compared to other methods, including MADDPG, DDPG, PPO, and DQN. The proposed novel method leads to a more efficient and effective multi-agent tracking.

Download Full-text

Exploring communication protocols and centralized critics in multi-agent deep learning

Integrated Computer-Aided Engineering ◽

10.3233/ica-200631 ◽

2020 ◽

Vol 27 (4) ◽

pp. 333-351

Author(s):

David Simões ◽

Nuno Lau ◽

Luís Paulo Reis

Keyword(s):

Deep Learning ◽

Human Performance ◽

Learning Algorithm ◽

Single Agent ◽

Communication Protocols ◽

Relevant Information ◽

Team Members ◽

Deep Learning Algorithm ◽

Multi Agent ◽

Partially Observable

Tackling multi-agent environments where each agent has a local limited observation of the global state is a non-trivial task that often requires hand-tuned solutions. A team of agents coordinating in such scenarios must handle the complex underlying environment, while each agent only has partial knowledge about the environment. Deep reinforcement learning has been shown to achieve super-human performance in single-agent environments, and has since been adapted to the multi-agent paradigm. This paper proposes A3C3, a multi-agent deep learning algorithm, where agents are evaluated by a centralized referee during the learning phase, but remain independent from each other in actual execution. This referee’s neural network is augmented with a permutation invariance architecture to increase its scalability to large teams. A3C3 also allows agents to learn communication protocols with which agents share relevant information to their team members, allowing them to overcome their limited knowledge, and achieve coordination. A3C3 and its permutation invariant augmentation is evaluated in multiple multi-agent test-beds, which include partially-observable scenarios, swarm environments, and complex 3D soccer simulations.

Download Full-text

Improving Policies via Search in Cooperative Partially Observable Games

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6208 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7187-7194

Author(s):

Adam Lerer ◽

Hengyuan Hu ◽

Jakob Foerster ◽

Noam Brown

Keyword(s):

Social Interactions ◽

State Of The Art ◽

Common Knowledge ◽

Single Agent ◽

Approximation Error ◽

Original Performance ◽

Search Technique ◽

Multi Agent ◽

Zero Sum ◽

Partially Observable

Recent superhuman results in games have largely been achieved in a variety of zero-sum settings, such as Go and Poker, in which agents need to compete against others. However, just like humans, real-world AI systems have to coordinate and communicate with other agents in cooperative partially observable environments as well. These settings commonly require participants to both interpret the actions of others and to act in a way that is informative when being interpreted. Those abilities are typically summarized as theory of mind and are seen as crucial for social interactions. In this paper we propose two different search techniques that can be applied to improve an arbitrary agreed-upon policy in a cooperative partially observable game. The first one, single-agent search, effectively converts the problem into a single agent setting by making all but one of the agents play according to the agreed-upon policy. In contrast, in multi-agent search all agents carry out the same common-knowledge search procedure whenever doing so is computationally feasible, and fall back to playing according to the agreed-upon policy otherwise. We prove that these search procedures are theoretically guaranteed to at least maintain the original performance of the agreed-upon policy (up to a bounded approximation error). In the benchmark challenge problem of Hanabi, our search technique greatly improves the performance of every agent we tested and when applied to a policy trained using RL achieves a new state-of-the-art score of 24.61 / 25 in the game, compared to a previous-best of 24.08 / 25.

Download Full-text

SMIX(λ): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6223 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7301-7308

Author(s):

Chao Wen ◽

Xinghu Yao ◽

Yuhui Wang ◽

Xiaoyang Tan

Keyword(s):

Reinforcement Learning ◽

Single Agent ◽

Value Functions ◽

Type Method ◽

Agent Learning ◽

Experience Replay ◽

Multi Agent ◽

Decentralized Execution ◽

Single Agent Learning

This work presents a sample efficient and effective value-based method, named SMIX(λ), for reinforcement learning in multi-agent environments (MARL) within the paradigm of centralized training with decentralized execution (CTDE), in which learning a stable and generalizable centralized value function (CVF) is crucial. To achieve this, our method carefully combines different elements, including 1) removing the unrealistic centralized greedy assumption during the learning phase, 2) using the λ-return to balance the trade-off between bias and variance and to deal with the environment's non-Markovian property, and 3) adopting an experience-replay style off-policy training. Interestingly, it is revealed that there exists inherent connection between SMIX(λ) and previous off-policy Q(λ) approach for single-agent learning. Experiments on the StarCraft Multi-Agent Challenge (SMAC) benchmark show that the proposed SMIX(λ) algorithm outperforms several state-of-the-art MARL methods by a large margin, and that it can be used as a general tool to improve the overall performance of a CTDE-type method by enhancing the evaluation quality of its CVF. We open-source our code at: https://github.com/chaovven/SMIX.

Download Full-text

Message-Dropout: An Efficient Training Method for Multi-Agent Deep Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016079 ◽

2019 ◽

Vol 33 ◽

pp. 6079-6086 ◽

Cited By ~ 1

Author(s):

Woojun Kim ◽

Myungsik Cho ◽

Youngchul Sung

Keyword(s):

Reinforcement Learning ◽

Dropout Rate ◽

Learning Performance ◽

Multi Agent Systems ◽

Message Communication ◽

Execution Phase ◽

New Learning ◽

Policy Gradient ◽

Multi Agent ◽

Decentralized Execution

In this paper, we propose a new learning technique named message-dropout to improve the performance for multi-agent deep reinforcement learning under two application scenarios: 1) classical multi-agent reinforcement learning with direct message communication among agents and 2) centralized training with decentralized execution. In the first application scenario of multi-agent systems in which direct message communication among agents is allowed, the messagedropout technique drops out the received messages from other agents in a block-wise manner with a certain probability in the training phase and compensates for this effect by multiplying the weights of the dropped-out block units with a correction probability. The applied message-dropout technique effectively handles the increased input dimension in multi-agent reinforcement learning with communication and makes learning robust against communication errors in the execution phase. In the second application scenario of centralized training with decentralized execution, we particularly consider the application of the proposed messagedropout to Multi-Agent Deep Deterministic Policy Gradient (MADDPG), which uses a centralized critic to train a decentralized actor for each agent. We evaluate the proposed message-dropout technique for several games, and numerical results show that the proposed message-dropout technique with proper dropout rate improves the reinforcement learning performance significantly in terms of the training speed and the steady-state performance in the execution phase.

Download Full-text

A Multi-Agent Reinforcement Learning Approach to Price and Comfort Optimization in HVAC-Systems

Energies ◽

10.3390/en14227491 ◽

2021 ◽

Vol 14 (22) ◽

pp. 7491

Author(s):

Christian Blad ◽

Simon Bøgh ◽

Carsten Kallesøe

Keyword(s):

Reinforcement Learning ◽

Control Method ◽

Single Agent ◽

Electric Energy ◽

Heating System ◽

Hvac Systems ◽

Hvac System ◽

Data Set ◽

Training Time ◽

Multi Agent

This paper addresses the challenge of minimizing training time for the control of Heating, Ventilation, and Air-conditioning (HVAC) systems with online Reinforcement Learning (RL). This is done by developing a novel approach to Multi-Agent Reinforcement Learning (MARL) to HVAC systems. In this paper, the environment formed by the HVAC system is formulated as a Markov Game (MG) in a general sum setting. The MARL algorithm is designed in a decentralized structure, where only relevant states are shared between agents, and actions are shared in a sequence, which are sensible from a system’s point of view. The simulation environment is a domestic house located in Denmark and designed to resemble an average house. The heat source in the house is an air-to-water heat pump, and the HVAC system is an Underfloor Heating system (UFH). The house is subjected to weather changes from a data set collected in Copenhagen in 2006, spanning the entire year except for June, July, and August, where heat is not required. It is shown that: (1) When comparing Single Agent Reinforcement Learning (SARL) and MARL, training time can be reduced by 70% for a four temperature-zone UFH system, (2) the agent can learn and generalize over seasons, (3) the cost of heating can be reduced by 19% or the equivalent to 750 kWh of electric energy per year for an average Danish domestic house compared to a traditional control method, and (4) oscillations in the room temperature can be reduced by 40% when comparing the RL control methods with a traditional control method.

Download Full-text

Multi-Agent Reinforcement Learning: A Review of Challenges and Applications

Applied Sciences ◽

10.3390/app11114948 ◽

2021 ◽

Vol 11 (11) ◽

pp. 4948

Author(s):

Lorenzo Canese ◽

Gian Carlo Cardarilli ◽

Luca Di Di Nunzio ◽

Rocco Fazzolari ◽

Daniele Giardino ◽

...

Keyword(s):

Reinforcement Learning ◽

Mathematical Models ◽

Learning Algorithms ◽

Single Agent ◽

Critical Issues ◽

Multi Agent ◽

Pros And Cons ◽

Application Fields

In this review, we present an analysis of the most used multi-agent reinforcement learning algorithms. Starting with the single-agent reinforcement learning algorithms, we focus on the most critical issues that must be taken into account in their extension to multi-agent scenarios. The analyzed algorithms were grouped according to their features. We present a detailed taxonomy of the main multi-agent approaches proposed in the literature, focusing on their related mathematical models. For each algorithm, we describe the possible application fields, while pointing out its pros and cons. The described multi-agent algorithms are compared in terms of the most important characteristics for multi-agent reinforcement learning applications—namely, nonstationarity, scalability, and observability. We also describe the most common benchmark environments used to evaluate the performances of the considered methods.

Download Full-text

A projective simulation scheme for partially observable multi-agent systems

Quantum Machine Intelligence ◽

10.1007/s42484-021-00037-x ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Rasoul Kheiri

Keyword(s):

Multi Agent Systems ◽

Agent Systems ◽

Multi Agent ◽

Partially Observable ◽

Simulation Scheme

Download Full-text

Safe option-critic: learning safety in the option-critic architecture

The Knowledge Engineering Review ◽

10.1017/s0269888921000035 ◽

2021 ◽

Vol 36 ◽

Author(s):

Arushi Jain ◽

Khimya Khetarpal ◽

Doina Precup

Keyword(s):

Model Uncertainty ◽

Gradient Algorithm ◽

Intrinsic Variability ◽

Expected Return ◽

Practical Applications ◽

Hierarchical Reinforcement Learning ◽

Continuous State ◽

End Conditions ◽

Policy Gradient ◽

High Uncertainty

Abstract Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent’s decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions which allow an agent to use sub-policies with start and end conditions. We consider a behaviour as safe that avoids regions of state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty in the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.

Download Full-text

Hierarchical Reinforcement Learning

ACM Computing Surveys ◽

10.1145/3453160 ◽

2021 ◽

Vol 54 (5) ◽

pp. 1-35

Author(s):

Shubham Pateria ◽

Budhitama Subagdja ◽

Ah-hwee Tan ◽

Chai Quek

Keyword(s):

Reinforcement Learning ◽

Future Research ◽

Comprehensive Overview ◽

Open Problems ◽

Practical Applications ◽

Hierarchical Reinforcement Learning ◽

The Past ◽

Agent Learning ◽

Multi Agent ◽

Supplementary Material

Hierarchical Reinforcement Learning (HRL) enables autonomous decomposition of challenging long-horizon decision-making tasks into simpler subtasks. During the past years, the landscape of HRL research has grown profoundly, resulting in copious approaches. A comprehensive overview of this vast landscape is necessary to study HRL in an organized manner. We provide a survey of the diverse HRL approaches concerning the challenges of learning hierarchical policies, subtask discovery, transfer learning, and multi-agent learning using HRL. The survey is presented according to a novel taxonomy of the approaches. Based on the survey, a set of important open problems is proposed to motivate the future research in HRL. Furthermore, we outline a few suitable task domains for evaluating the HRL approaches and a few interesting examples of the practical applications of HRL in the Supplementary Material.

Download Full-text