S$^{2}$ES: a stationary and scalable knowledge transfer approach for multiagent reinforcement learning

Abstract: Knowledge transfer is widely adopted to accelerate multiagent reinforcement learning (MARL). To speed up MARL for agents that learn from scratch, we propose a Stationary and Scalable knowledge transfer approach based on Experience Sharing (S$^{2}$ES). The main framework of our approach is structured into three components: what kind of experience to share, how to learn, and when to transfer. Specifically, we first design an augmented form of experience. By sharing (i.e., transmitting) the experience from one agent to its peers, the learning speed can be effectively enhanced with guaranteed scalability. A synchronized learning pattern is then adopted, which reduces the nonstationarity brought by experience replay while retaining data efficiency. Moreover, to avoid redundant transfer once the agents' policies have converged, we further design two trigger conditions, one based on a modified Q value and the other on normalized Shannon entropy, to determine when to conduct experience sharing. Empirical studies indicate that the proposed approach outperforms other knowledge transfer methods in efficacy, efficiency, and scalability. We also provide ablation experiments to demonstrate the necessity of the key ingredients.
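The entropy-based trigger mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact condition, so the function names, the threshold value, and the direction of the comparison (share while the policy is still uncertain, stop once it has become nearly deterministic) are assumptions made for illustration only.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a discrete distribution divided by log(n).

    The result lies in [0, 1]: 1 for a uniform policy, 0 for a
    deterministic one.
    """
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def should_share(action_probs, threshold=0.2):
    """Hypothetical trigger: share experience only while the policy is uncertain.

    The threshold of 0.2 is an arbitrary illustrative value, not a value
    taken from the paper.
    """
    return normalized_entropy(action_probs) > threshold
```

Normalizing by log(n) makes the same threshold usable across action spaces of different sizes, which is one plausible reason to prefer the normalized form.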

A multiagent reinforcement learning method based on the model inference of the other agents

Systems and Computers in Japan, DOI 10.1002/scj.10110, 2002, Vol 33 (12), pp. 67-76

Author(s): Yoichiro Matsuno, Tatsuya Yamazaki, Jun Matsuda, Shin Ishii

Keyword(s): Reinforcement Learning, The Other, Learning Method, Model Inference, Multiagent Reinforcement Learning

Pre-training with non-expert human demonstration for deep reinforcement learning

The Knowledge Engineering Review, DOI 10.1017/s0269888919000055, 2019, Vol 34, Cited By ~ 1

Author(s): Gabriel V. de la Cruz, Yunshu Du, Matthew E. Taylor

Keyword(s): Reinforcement Learning, Feature Learning, Feature Representation, Superior Performance, Learning Goals, Training Time, Learning Speed, Small Set, Data Efficiency, Human Demonstration

Abstract: Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using deep neural networks as function approximators to learn directly from raw input images. However, learning directly from raw images is data inefficient, because the agent must learn a feature representation of complex states in addition to learning a policy. As a result, deep RL typically suffers from slow learning speeds and often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable to real-world settings where data are expensive. In this work, we improve data efficiency in deep RL by addressing one of the two learning goals, feature learning. We leverage supervised learning to pre-train on a small set of non-expert human demonstrations and empirically evaluate our approach using the asynchronous advantage actor-critic algorithms in the Atari domain. Our results show significant improvements in learning speed, even when the provided demonstration is noisy and of low quality.

Efficient hindsight reinforcement learning using demonstrations for robotic tasks with sparse rewards

International Journal of Advanced Robotic Systems, DOI 10.1177/1729881419898342, 2020, Vol 17 (1), pp. 172988141989834

Author(s): Guoyu Zuo, Qishen Zhao, Jiahao Lu, Jiangeng Li

Keyword(s): Reinforcement Learning, Gradient Algorithm, Learning To Learn, Model Free, Learning Speed, Policy Gradient, Experience Replay, Speed Up, Reward Functions, Robotic Tasks

The goal of reinforcement learning is to enable an agent to learn by using rewards. However, some robotic tasks are naturally specified with sparse rewards, and manually shaping reward functions is difficult. In this article, we propose a general, model-free approach for reinforcement learning to learn robotic tasks with sparse rewards. First, a variant of Hindsight Experience Replay, Curious and Aggressive Hindsight Experience Replay, is proposed to improve the sample efficiency of reinforcement learning methods and avoid the need for complicated reward engineering. Second, based on the Twin Delayed Deep Deterministic policy gradient algorithm, demonstrations are leveraged to overcome the exploration problem and speed up policy training. Finally, an action loss is added to the loss function in order to minimize the vibration of the output action while maximizing the value of the action. Experiments on simulated robotic tasks are performed with different hyperparameters to verify the effectiveness of our method. Results show that our method can effectively solve the sparse reward problem and achieve a high learning speed.
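For readers unfamiliar with the underlying idea, the following is a minimal sketch of plain Hindsight Experience Replay relabeling with the "final" strategy. It is not the Curious and Aggressive variant proposed in the article, whose details the abstract does not give. The transition layout, the function names, and the 0 / -1 sparse reward are assumptions made for illustration.

```python
def sparse_reward(achieved_goal, goal):
    """Assumed sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved_goal == goal else -1.0


def relabel_with_final_goal(episode, reward_fn=sparse_reward):
    """Relabel every transition as if the last achieved goal had been the target.

    `episode` is a list of dicts with the keys "state", "action",
    "next_state" and "achieved_goal" (the goal reached after the action).
    """
    new_goal = episode[-1]["achieved_goal"]
    relabeled = []
    for t in episode:
        relabeled.append({
            "state": t["state"],
            "action": t["action"],
            "next_state": t["next_state"],
            "goal": new_goal,
            # Recompute the reward against the substituted goal.
            "reward": reward_fn(t["achieved_goal"], new_goal),
        })
    return relabeled
```

Because the substituted goal is by construction reached at the end of the episode, at least one relabeled transition carries a non-negative reward, which is what gives the learner a signal in otherwise reward-free episodes.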

Multiagent Reinforcement Learning With Sparse Interactions by Negotiation and Knowledge Transfer

IEEE Transactions on Cybernetics, DOI 10.1109/tcyb.2016.2543238, 2017, Vol 47 (5), pp. 1238-1250, Cited By ~ 17

Author(s): Luowei Zhou, Pei Yang, Chunlin Chen, Yang Gao

Keyword(s): Reinforcement Learning, Knowledge Transfer, Multiagent Reinforcement Learning

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Proceedings of the AAAI Conference on Artificial Intelligence, DOI 10.1609/aaai.v34i10.7247, 2020, Vol 34 (10), pp. 13949-13950

Author(s): Wang Qisheng, Wang Qichao, Li Xiao

Keyword(s): Reinforcement Learning, Supervised Learning, Experimental Results, State Action, Reward Function, Current State, Learning Speed, Communication Method, Experience Replay, Multi Agent

Exploration efficiency is a challenge for multi-agent reinforcement learning (MARL), because the policy learned by confederate MARL depends on the interaction among agents. The less informative reward signal also restricts the learning speed of MARL in comparison with the informative labels available in supervised learning. This paper proposes a novel communication method that helps agents focus on different exploration subareas and thereby guides MARL to explore faster. We propose a predictive network to forecast the reward of the current state-action pair and use the guidance learned by the predictive network to modify the reward function. An improved prioritized experience replay is employed to help agents take better advantage of the different knowledge learned by different agents. Experimental results demonstrate that the proposed algorithm outperforms existing methods in cooperative multi-agent environments.
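As background for the replay component, here is a minimal sketch of standard proportional prioritized experience replay sampling, where transition i is drawn with probability p_i^alpha / sum_k p_k^alpha. This is the common baseline form, not the improved version the paper proposes, and the function names and the default alpha are assumptions.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """Turn positive priorities into sampling probabilities.

    alpha = 0 gives uniform sampling; alpha = 1 samples in direct
    proportion to priority.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, batch_size, alpha=0.6, rng=random):
    """Draw buffer indices (with replacement) according to priority."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)
```

In a full implementation the priorities are usually absolute TD errors plus a small constant, and importance-sampling weights correct the bias introduced by non-uniform sampling; both are omitted here for brevity.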

Opponent portrait for multiagent reinforcement learning in competitive environment

International Journal of Intelligent Systems, DOI 10.1002/int.22594, 2021

Author(s): Yuxi Ma, Meng Shen, Yuhang Zhao, Zhao Li, Xiaoyao Tong, ...

Keyword(s): Reinforcement Learning, Competitive Environment, Multiagent Reinforcement Learning

GateRL: Automated Circuit Design Framework of CMOS Logic Gates Using Reinforcement Learning

Electronics, DOI 10.3390/electronics10091032, 2021, Vol 10 (9), pp. 1032

Author(s): Hyoungsik Nam, Young In Kim, Jina Bae, Junhee Lee

Keyword(s): Reinforcement Learning, Circuit Design, Logic Gates, Connection Matrix, Design Framework, Network Layers, Learning Speed, Fully Connected, Automated Circuit Design, Circuit Configuration

This paper proposes GateRL, an automated circuit design framework for CMOS logic gates based on reinforcement learning. Because the connection of circuit elements is subject to constraints, an action masking scheme is employed. Masking also reduces the size of the action space, which improves the learning speed. GateRL consists of an agent that produces the action and an environment that provides the state, mask, and reward. The state and reward are generated from a connection matrix that describes the current circuit configuration, and the mask is obtained from a masking matrix based on the constraints and the current connection matrix. The action is produced by a deep Q-network with 4 fully connected layers in the agent. In particular, separate replay buffers are devised for success transitions and failure transitions to expedite the training process. The proposed network is trained with 2 inputs, 1 output, 2 NMOS transistors, and 2 PMOS transistors to design all the target logic gates: buffer, inverter, AND, OR, NAND, and NOR. Consequently, GateRL outputs a one-transistor buffer, a two-transistor inverter, a two-transistor AND, a two-transistor OR, a three-transistor NAND, and a three-transistor NOR. The operation of the resulting logic gates is verified by SPICE simulation.
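The two mechanisms named in this abstract, action masking and separate replay buffers for success and failure transitions, can be sketched as follows. This is a generic illustration under stated assumptions: the class and function names, the unbounded list buffers, and the 50/50 mixing ratio are hypothetical and are not taken from the GateRL paper.

```python
import random


def masked_greedy_action(q_values, mask):
    """Return the highest-valued action among those the mask allows.

    `mask[a]` is True when action `a` satisfies the connection constraints.
    """
    best_action, best_q = None, float("-inf")
    for action, (q, valid) in enumerate(zip(q_values, mask)):
        if valid and (best_action is None or q > best_q):
            best_action, best_q = action, q
    if best_action is None:
        raise ValueError("mask excludes every action")
    return best_action


class TwoBufferReplay:
    """Keeps success and failure transitions apart and mixes them on sampling."""

    def __init__(self):
        self.success = []
        self.failure = []

    def add(self, transition, succeeded):
        (self.success if succeeded else self.failure).append(transition)

    def sample(self, batch_size, success_fraction=0.5):
        # Assumed mixing rule: take up to the requested share of successes,
        # then fill the remainder of the batch from failures.
        n_success = min(len(self.success), int(batch_size * success_fraction))
        n_failure = min(len(self.failure), batch_size - n_success)
        return (random.sample(self.success, n_success)
                + random.sample(self.failure, n_failure))
```

Keeping rare success transitions in their own buffer guarantees they appear in each batch instead of being diluted by the far more common failures, which is one plausible reading of why the separation speeds up training.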

Experience Sharing Based Memetic Transfer Learning for Multiagent Reinforcement Learning

Memetic Computing, DOI 10.1007/s12293-021-00339-4, 2021

Author(s): Tonghao Wang, Xingguang Peng, Yaochu Jin, Demin Xu

Keyword(s): Reinforcement Learning, Transfer Learning, Multiagent Reinforcement Learning

Synthetic Experiences for Accelerating DQN Performance in Discrete Non-Deterministic Environments

Algorithms, DOI 10.3390/a14080226, 2021, Vol 14 (8), pp. 226

Author(s): Wenzel Pilar von Pilchau, Anthony Stein, Jörg Hähner

Keyword(s): Reinforcement Learning, State Of The Art, Learning Algorithms, Weighted Average, Up States, Experience Replay

State-of-the-art deep reinforcement learning algorithms such as DQN and DDPG rely on a replay buffer, a concept known as Experience Replay. By default, the buffer contains only the experiences gathered during the run. We propose a method called Interpolated Experience Replay that uses stored (real) transitions to create synthetic ones to assist the learner. In this first approach to this field, we limit ourselves to discrete and non-deterministic environments and use a simple equally weighted average of the reward in combination with observed follow-up states. We demonstrate a significantly improved overall average compared with a DQN using vanilla Experience Replay on the discrete and non-deterministic FrozenLake8x8-v0 environment.
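One way to read "an equally weighted average of the reward in combination with observed follow-up states" is sketched below: for each state-action pair, average all stored rewards and pair that average with every distinct follow-up state seen so far. This is an interpretation of the abstract, not the authors' implementation, and the function name and tuple layout are assumptions. States are assumed to be hashable and sortable, as the integer states of FrozenLake are.

```python
from collections import defaultdict


def synthesize_transitions(transitions):
    """Build synthetic (state, action, mean_reward, next_state) tuples.

    `transitions` is an iterable of real (state, action, reward, next_state)
    tuples taken from the replay buffer.
    """
    rewards = defaultdict(list)
    next_states = defaultdict(set)
    for state, action, reward, next_state in transitions:
        rewards[(state, action)].append(reward)
        next_states[(state, action)].add(next_state)

    synthetic = []
    for (state, action), observed in rewards.items():
        mean_reward = sum(observed) / len(observed)  # equally weighted average
        for next_state in sorted(next_states[(state, action)]):
            synthetic.append((state, action, mean_reward, next_state))
    return synthetic
```

In a non-deterministic environment the same action from the same state can lead to several follow-up states, so the synthetic set spreads the averaged reward across all outcomes observed so far.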

A real-time HIL control system on rotary inverted pendulum hardware platform based on double deep Q-network

Measurement and Control, DOI 10.1177/00202940211000380, 2021, Vol 54 (3-4), pp. 417-428

Author(s): Yanyan Dai, KiDong Lee, SukGyu Lee

Keyword(s): Control System, Reinforcement Learning, Inverted Pendulum, Learning Algorithm, Deep Understanding, Control Engineering, Experience Replay, Real Hardware, Rotary Inverted Pendulum, Reinforcement Learning Algorithm

In real applications, the rotary inverted pendulum is known as a basic model in nonlinear control systems. Without a deep understanding of control, it is difficult to control a rotary inverted pendulum platform using classic control engineering models, as shown in section 2.1. Therefore, instead of relying on classic control theory, this paper controls the platform by training and testing a reinforcement learning algorithm. Reinforcement learning (RL) has produced many recent achievements, but there is a lack of research on quickly testing high-frequency RL algorithms in a real hardware environment. In this paper, we propose a real-time Hardware-in-the-loop (HIL) control system to train and test a deep reinforcement learning algorithm from simulation to real hardware implementation. The agent is implemented with the Double Deep Q-Network (DDQN) with prioritized experience replay, which does not require a deep understanding of classical control engineering. For the real experiment, we define 21 actions to swing up the rotary inverted pendulum and balance it smoothly. Compared with Deep Q-Network (DQN), the DDQN with prioritized experience replay removes the overestimation of the Q value and decreases the training time. Finally, this paper presents experimental results comparing classic control theory with different reinforcement learning algorithms.
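The difference between the DQN and Double DQN bootstrap targets, which underlies the overestimation claim above, can be shown in a few lines. This is a generic sketch of the two standard target formulas using plain Python lists for the next-state Q-values; the function names and the default discount factor are assumptions and nothing here is specific to the pendulum platform.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]
```

For example, with online estimates [1.0, 2.0], target estimates [5.0, 3.0], a reward of 1.0 and gamma = 0.5, the DQN target is 3.5 while the Double DQN target is 2.5, because the inflated target estimate for action 0 is never selected by the online network.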