Reinforcement learning in discrete action space applied to inverse defect design

Author(s):  
Troy Loeffler ◽  
Suvo Banik ◽  
Tarak Patra ◽  
Michael Sternberg ◽  
Subramanian Sankaranarayanan


2018 ◽
Vol 141 (2) ◽  
Author(s):  
Philip Odonkor ◽  
Kemper Lewis

The control of shared energy assets within building clusters has traditionally been confined to a discrete action space, owing in part to a computationally intractable decision space. In this work, we leverage the current state of the art in reinforcement learning (RL) for continuous control tasks, the deep deterministic policy gradient (DDPG) algorithm, to address this limitation. The goals of this paper are twofold: (i) to design an efficient charge/discharge dispatch policy for a shared battery system within a building cluster and (ii) to address the continuous-domain task of determining how much energy should be charged or discharged at each decision cycle. Experimentally, our results demonstrate the ability to exploit factors such as energy arbitrage, along with the continuous action space, toward minimizing demand peaks. The approach is shown to be computationally tractable, achieving efficient results after only 5 h of simulation. Additionally, the agent showed an ability to adapt to different building clusters, designing unique control strategies to address the energy demands of the clusters studied.
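As a rough illustration of how a DDPG-style actor produces the continuous charge/discharge decision described above, the sketch below maps a state vector to a bounded dispatch rate. The state features, network sizes, and the `max_power_kw` bound are assumptions made for illustration, not details taken from the paper.

```python
# Minimal, illustrative DDPG-style actor for a continuous charge/discharge
# decision. State features and sizes are hypothetical, not from the paper.
import torch
import torch.nn as nn

class DispatchActor(nn.Module):
    def __init__(self, state_dim=8, max_power_kw=50.0):
        super().__init__()
        self.max_power_kw = max_power_kw
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),   # output bounded in [-1, 1]
        )

    def forward(self, state):
        # Scale the tanh output to a charge (+) / discharge (-) rate in kW.
        return self.max_power_kw * self.net(state)

# Example: one decision cycle with a hypothetical 8-feature state
# (e.g., cluster demand, price, battery state of charge, time of day).
actor = DispatchActor()
state = torch.randn(1, 8)
action_kw = actor(state)   # continuous dispatch decision
```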


Author(s):  
Tan Szi Hui ◽  
Mohamad Khairi Ishak

Deep reinforcement learning (DRL), which combines reinforcement learning with artificial neural networks, allows agents to take the best possible actions to achieve their goals. Spiking neural networks (SNNs) are difficult to train because of the non-differentiable spike function of the spiking neuron. To overcome this difficulty, a Deep Q-Network (DQN) and deep Q-learning with a normalized advantage function (NAF) are proposed to interact with a custom environment. DQN is applied for the discrete action space, whereas NAF is implemented for the continuous action space. The model is trained and tested with both algorithms to validate its ability to balance the firing rates of the excitatory and inhibitory populations of spiking neurons. Training results showed that both agents were able to explore the custom environment built on the OpenAI Gym framework. The trained models for both algorithms were able to balance the excitatory and inhibitory firing rates of the spiking neurons. NAF achieved an average percentage error of 0.80% between the target and actual neuron firing rates, whereas DQN obtained 0.96%. NAF also attained the goal faster than DQN, requiring only 3 steps for the actual output neuron firing rate to reach or closely approach the target firing rate.
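The practical difference between the two algorithms named here is how the greedy action is obtained: DQN takes an argmax over a discrete action set, while NAF parametrizes the advantage as a quadratic in the action so the maximizer is available in closed form. A minimal scalar-action sketch, with all numbers as hypothetical placeholders:

```python
# Illustrative contrast between DQN and NAF action selection for a scalar
# action (e.g., an adjustment to a neuron population's input rate).
# All values here are hypothetical placeholders.
import numpy as np

# DQN: Q-values over a discrete action set; the greedy action is an argmax.
discrete_actions = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
q_values = np.array([0.1, 0.4, 0.9, 0.3, 0.2])        # from a Q-network
dqn_action = discrete_actions[np.argmax(q_values)]

# NAF: Q(s, a) = V(s) - 0.5 * p * (a - mu)^2, so the maximizer is simply mu.
v, mu, p = 0.9, 0.15, 2.0                              # network outputs
def naf_q(a):
    return v - 0.5 * p * (a - mu) ** 2

naf_action = mu                 # closed-form argmax over the continuous action
best_q = naf_q(naf_action)      # equals v, the state value, at the maximizer
```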


2019 ◽  
Vol 9 (24) ◽  
pp. 5571 ◽  
Author(s):  
Sang-Yun Shin ◽  
Yong-Won Kang ◽  
Yong-Guk Kim

Drones with obstacle avoidance capabilities have attracted much attention from researchers recently. They typically adopt either supervised learning or reinforcement learning (RL) for training their networks. The drawback of supervised learning is that labeling the massive dataset is laborious and time-consuming, whereas RL aims to overcome this problem by letting an agent learn from data gathered in its environment. The present study utilizes diverse RL methods within two categories: (1) discrete action space and (2) continuous action space. The former has an advantage in optimization for vision datasets, but its actions can lead to unnatural behavior. For the latter, we propose a U-net based segmentation model with an actor-critic network. Performance is compared between these RL algorithms in three different environments, the woodland, the block world, and the arena world, as well as in racing against human pilots. Results suggest that our best continuous algorithm easily outperformed the discrete ones and yet performed similarly to an expert pilot.
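As a loose sketch of the continuous-control setup described above, the snippet below feeds encoder features (standing in for the segmentation backbone) into separate actor and critic heads; the feature dimension and the three-component command vector are assumptions, not the paper's architecture.

```python
# Rough actor-critic head over image features, standing in for the
# segmentation-backbone setup described above. Dimensions are assumptions.
import torch
import torch.nn as nn

class DroneActorCritic(nn.Module):
    def __init__(self, feat_dim=256, action_dim=3):  # e.g. roll, pitch, throttle
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                   nn.Linear(128, action_dim), nn.Tanh())
        self.critic = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, features):
        # Continuous control commands and a state-value estimate.
        return self.actor(features), self.critic(features)

model = DroneActorCritic()
features = torch.randn(1, 256)      # from a vision encoder
action, value = model(features)
```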


Author(s):  
Yuntao Han ◽  
Qibin Zhou ◽  
Fuqing Duan

The digital curling game is a two-player zero-sum extensive game in a continuous action space. Several challenging problems remain unsolved, such as the uncertainty of strategy, searching the large game tree, and the reliance on large amounts of supervised data. In this work, we combine NFSP and KR-UCT for digital curling games, where NFSP uses two adversarial learning networks and can automatically produce supervised data, and KR-UCT handles searching the large game tree in a continuous action space. We propose two reward mechanisms to make reinforcement learning converge quickly. Experimental results validate the proposed method and show that the strategy model can reach a Nash equilibrium.
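A simplified view of how kernel regression can guide UCB-style selection over a continuous action space, in the spirit of KR-UCT: previously tried actions lend their value and visit statistics to nearby candidates through a kernel. The Gaussian kernel, bandwidth, and one-dimensional action are simplifications for illustration, not the method's exact formulation.

```python
# Simplified kernel-regression UCB score over sampled continuous actions,
# in the spirit of KR-UCT. Kernel, bandwidth, and candidates are hypothetical.
import numpy as np

def kr_ucb_scores(candidates, tried_actions, tried_values, tried_counts,
                  bandwidth=0.3, c=1.0):
    """Score each candidate action using kernel-smoothed values and counts."""
    scores = []
    total = tried_counts.sum()
    for a in candidates:
        w = np.exp(-0.5 * ((a - tried_actions) / bandwidth) ** 2)  # Gaussian kernel
        n_eff = (w * tried_counts).sum()                           # effective visits
        v_eff = (w * tried_counts * tried_values).sum() / max(n_eff, 1e-8)
        scores.append(v_eff + c * np.sqrt(np.log(total + 1) / (n_eff + 1e-8)))
    return np.array(scores)

# Example: pick the next shot parameter (1-D stand-in for a curling action).
tried = np.array([0.2, 0.5, 0.8])
values = np.array([0.1, 0.6, 0.3])
counts = np.array([4.0, 10.0, 2.0])
cands = np.linspace(0.0, 1.0, 11)
best = cands[np.argmax(kr_ucb_scores(cands, tried, values, counts))]
```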


Sensors ◽  
2020 ◽  
Vol 20 (10) ◽  
pp. 2789 ◽  
Author(s):  
Hang Qi ◽  
Hao Huang ◽  
Zhiqun Hu ◽  
Xiangming Wen ◽  
Zhaoming Lu

In order to meet the ever-increasing traffic demand of Wireless Local Area Networks (WLANs), channel bonding is introduced in IEEE 802.11 standards. Although channel bonding effectively increases the transmission rate, the wider channel reduces the number of non-overlapping channels and is more susceptible to interference. Meanwhile, the traffic load differs from one access point (AP) to another and changes significantly depending on the time of day. Therefore, the primary channel and channel bonding bandwidth should be carefully selected to meet traffic demand and guarantee the performance gain. In this paper, we propose an On-Demand Channel Bonding (O-DCB) algorithm based on Deep Reinforcement Learning (DRL) for heterogeneous WLANs, where the APs have different channel bonding capabilities, to reduce transmission delay. In this problem, the state space is continuous and the action space is discrete. However, with single-agent DRL the size of the action space increases exponentially with the number of APs, which severely affects the learning rate. To accelerate learning, Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is used to train O-DCB. Real traffic traces collected from a campus WLAN are used to train and test O-DCB. Simulation results reveal that the proposed algorithm converges well and achieves lower delay than other algorithms.
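The motivation for the multi-agent formulation can be seen with a quick back-of-the-envelope count: a single-agent learner must enumerate the joint action space, which grows exponentially with the number of APs, whereas each MADDPG actor only chooses among its own AP's options. The numbers below are illustrative, not taken from the paper.

```python
# Why a single agent struggles here: with n_aps access points and k channel
# bonding choices each, a single-agent DQN would need k**n_aps joint actions,
# while MADDPG gives each AP its own actor over just k choices.
# The numbers are illustrative placeholders.
n_aps, k = 10, 8                      # hypothetical cluster size and per-AP choices
single_agent_actions = k ** n_aps     # 1,073,741,824 joint actions
per_agent_actions = k                 # 8 actions per MADDPG actor
print(single_agent_actions, per_agent_actions)
```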


Author(s):  
Qingyuan Zheng ◽  
Duo Wang ◽  
Zhang Chen ◽  
Yiyong Sun ◽  
Bin Liang

Single-track two-wheeled robots have become an important research topic in recent years, owing to their simple structure, energy savings and ability to run on narrow roads. However, the ramp jump remains a challenging task. In this study, we propose a method to realize the ramp jump for a single-track two-wheeled robot. We present a control method that employs continuous-action reinforcement learning techniques for single-track two-wheeled robot control. We design a novel reward function for reinforcement learning, optimize the dimensions of the action space, and train the controller with the deep deterministic policy gradient algorithm. Finally, we validate the control method through simulation experiments and successfully realize the single-track two-wheeled robot ramp jump task. Simulation results show that the control method is effective and has several advantages over high-dimensional action space control, reinforcement learning control with a sparse reward function, and discrete-action reinforcement learning control.
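To illustrate the claimed advantage over sparse-reward training, the hypothetical comparison below contrasts a sparse terminal reward with a shaped one that adds dense progress and stability terms; these terms and weights are illustrative only and are not the reward function designed in the paper.

```python
# Hypothetical contrast between a sparse reward and a shaped (dense) reward
# for a jump-style task; the terms and weights below are placeholders and
# are not the paper's reward design.
def sparse_reward(landed_upright: bool) -> float:
    # Only rewards the final outcome, giving the agent little guidance.
    return 1.0 if landed_upright else 0.0

def shaped_reward(landed_upright: bool, distance_to_ramp: float,
                  tilt_angle_rad: float) -> float:
    # Adds dense progress and stability terms on top of the terminal reward.
    reward = 1.0 if landed_upright else 0.0
    reward -= 0.1 * distance_to_ramp        # progress term
    reward -= 0.5 * abs(tilt_angle_rad)     # stability term
    return reward
```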


Author(s):  
Mohammadamin Barekatain ◽  
Ryo Yonetani ◽  
Masashi Hamaya

Transfer reinforcement learning (RL) aims at improving the learning efficiency of an agent by exploiting knowledge from other source agents trained on relevant tasks. However, it remains challenging to transfer knowledge between different environmental dynamics without having access to the source environments. In this work, we explore a new challenge in transfer RL, where only a set of source policies collected under diverse unknown dynamics is available for learning a target task efficiently. To address this problem, the proposed approach, MULTI-source POLicy AggRegation (MULTIPOLAR), comprises two key techniques. We learn to aggregate the actions provided by the source policies adaptively to maximize the target task performance. Meanwhile, we learn an auxiliary network that predicts residuals around the aggregated actions, which ensures the target policy's expressiveness even when some of the source policies perform poorly. We demonstrated the effectiveness of MULTIPOLAR through an extensive experimental evaluation across six simulated environments ranging from classic control problems to challenging robotics simulations, under both continuous and discrete action spaces. The demo videos and code are available on the project webpage: https://omron-sinicx.github.io/multipolar/.
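A minimal sketch of the two techniques described above, assuming fixed source policies, a learned state-independent aggregation weight per source, and a small residual network; the layer sizes and dummy source policies are placeholders rather than the authors' implementation.

```python
# Minimal sketch of the MULTIPOLAR idea as described above: a learned
# aggregation of source-policy actions plus a residual network that preserves
# expressiveness when some sources perform poorly. Sizes are placeholders.
import torch
import torch.nn as nn

class MultipolarPolicy(nn.Module):
    def __init__(self, source_policies, state_dim, action_dim):
        super().__init__()
        self.sources = source_policies                  # frozen callables
        self.agg_weights = nn.Parameter(torch.ones(len(source_policies), action_dim))
        self.residual = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                      nn.Linear(64, action_dim))

    def forward(self, state):
        with torch.no_grad():                           # source policies are not updated
            src_actions = torch.stack([p(state) for p in self.sources], dim=0)
        aggregated = (self.agg_weights.unsqueeze(1) * src_actions).sum(dim=0)
        return aggregated + self.residual(state)        # residual around the aggregate

# Example with two dummy source policies on a 4-D state, 2-D action task.
dummy = [lambda s: torch.tanh(s[..., :2]), lambda s: torch.zeros(s.shape[0], 2)]
policy = MultipolarPolicy(dummy, state_dim=4, action_dim=2)
action = policy(torch.randn(3, 4))
```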

