KoGuN: Accelerating Deep Reinforcement Learning via Integrating Human Suboptimal Knowledge

Author(s):  
Peng Zhang ◽  
Jianye Hao ◽  
Weixun Wang ◽  
Hongyao Tang ◽  
Yi Ma ◽  
...  

Reinforcement learning agents usually learn from scratch, which requires a large number of interactions with the environment. This is quite different from how humans learn. When faced with a new task, humans naturally draw on common sense and prior knowledge to derive an initial policy and to guide the subsequent learning process. Although the prior knowledge may not be fully applicable to the new task, the learning process is significantly sped up, since the initial policy ensures a quick start and intermediate guidance helps avoid unnecessary exploration. Taking this inspiration, we propose the knowledge guided policy network (KoGuN), a novel framework that combines suboptimal human prior knowledge with reinforcement learning. Our framework consists of a fuzzy rule controller to represent human knowledge and a refine module to fine-tune the suboptimal prior knowledge. The proposed framework is end-to-end and can be combined with existing policy-based reinforcement learning algorithms. We conduct experiments on several control tasks. The empirical results show that our approach, which combines suboptimal human knowledge and RL, significantly improves the learning efficiency of flat RL algorithms, even with very low-performance human prior knowledge.
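To make the idea of a fuzzy rule controller followed by a refinement step more concrete, here is a minimal Python sketch. It is not KoGuN's actual architecture: the pole-balancing rules, the membership thresholds, and the names `ramp`, `rule_preferences`, and `refine` are all assumptions made for illustration, and the fixed `correction` vector merely stands in for whatever a learned refine module would output.

```python
def ramp(x, lo, hi):
    """Membership degree rising linearly from 0 at lo to 1 at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def rule_preferences(angle, angular_velocity):
    """Evaluate two hand-written fuzzy rules; return [left, right] preferences.

    Rule 1: IF angle is positive AND angular velocity is positive THEN push right.
    Rule 2: IF angle is negative AND angular velocity is negative THEN push left.
    AND is taken as the minimum of the membership degrees (an assumption).
    """
    right = min(ramp(angle, 0.0, 0.1), ramp(angular_velocity, 0.0, 1.0))
    left = min(ramp(-angle, 0.0, 0.1), ramp(-angular_velocity, 0.0, 1.0))
    return [left, right]


def refine(preferences, correction):
    """Hypothetical stand-in for a learned refinement: shift, clip, renormalize."""
    adjusted = [max(p + c, 0.0) for p, c in zip(preferences, correction)]
    total = sum(adjusted)
    if total == 0.0:
        # No rule fired and no correction applied: fall back to a uniform policy.
        return [1.0 / len(adjusted)] * len(adjusted)
    return [a / total for a in adjusted]


prefs = rule_preferences(angle=0.05, angular_velocity=0.8)
policy = refine(prefs, correction=[0.1, 0.0])
```

In this sketch the hand-written rules alone would always push right in the sampled state, while the correction term lets the final policy keep some probability on the other action, which is the sense in which suboptimal prior knowledge can be adjusted rather than followed blindly.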

2017 ◽  
Vol 1 (1) ◽  
pp. 21-42 ◽  
Author(s):  
Anestis Fachantidis ◽  
Matthew Taylor ◽  
Ioannis Vlahavas

In this article, we study the transfer learning model of action advice under a budget. We focus on reinforcement learning teachers providing action advice to heterogeneous students playing the game of Pac-Man under a limited advice budget. First, we examine several critical factors affecting advice quality in this setting, such as the average performance of the teacher, its variance and the importance of reward discounting in advising. The experiments show that the best performers are not always the best teachers and reveal the non-trivial importance of the coefficient of variation (CV) as a statistic for choosing policies that generate advice. The CV statistic relates variance to the corresponding mean. Second, the article studies policy learning for distributing advice under a budget. Whereas most methods in the relevant literature rely on heuristics for advice distribution, we formulate the problem as a learning one and propose a novel reinforcement learning algorithm capable of learning when to advise or not. The proposed algorithm is able to advise even when it does not have knowledge of the student’s intended action and needs significantly less training time compared to previous learning approaches. Finally, in this article, we argue that learning to advise under a budget is an instance of a more generic learning problem: Constrained Exploitation Reinforcement Learning.
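For reference, the coefficient of variation is the standard deviation divided by the mean. The Python sketch below computes it for two candidate teacher policies. The score lists are invented for illustration and are not results from the article, and using the population standard deviation is an assumption, since the abstract does not say which estimator the authors use.

```python
import statistics


def coefficient_of_variation(scores):
    """Return the population standard deviation divided by the mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(scores) / mean


# Hypothetical episode scores for two candidate teacher policies.
teacher_a = [90.0, 110.0, 100.0, 100.0]  # lower mean, steady
teacher_b = [20.0, 200.0, 60.0, 160.0]   # higher mean, erratic

cv_a = coefficient_of_variation(teacher_a)
cv_b = coefficient_of_variation(teacher_b)
```

Here `teacher_b` has the higher average score but also a much larger CV, which illustrates how a ranking by mean performance and a ranking by CV can disagree.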


2009 ◽  
Vol 10 (4) ◽  
pp. 329-341 ◽  
Author(s):  
Aleksandras Vytautas Rutkauskas ◽  
Tomas Ramanauskas

In this paper we propose an artificial stock market model based on the interaction of heterogeneous agents whose forward-looking behaviour is driven by a reinforcement-learning algorithm combined with an evolutionary selection mechanism. We use the model to analyse market self-regulation abilities, market efficiency and the determinants of emergent properties of the financial market. Distinctive and novel features of the model include a strong emphasis on the economic content of individual decision-making, application of the Q-learning algorithm for driving individual behaviour, and a rich market setup. In addition, a parallel version of the model is presented, which is mainly based on research into current changes in the market and on the search for newly emerged consistent patterns, and which has been used repeatedly in experiments searching for optimal decisions in various capital markets.


Author(s):  
Tsega Weldu Araya ◽  
Md Rashed Ibn Nawab ◽  
A. P. Yuan Ling

As technology grows, the variety of information and the density of work become increasingly demanding to manage. Machine learning (ML) technology was developed to reduce this workload and human labor. Reinforcement learning (RL) is a recent advancement in ML studies. Multi-agent reinforcement learning (MARL) is useful for training multiple agents in their surrounding environment. Previous research studies focused on two-agent cooperation. Their data were represented in a two-dimensional array, called a matrix. The limitation of this two-dimensional array appears as the training data of the agents increases: the growth in training data creates storage drawbacks and data redundancy. Our first aim in this research is to improve an algorithm so that it can represent MARL training in a tensor. In MARL, multiple agents work together to accomplish a joint task. To share the training records and data of numerous agents, we need to collect the previous cumulative experience of the agents in a tensor. Secondly, we will explore the agents' cooperation and competition, with the local and global goals of agents in MARL. Local goals are the cooperation of agents in a group or team, where we use the training model as a student and teacher agent. The global goal is the competition between two opposing teams to acquire the reward. Each learning agent has its own Q table for storing that agent's training data in an environment. The growth in the number of learning agents, their training experience in Q tables, and the requirement for representing multiple data become the most challenging issue. We introduce a tensor to store the various data and so resolve the challenges of data representation in multi-agent settings. A tensor is expressed here as a three-dimensional array, although in general it is an N-way array, which is useful for representing and accessing numerous data. Finally, we will implement an algorithm for training three cooperative agents against the opposing team using a tensor-based framework in the Q-learning algorithm. We will provide an algorithm that can store the training records and data of multiple agents. The tensor achieves a smaller storage size than the matrix for the training records of agents. In addition, three-agent cooperation helps to obtain the maximum reward.
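One way to picture per-agent Q tables collected into a single three-way array is the Python sketch below. It is only an illustration under stated assumptions: the `[agent][state][action]` layout, the table sizes, and the `q_update` helper are hypothetical and not taken from the paper, and the update shown is the standard one-step Q-learning rule rather than the authors' specific algorithm.

```python
NUM_AGENTS, NUM_STATES, NUM_ACTIONS = 3, 4, 2  # hypothetical sizes

# Q[agent][state][action]: one three-way array instead of three separate tables.
Q = [[[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)] for _ in range(NUM_AGENTS)]


def q_update(Q, agent, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply the standard one-step Q-learning update to one agent's slice."""
    best_next = max(Q[agent][next_state])
    td_target = reward + gamma * best_next
    Q[agent][state][action] += alpha * (td_target - Q[agent][state][action])
    return Q[agent][state][action]


# Two updates for agent 0; the slices for agents 1 and 2 are left untouched.
first = q_update(Q, agent=0, state=1, action=0, reward=1.0, next_state=2)
second = q_update(Q, agent=0, state=0, action=1, reward=0.0, next_state=1)
```

Indexing by agent first means a teammate's experience can be read with the same access pattern as one's own, for example `Q[1][state]`, which is the kind of shared access a tensor representation is meant to simplify.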


2002 ◽  
Vol 16 ◽  
pp. 59-104 ◽  
Author(s):  
C. Drummond

This paper discusses a system that accelerates reinforcement learning by using transfer from related tasks. Without such transfer, even if two tasks are very similar at some abstract level, an extensive re-learning effort is required. The system achieves much of its power by transferring parts of previously learned solutions rather than a single complete solution. The system exploits strong features in the multi-dimensional function produced by reinforcement learning in solving a particular task. These features are stable and easy to recognize early in the learning process. They generate a partitioning of the state space and thus the function. The partition is represented as a graph. This is used to index and compose functions stored in a case base to form a close approximation to the solution of the new task. Experiments demonstrate that function composition often produces more than an order of magnitude increase in learning rate compared to a basic reinforcement learning algorithm.


Author(s):  
Akira Notsu ◽  
Yuichi Hattori ◽  
Seiki Ubukata ◽  
Katsuhiro Honda ◽  
...  

In reinforcement learning, agents can learn appropriate actions for each situation based on the consequences of these actions after interacting with the environment. Reinforcement learning is compatible with self-organizing maps, which accomplish unsupervised learning by reacting to impulses and strengthening neurons. Therefore, numerous studies have investigated reinforcement learning in which agents learn the state space using self-organizing maps. In this study, with the aim of applying these previous studies to transfer learning and to the visualization of the human learning process, we introduced self-organizing maps into reinforcement learning and attempted to make their “state and action” learning process visible. We performed numerical experiments with the 2D goal-search problem; our model visualized the learning process of the agent.


2021 ◽  
Vol 11 (3) ◽  
pp. 1131
Author(s):  
Liwei Hou ◽  
Hengsheng Wang ◽  
Haoran Zou ◽  
Qun Wang

Autonomous learning of robotic skills seems to be more natural and more practical than engineered skills, analogous to the learning process of human individuals. Policy gradient methods are a type of reinforcement learning technique that has great potential for solving robot skill learning problems. However, policy gradient methods require too many online interactions between the robot and the environment to learn a good policy, which means lower efficiency of the learning process and a higher likelihood of damage to both the robot and the environment. In this paper, we propose a two-phase (imitation phase and practice phase) framework for efficient learning of robot walking skills, in which we attend to both the quality of skill learning and sample efficiency. Training starts with the first stage, the imitation phase, in which the parameters of the policy network are updated in a supervised learning manner. The training set used for policy network learning is composed of the experienced trajectories output by the iterative linear Gaussian controller. This paper also refers to these trajectories as near-optimal experiences. In the second stage, the practice phase, the experiences for policy network learning are collected directly from online interactions, and the policy network parameters are updated with model-free reinforcement learning. The experiences from both stages are stored in the weighted replay buffer, where they are ordered according to the experience scoring algorithm proposed in this paper. The proposed framework is tested on a biped robot walking task in a MATLAB simulation environment. The results show that the sample efficiency of the proposed framework is much higher than that of ordinary policy gradient algorithms. The algorithm proposed in this paper achieved the highest cumulative reward, and the robot learned better walking skills autonomously. In addition, the weighted replay buffer method can serve as a general module for other model-free reinforcement learning algorithms. Our framework provides a new way to combine model-based reinforcement learning with model-free reinforcement learning to efficiently update the policy network parameters in the process of robot skills learning.
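As a rough illustration of a replay buffer that orders experiences by score, consider the Python sketch below. The abstract does not describe the scoring algorithm or the sampling scheme, so everything here is an assumption: the `WeightedReplayBuffer` class, the eviction of the lowest-scoring entry when full, sampling in proportion to score, and the example scores are all hypothetical. Scores are assumed to be positive.

```python
import heapq
import random


class WeightedReplayBuffer:
    """Bounded buffer that keeps the highest-scoring experiences."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (score, insertion_index, experience)
        self._count = 0   # tie-breaker so experiences are never compared

    def add(self, experience, score):
        entry = (score, self._count, experience)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then drop whichever entry now scores lowest.
            heapq.heappushpop(self._heap, entry)

    def ordered(self):
        """Return stored experiences from highest to lowest score."""
        return [e for _, _, e in sorted(self._heap, reverse=True)]

    def sample(self, k, rng=random):
        """Draw k experiences with probability proportional to score."""
        scores = [s for s, _, _ in self._heap]
        items = [e for _, _, e in self._heap]
        return rng.choices(items, weights=scores, k=k)


buffer = WeightedReplayBuffer(capacity=3)
for name, score in [("imitation-1", 0.9), ("practice-1", 0.2),
                    ("practice-2", 0.6), ("practice-3", 0.4)]:
    buffer.add(name, score)
```

Because the buffer only sees an experience and its score, the same structure could hold trajectories from either training phase, which is consistent with the abstract's remark that the buffer can act as a general module.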


2008 ◽  
Vol 18 (1) ◽  
pp. 23-27 ◽  
Author(s):  
Hamid Boubertakh ◽  
Mohamed Tadjine ◽  
Pierre-Yves Glorennec ◽  
Salim Labiod

This paper proposes a new fuzzy logic-based navigation method for a mobile robot moving in an unknown environment. This method allows the robot to avoid obstacles and seek its goal without getting stuck in local minima. A simple fuzzy controller is constructed based on human sense, and a fuzzy reinforcement learning algorithm is used to fine-tune the parameters of the fuzzy rule base. The advantages of the proposed method are its simplicity, its ease of implementation for industrial applications, and the fact that the robot reaches its objective despite the complexity of the environment. Some simulation results of the proposed method and a comparison with previous works are provided.


Author(s):  
Sam Hamzeloo ◽  
Mansoor Zolghadri Jahromi

We present a new incremental fuzzy reinforcement learning algorithm to find a sub-optimal policy for infinite-horizon Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). The algorithm addresses the high computational complexity of solving large Dec-POMDPs by generating a compact fuzzy rule-base for each agent. In our method, each agent uses its own fuzzy rule-base to make the decisions. The fuzzy rules in these rule-bases are incrementally created and tuned according to experiences of the agents. Reinforcement learning is used to tune the behavior of each agent in such a way that maximum global reward is achieved. In addition, we propose a method to construct the initial rule-base for each agent using the solution of the underlying MDP. This drastically improves the performance of the algorithm in comparison with random initialization of the rule-base. We assess the performance of our proposed method using several benchmark problems in comparison with some state-of-the-art methods. Experimental results show that our algorithm achieves better or similar reward when compared with other methods. However, from the runtime point of view, our method is superior to all previous methods. Using a compact fuzzy rule-base not only decreases the amount of memory used but also significantly speeds up the learning phase.


2020 ◽  
Vol 2020 ◽  
pp. 1-17
Author(s):  
Zhuang Wang ◽  
Hui Li ◽  
Haolin Wu ◽  
Zhaoxin Wu

In a one-on-one air combat game, the opponent’s maneuver strategy is usually not deterministic, so a variety of opponent strategies must be considered when designing our own maneuver strategy. In this paper, an alternate freeze game framework based on deep reinforcement learning is proposed to generate the maneuver strategy in an air combat pursuit. The maneuver strategy agents for aircraft guidance of both sides are designed for a one-on-one air combat scenario at a single flight level with fixed velocity. Middleware that connects the agents and the air combat simulation software is developed to provide a reinforcement learning environment for agent training. A reward shaping approach is used, which increases the training speed and improves the performance of the generated trajectory. Agents are trained by alternate freeze games with a deep reinforcement learning algorithm to deal with nonstationarity. A league system is adopted to avoid the red queen effect in the game where both sides implement adaptive strategies. Simulation results show that the proposed approach can be applied to maneuver guidance in air combat, and typical angle fight tactics can be learnt by the deep reinforcement learning agents. When training against an opponent with the adaptive strategy, the winning rate can reach more than 50%, and the losing rate can be reduced to less than 15%. In a competition with all opponents, the winning rate of the strategic agent selected by the league system is more than 44%, and the probability of not losing is about 75%.

