scholarly journals Opposition-Based Reinforcement Learning

Author(s):  
Hamid R. Tizhoosh ◽  

Reinforcement learning is a machine intelligence scheme for learning in highly dynamic, probabilistic environments. By interaction with the environment, reinforcement agents learn optimal control policies, especially in the absence of a priori knowledge and/or a sufficiently large amount of training data. Despite its advantages, however, reinforcement learning suffers from a major drawback - high calculation cost because convergence to an optimal solution usually requires that all states be visited frequently to ensure that policy is reliable. This is not always possible, however, due to the complex, high-dimensional state space in many applications. This paper introduces opposition-based reinforcement learning, inspired by opposition-based learning, to speed up convergence. Considering opposite actions simultaneously enables individual states to be updated more than once shortening exploration and expediting convergence. Three versions of Q-learning algorithm will be given as examples. Experimental results for the grid world problem of different sizes demonstrate the superior performance of the proposed approach.

2021 ◽  
Vol 10 (1) ◽  
pp. 21
Author(s):  
Omar Nassef ◽  
Toktam Mahmoodi ◽  
Foivos Michelinakis ◽  
Kashif Mahmood ◽  
Ahmed Elmokashfi

This paper presents a data driven framework for performance optimisation of Narrow-Band IoT user equipment. The proposed framework is an edge micro-service that suggests one-time configurations to user equipment communicating with a base station. Suggested configurations are delivered from a Configuration Advocate, to improve energy consumption, delay, throughput or a combination of those metrics, depending on the user-end device and the application. Reinforcement learning utilising gradient descent and genetic algorithm is adopted synchronously with machine and deep learning algorithms to predict the environmental states and suggest an optimal configuration. The results highlight the adaptability of the Deep Neural Network in the prediction of intermediary environmental states, additionally the results present superior performance of the genetic reinforcement learning algorithm regarding its performance optimisation.


2020 ◽  
Vol 34 (01) ◽  
pp. 1153-1160 ◽  
Author(s):  
Xinshi Zang ◽  
Huaxiu Yao ◽  
Guanjie Zheng ◽  
Nan Xu ◽  
Kai Xu ◽  
...  

Using reinforcement learning for traffic signal control has attracted increasing interests recently. Various value-based reinforcement learning methods have been proposed to deal with this classical transportation problem and achieved better performances compared with traditional transportation methods. However, current reinforcement learning models rely on tremendous training data and computational resources, which may have bad consequences (e.g., traffic jams or accidents) in the real world. In traffic signal control, some algorithms have been proposed to empower quick learning from scratch, but little attention is paid to learning by transferring and reusing learned experience. In this paper, we propose a novel framework, named as MetaLight, to speed up the learning process in new scenarios by leveraging the knowledge learned from existing scenarios. MetaLight is a value-based meta-reinforcement learning workflow based on the representative gradient-based meta-learning algorithm (MAML), which includes periodically alternate individual-level adaptation and global-level adaptation. Moreover, MetaLight improves the-state-of-the-art reinforcement learning model FRAP in traffic signal control by optimizing its model structure and updating paradigm. The experiments on four real-world datasets show that our proposed MetaLight not only adapts more quickly and stably in new traffic scenarios, but also achieves better performance.


Author(s):  
Tsega Weldu Araya ◽  
Md Rashed Ibn Nawab ◽  
A. P. Yuan Ling

As technology overgrows, the assortment of information and the density of work becomes demanding to manage. To resolve the density of employment and human labor, machine-learning (ML) technology developed. Reinforcement learning (RL) is the recent advancement of ML studies. Multi-agent reinforcement learning (MARL) is useful to train multiple agents in the surrounding environment. The previous research studies focused on two-agent cooperation. Their data representation was held in a two-dimensional array, which is called a matrix. The limitation of this two-dimensional array appears as the training data of agents increases. The growth in the training data of agents creates storage drawbacks and data redundancy. Our first aim in this research is to improve an algorithm that can represent MARL training in tensor. In MARL, multiple agents are work together to achieve joint work. To share the training records and data of numerous agents, we need to collect the previous cumulative experience of agents in tensor. Secondly, we will discover the agent's cooperation and competition, with local and global goals of agents in MARL. Local goals are the cooperation of agents in a group or team where we use the training model as a student and teacher agent. The global goal is the competition between two contrary teams to acquire the reward. All learning agents have their Q table for storing the individual agent's training data in an environment. The growth in the number of learning agents, their training experience in Q tables, and the requirement for representing multiple data become the most challenging issue. We introduce tensor to store various data to resolve the challenges for data representation in multiple agent associations. Tensor is expressed as the three-dimensional array, although it is an N-way array, which is useful for representing and accessing numerous data. Finally, we will implement an algorithm for learning three cooperative agents against the opposed team using a tensor-based framework in the Q learning algorithm. We will provide an algorithm that can store the training records and data of multiple agents. Tensor advances to get a small storage size than the matrix for the training records of agents. Although three agent cooperation benefits to having maximum optimal reward.


2005 ◽  
Vol 24 ◽  
pp. 81-108 ◽  
Author(s):  
P. Geibel ◽  
F. Wysotzki

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.


2019 ◽  
Vol 28 (4) ◽  
pp. 273-292 ◽  
Author(s):  
Sherif Abdelfattah ◽  
Kathryn Kasmarik ◽  
Jiankun Hu

Multi-objective Markov decision processes are a special kind of multi-objective optimization problem that involves sequential decision making while satisfying the Markov property of stochastic processes. Multi-objective reinforcement learning methods address this kind of problem by fusing the reinforcement learning paradigm with multi-objective optimization techniques. One major drawback of these methods is the lack of adaptability to non-stationary dynamics in the environment. This is because they adopt optimization procedures that assume stationarity in order to evolve a coverage set of policies that can solve the problem. This article introduces a developmental optimization approach that can evolve the policy coverage set while exploring the preference space over the defined objectives in an online manner. We propose a novel multi-objective reinforcement learning algorithm that can robustly evolve a convex coverage set of policies in an online manner in non-stationary environments. We compare the proposed algorithm with two state-of-the-art multi-objective reinforcement learning algorithms in stationary and non-stationary environments. Results showed that the proposed algorithm significantly outperforms the existing algorithms in non-stationary environments while achieving comparable results in stationary environments.


Author(s):  
Imthias Ahamed T.P. ◽  
Nagendra Rao P.S. ◽  
Sastry P.S.

This paper presents the design and implementation of a learning controller for the Automatic Generation Control (AGC) in power systems based on a reinforcement learning (RL) framework. In contrast to the recent RL scheme for AGC proposed by us, the present method permits handling of power system variables such as Area Control Error (ACE) and deviations from scheduled frequency and tie-line flows as continuous variables. (In the earlier scheme, these variables have to be quantized into finitely many levels). The optimal control law is arrived at in the RL framework by making use of Q-learning strategy. Since the state variables are continuous, we propose the use of Radial Basis Function (RBF) neural networks to compute the Q-values for a given input state. Since, in this application we cannot provide training data appropriate for the standard supervised learning framework, a reinforcement learning algorithm is employed to train the RBF network. We also employ a novel exploration strategy, based on a Learning Automata algorithm, for generating training samples during Q-learning. The proposed scheme, in addition to being simple to implement, inherits all the attractive features of an RL scheme such as model independent design, flexibility in control objective specification, robustness etc. Two implementations of the proposed approach are presented. Through simulation studies the attractiveness of this approach is demonstrated.


2020 ◽  
Vol 17 (2) ◽  
pp. 619-646
Author(s):  
Chao Wang ◽  
Xing Qiu ◽  
Hui Liu ◽  
Dan Li ◽  
Kaiguang Zhao ◽  
...  

Multi-Agent system has broad application in real world, whose security performance, however, is barely considered. Reinforcement learning is one of the most important methods to resolve Multi-Agent problems. At present, certain progress has been made in applying Multi-Agent reinforcement learning to robot system, man-machine match, and automatic, etc. However, in the above area, an agent may fall into unsafe states where the agent may find it difficult to bypass obstacles, to receive information from other agents and so on. Ensuring the safety of Multi-Agent system is of great importance in the above areas where an agent may fall into dangerous states that are irreversible, causing great damage. To solve the safety problem, in this paper we introduce a Multi-Agent Cooperation Q-Learning Algorithm based on Constrained Markov Game. In this method, safety constraints are added to the set of actions, and each agent, when interacting with the environment to search for optimal values, should be restricted by the safety rules, so as to obtain an optimal policy that satisfies the security requirements. Since traditional Multi-Agent reinforcement learning algorithm is no more suitable for the proposed model in this paper, a new solution is introduced for calculating the global optimum state-action function that satisfies the safety constraints. We take advantage of the Lagrange multiplier method to determine the optimal action that can be performed in the current state based on the premise of linearizing constraint functions, under conditions that the state-action function and the constraint function are both differentiable, which not only improves the efficiency and accuracy of the algorithm, but also guarantees to obtain the global optimal solution. The experiments verify the effectiveness of the algorithm.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yunmei Yuan ◽  
Hongyu Li ◽  
Lili Ji

Nowadays, finding the optimal route for vehicles through online vehicle path planning is one of the main problems that the logistics industry needs to solve. Due to the uncertainty of the transportation system, especially the last-mile delivery problem of small packages in uncertain logistics transportation, the calculation of logistics vehicle routing planning becomes more complex than before. Most of the existing solutions are less applied to new technologies such as machine learning, and most of them use a heuristic algorithm. This kind of solution not only needs to set a lot of constraints but also requires much calculation time in the logistics network with high demand density. To design the uncertain logistics transportation path with minimum time, this paper proposes a new optimization strategy based on deep reinforcement learning that converts the uncertain online logistics routing problems into vehicle path planning problems and designs an embedded pointer network for obtaining the optimal solution. Considering the long time to solve the neural network, it is unrealistic to train parameters through supervised data. This article uses an unsupervised method to train the parameters. Because the process of parameter training is offline, this strategy can avoid the high delay. Through the simulation part, it is not difficult to see that the strategy proposed in this paper will effectively solve the uncertain logistics scheduling problem under the limited computing time, and it is significantly better than other strategies. Compared with traditional mathematical procedures, the algorithm proposed in this paper can reduce the driving distance by 60.71%. In addition, this paper also studies the impact of some key parameters on the effect of the program.


Author(s):  
Hao Ji ◽  
Yan Jin

Abstract Self-organizing systems (SOS) are able to perform complex tasks in unforeseen situations with adaptability. Previous work has introduced field-based approaches and rule-based social structuring for individual agents to not only comprehend the task situations but also take advantage of the social rule-based agent relations in order to accomplish their overall tasks without a centralized controller. Although the task fields and social rules can be predefined for relatively simple task situations, when the task complexity increases and task environment changes, having a priori knowledge about these fields and the rules may not be feasible. In this paper, we propose a multi-agent reinforcement learning based model as a design approach to solving the rule generation problem with complex SOS tasks. A deep multi-agent reinforcement learning algorithm was devised as a mechanism to train SOS agents for acquisition of the task field and social rule knowledge, and the scalability property of this learning approach was investigated with respect to the changing team sizes and environmental noises. Through a set of simulation studies on a box-pushing problem, the results have shown that the SOS design based on deep multi-agent reinforcement learning can be generalizable with different individual settings when the training starts with larger number of agents, but if a SOS is trained with smaller team sizes, the learned neural network does not scale up to larger teams. Design of SOS with a deep reinforcement learning model should keep this in mind and training should be carried out with larger team sizes.


2020 ◽  
Vol 34 (01) ◽  
pp. 710-718
Author(s):  
Nan Jiang ◽  
Sheng Jin ◽  
Zhiyao Duan ◽  
Changshui Zhang

This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Different from offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart in a sequential order. We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on previously generated context (state). The key of this algorithm is the well-functioning reward model. Instead of defining it using music composition rules, we learn this model from monophonic and polyphonic training data. This model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that this algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part. Subjective evaluations on preferences show that the proposed algorithm generates music pieces of higher quality than the baseline method.


Sign in / Sign up

Export Citation Format

Share Document