Reinforcement Learning with Perturbed Rewards

2020 ◽  
Vol 34 (04) ◽  
pp. 6202-6209 ◽  
Author(s):  
Jingkang Wang ◽  
Yang Liu ◽  
Bo Li

Recent studies have shown that reinforcement learning (RL) models are vulnerable in various noisy scenarios. For instance, the observed reward channel is often subject to noise in practice (e.g., when rewards are collected through sensors), and is therefore not credible. In addition, for applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors by receiving corrupted rewards. In this paper, we consider noisy RL problems with perturbed rewards, where the perturbation can be approximated with a confusion matrix. We develop a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed. Our solution framework builds on existing RL/DRL algorithms and is the first to address the biased noisy reward setting without any assumptions on the true noise distribution (e.g., the zero-mean Gaussian assumption made in previous works). The core ideas of our solution are estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that policies trained on our estimated surrogate rewards achieve higher expected rewards and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm obtains 84.6% and 80.8% improvements in average score for five Atari games, with error rates of 10% and 30%, respectively.
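As a rough, hedged illustration of the surrogate-reward idea described above: if the reward perturbation is modeled by a confusion matrix C, where C[i][j] is the probability of observing reward level j when the true reward is level i, then an unbiased surrogate reward vector can be obtained by solving C r̂ = r, so that the expectation of the surrogate under the noisy channel recovers the true reward. The sketch below (NumPy; the variable names and the binary error-rate convention are our own assumptions, not the paper's code) shows this for a binary reward.

```python
import numpy as np

def surrogate_rewards(confusion, true_values):
    """Solve C @ r_hat = r so that, for every true reward level, the expected
    surrogate under the noisy observation channel equals the true reward
    (illustrative sketch)."""
    return np.linalg.solve(confusion, true_values)

# Binary example with true rewards {-1, +1}.
# e_minus: P(observe +1 | true -1), e_plus: P(observe -1 | true +1)  -- assumed convention.
e_minus, e_plus = 0.1, 0.3
C = np.array([[1 - e_minus, e_minus],
              [e_plus,      1 - e_plus]])
r = np.array([-1.0, 1.0])
r_hat = surrogate_rewards(C, r)     # surrogate values to use in place of observed -1 / +1
assert np.allclose(C @ r_hat, r)    # unbiasedness check
```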

2020 ◽  
Vol 34 (10) ◽  
pp. 13905-13906
Author(s):  
Rohan Saphal ◽  
Balaraman Ravindran ◽  
Dheevatsa Mudigere ◽  
Sasikanth Avancha ◽  
Bharat Kaul

Reinforcement learning algorithms are sensitive to hyper-parameters and require tuning for specific environments to achieve good performance. Ensembles of reinforcement learning models, on the other hand, are known to be much more robust and stable. However, training multiple models independently on an environment suffers from high sample complexity. We present a methodology to create multiple models from a single training instance, to be used in an ensemble, through directed perturbation of the model parameters at regular intervals. This allows training a single model that converges to several local minima during the optimization process as a result of the perturbation. By saving the model parameters at each such instance, we obtain multiple policies during training that are ensembled during evaluation. We evaluate our approach on challenging discrete and continuous control tasks and also discuss various ensembling strategies. Our framework is substantially more sample-efficient, computationally inexpensive, and outperforms state-of-the-art (SOTA) approaches.
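A minimal sketch of this single-run ensembling idea, assuming a PyTorch-style model and a generic `train_one_step` update (both hypothetical helpers, not the authors' code): at a fixed interval the current policy is saved as an ensemble member and the parameters are then perturbed in a directed way, so optimization settles into a different local minimum before the next snapshot; at evaluation time the saved policies vote on the action.

```python
import copy
import torch

def train_with_snapshots(model, env, optimizer, total_steps,
                         perturb_every=10_000, noise_scale=0.01):
    """Save the current policy at a fixed interval, then apply a directed
    parameter perturbation so optimization drifts toward a different local
    minimum before the next snapshot (illustrative sketch)."""
    snapshots = []
    for step in range(total_steps):
        train_one_step(model, env, optimizer)           # ordinary RL update (assumed helper)
        if (step + 1) % perturb_every == 0:
            snapshots.append(copy.deepcopy(model))      # ensemble member for evaluation
            with torch.no_grad():
                for p in model.parameters():            # one possible directed perturbation:
                    if p.grad is not None:              # step along the sign of the last gradient
                        p.add_(noise_scale * p.grad.sign())
    return snapshots

def ensemble_act(snapshots, obs):
    """Majority vote over the snapshot policies (one of several possible strategies)."""
    votes = [int(m.act(obs)) for m in snapshots]
    return max(set(votes), key=votes.count)
```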


Entropy ◽  
2021 ◽  
Vol 23 (9) ◽  
pp. 1133
Author(s):  
Shanzhi Gu ◽  
Mingyang Geng ◽  
Long Lan

The aim of multi-agent reinforcement learning systems is to provide interacting agents with the ability to collaboratively learn and adapt to the behavior of other agents. Typically, an agent receives private observations providing a partial view of the true state of the environment. However, in realistic settings, a harsh environment may cause one or more agents to exhibit arbitrarily faulty or malicious behavior, which may be enough to make the current coordination mechanisms fail. In this paper, we study a practical scenario of multi-agent reinforcement learning systems, considering the security issues that arise in the presence of agents with arbitrarily faulty or malicious behavior. The previous state-of-the-art work that coped with extremely noisy environments was designed on the assumption that the noise intensity in the environment was known in advance. When the noise intensity changes, the existing method has to adjust the configuration of the model to learn in the new environment, which limits its practical applications. To overcome these difficulties, we present an Attention-based Fault-Tolerant (FT-Attn) model, which can select not only correct but also relevant information for each agent at every time step in noisy environments. The multi-head attention mechanism enables the agents to learn effective communication policies through experience, concurrently with the action policies. Empirical results show that FT-Attn outperforms previous state-of-the-art methods in some extremely noisy environments in both cooperative and competitive scenarios, coming much closer to the upper-bound performance. Furthermore, FT-Attn has a more general fault-tolerance ability and does not rely on prior knowledge of the noise intensity of the environment.
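For intuition only, here is a minimal sketch of per-agent attention over the other agents' messages (PyTorch; the layer sizes, module names, and single-layer structure are our assumptions, not the FT-Attn architecture): each agent's embedding queries all agents' embeddings, so noisy or irrelevant messages can be down-weighted before feeding that agent's policy head.

```python
import torch
import torch.nn as nn

class AttnCommLayer(nn.Module):
    """Each agent attends over all agents' encoded observations, so faulty or
    irrelevant messages can receive low attention weight (illustrative sketch)."""
    def __init__(self, obs_dim, hidden_dim=64, num_heads=4):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, obs):                  # obs: (batch, n_agents, obs_dim)
        h = torch.relu(self.encode(obs))     # per-agent embeddings
        ctx, weights = self.attn(h, h, h)    # each agent queries every agent's message
        return ctx, weights                  # ctx feeds each agent's policy head
```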


2020 ◽  
Vol 34 (04) ◽  
pp. 3521-3528
Author(s):  
Minghao Chen ◽  
Shuai Zhao ◽  
Haifeng Liu ◽  
Deng Cai

Recently, remarkable progress has been made in learning transferable representations across domains. Previous work on domain adaptation is mainly based on two techniques: domain-adversarial learning and self-training. However, domain-adversarial learning only aligns feature distributions between domains and does not consider whether the target features are discriminative. Self-training, on the other hand, utilizes the model predictions to enhance the discrimination of target features, but it is unable to explicitly align the domain distributions. In order to combine the strengths of these two methods, we propose a novel method called Adversarial-Learned Loss for Domain Adaptation (ALDA). We first analyze the pseudo-label method, a typical self-training method. However, there is a gap between pseudo-labels and the ground truth, which can cause incorrect training. Thus we introduce a confusion matrix, learned adversarially in ALDA, to reduce this gap and align the feature distributions. Finally, a new loss function is automatically constructed from the learned confusion matrix and serves as the loss for unlabeled target samples. Our ALDA outperforms state-of-the-art approaches on four standard domain adaptation datasets. Our code is available at https://github.com/ZJULearning/ALDA.
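As a rough, simplified sketch of correcting pseudo-labels with a learned confusion matrix (PyTorch; this is not the exact auto-constructed ALDA loss, and the tensor shapes and normalization are our assumptions): a per-sample confusion matrix produced by the adversarial network maps the hard pseudo-label to a corrected label distribution, which then supervises the unlabeled target sample.

```python
import torch
import torch.nn.functional as F

def corrected_target_loss(logits, confusion, pseudo_labels):
    """Map hard pseudo-labels through a per-sample learned confusion matrix to a
    corrected label distribution, then use it to supervise unlabeled target
    samples (simplified sketch, not the exact ALDA loss).
    logits: (B, K) classifier outputs; confusion: (B, K, K) non-negative matrices
    from the adversarial network; pseudo_labels: (B,) argmax predictions."""
    one_hot = F.one_hot(pseudo_labels, num_classes=logits.size(1)).float()   # (B, K)
    corrected = torch.bmm(confusion, one_hot.unsqueeze(-1)).squeeze(-1)      # (B, K)
    corrected = corrected / corrected.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return -(corrected * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```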


Author(s):  
Patryk Chrabąszcz ◽  
Ilya Loshchilov ◽  
Frank Hutter

Evolution Strategies (ES) have recently been demonstrated to be a viable alternative to reinforcement learning (RL) algorithms on a set of challenging deep learning problems, including Atari games and MuJoCo humanoid locomotion benchmarks. While the ES algorithms in that work belonged to the specialized class of natural evolution strategies (which resemble approximate-gradient RL algorithms such as REINFORCE), we demonstrate that even a very basic canonical ES algorithm can achieve the same or even better performance. This success of a basic ES algorithm suggests that the state of the art can be advanced further by integrating the many advances made in the field of ES over the last decades. We also demonstrate that ES algorithms have very different performance characteristics than traditional RL algorithms: on some games they learn to exploit the environment and perform much better, while on others they can get stuck in suboptimal local minima. Combining their strengths and weaknesses with those of traditional RL algorithms is therefore likely to lead to new advances in the state of the art for solving RL problems.
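For readers unfamiliar with the canonical ES family referred to above, a minimal (mu, lambda)-ES with log-rank recombination weights can be sketched as follows (NumPy; the hyper-parameter values and function names are illustrative, not the paper's implementation):

```python
import numpy as np

def canonical_es(fitness, theta, sigma=0.05, lam=20, mu=10, iterations=100):
    """Minimal (mu, lambda)-ES with log-rank recombination weights.
    `fitness` maps a parameter vector to a scalar episode return
    (higher is better); hyper-parameters are illustrative."""
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()
    for _ in range(iterations):
        eps = np.random.randn(lam, theta.size)               # lambda perturbation directions
        returns = np.array([fitness(theta + sigma * e) for e in eps])
        top = np.argsort(returns)[::-1][:mu]                 # indices of the best mu offspring
        theta = theta + sigma * (weights[:, None] * eps[top]).sum(axis=0)
    return theta
```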


Author(s):  
Catherine Fahy

The provisions of the National Cultural Institutions Act 1997, which will establish the National Library of Ireland as an independent statutory body under a new Board, are due to be implemented in 2005. The years since the Act was passed have seen substantial increases in funding and staff numbers, albeit from a very low base. A phased building programme has delivered improved visitor and administration facilities, but crucial storage and reading room elements have been delayed. Collection development has benefited from government measures including legislation for tax credit for the donation of important material and for a Heritage Fund. A new Genealogical Service has been an outstanding success, but other substantial improvements in service are contingent on the building programme. Retrospective catalogue conversion projects have been completed for the core Irish printed collections and these catalogues are available online. A substantial amount of retrospective conversion of catalogues of other collections remains to be done. Digital projects are underway which will lead to an increased amount of material from the graphic collections coming online. A major new state-of-the-art exhibition facility opened in 2004 with the inaugural exhibition James Joyce and Ulysses at the National Library of Ireland. Progress has been made in securing conservation resources, and in preservation microfilming and reformatting programmes. The major challenges facing the Board will be to push through the building programme, to carry through digital and retrospective conversion programmes, and to secure adequate staffing and financial resources.


2020 ◽  
Vol 34 (04) ◽  
pp. 5125-5133
Author(s):  
Marlos C. Machado ◽  
Marc G. Bellemare ◽  
Michael Bowling

In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR, we introduce the substochastic successor representation (SSR) and show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games in the low sample-complexity regime.
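In the tabular case, the idea above can be sketched in a few lines (NumPy; the norm choice, constants, and function names are illustrative assumptions, not the paper's exact formulation): the SR is updated by temporal-difference learning, and the exploration bonus shrinks as the norm of the updated SR row grows, so rarely-visited states keep a large bonus.

```python
import numpy as np

def sr_exploration_step(psi, s, s_next, eta=0.1, gamma=0.99, beta=0.05):
    """Tabular sketch: TD update of the successor representation followed by an
    exploration bonus inversely proportional to the norm of the updated SR row.
    psi: (n_states, n_states) SR estimate; s, s_next: observed transition."""
    onehot = np.zeros(psi.shape[1])
    onehot[s] = 1.0
    psi[s] += eta * (onehot + gamma * psi[s_next] - psi[s])   # SR TD update
    return beta / np.linalg.norm(psi[s], 1)                   # intrinsic reward bonus
```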


Author(s):  
Mathieu Seurin ◽  
Florian Strub ◽  
Philippe Preux ◽  
Olivier Pietquin

Sparse rewards are double-edged training signals in reinforcement learning: easy to design but hard to optimize. Intrinsic motivation methods have thus been developed to alleviate the resulting exploration problem. They usually incentivize agents to look for new states through novelty signals. Yet such methods encourage exhaustive exploration of the state space rather than focusing on the environment's salient interaction opportunities. We propose a new exploration method, called Don't Do What Doesn't Matter (DoWhaM), shifting the emphasis from state novelty to states with relevant actions. While most actions consistently change the state when used (e.g., moving the agent), some actions are only effective in specific states (e.g., opening a door or grabbing an object). DoWhaM detects and rewards actions that seldom affect the environment. We evaluate DoWhaM on the procedurally-generated environment MiniGrid against state-of-the-art methods. Experiments consistently show that DoWhaM greatly reduces sample complexity, establishing a new state of the art in MiniGrid.
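One simplified way to realize "reward actions that seldom affect the environment" is sketched below (Python; the counters and the exact bonus shape are our assumptions and differ from the paper's formulation): track, for every action, how often it is used and how often it actually changes the observation, then grant a bonus when it does change the state, larger for actions that are rarely effective.

```python
from collections import defaultdict
import numpy as np

class ActionRarityBonus:
    """Per action, count uses (U) and state-changing uses (E); when the action
    changes the state, grant a bonus that is larger for actions that are rarely
    effective (simplified sketch, not the paper's exact bonus)."""
    def __init__(self):
        self.used = defaultdict(int)
        self.effective = defaultdict(int)

    def __call__(self, action, obs_before, obs_after):
        self.used[action] += 1
        if not np.array_equal(obs_before, obs_after):   # action had an effect on the state
            self.effective[action] += 1
            return 1.0 - self.effective[action] / self.used[action]  # rare effect => large bonus
        return 0.0
```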


2019 ◽  
Vol 3 (4) ◽  
pp. 743
Author(s):  
Salfilla Juliana

This research was motivated by the weakness of teachers in carrying out the learning process at Bandar Laksamana 1 Public Middle School. The purpose of the study is to improve teachers' classroom teaching skills with the help of supervision. The research was conducted at Bandar Laksamana 1 Public Middle School as classroom action research consisting of two cycles; each cycle consists of four stages: planning, implementation, observation, and reflection. The results show that at the baseline, teachers' teaching skills fell into the sufficient category, with an average score of 60.21. After improvements were made in the first cycle, the assessment of teachers' skills rose to the good category, with an average score of 75.54. In the second cycle, the assessment of teachers' teaching skills increased again to the very good category, with an average score of 85.75. Based on these results, the researchers conclude that the implementation of supervision at Bandar Laksamana 1 Public Middle School can improve teachers' teaching skills.


CounterText ◽  
2016 ◽  
Vol 2 (2) ◽  
pp. 217-235
Author(s):  
Gordon Calleja

This paper gives an insight into the design process of a game adaptation of Joy Division's Love Will Tear Us Apart (1980). It outlines the challenges faced in attempting to reconcile the diverging qualities of lyrical poetry and digital games. In so doing, the paper examines the design decisions made in every segment of the game with a particular focus on the tension between the core concerns of the lyrical work being adapted and established tenets of game design.


2019 ◽  
Vol 19 (25) ◽  
pp. 2348-2356 ◽  
Author(s):  
Neng-Zhong Xie ◽  
Jian-Xiu Li ◽  
Ri-Bo Huang

Acetoin is an important four-carbon compound that has many applications in foods, chemical synthesis, cosmetics, cigarettes, soaps, and detergents. Its stereoisomer (S)-acetoin, a high-value chiral compound, can also be used to synthesize optically active drugs, which could enhance targeting properties and reduce side effects. Recently, considerable progress has been made in the development of biotechnological routes for (S)-acetoin production. In this review, various strategies for biological (S)-acetoin production are summarized, and their constraints and possible solutions are described. Furthermore, future prospects of biological production of (S)-acetoin are discussed.

