Multi-Task Reinforcement Learning in Humans

2019 ◽  
Author(s):  
Momchil S. Tomov ◽  
Eric Schulz ◽  
Samuel J. Gershman

The ability to transfer knowledge across tasks and generalize to novel ones is an important hallmark of human intelligence. Yet not much is known about human multi-task reinforcement learning. We study participants' behavior in a novel two-step decision-making task with multiple features and changing reward functions. We compare their behavior to two state-of-the-art algorithms for multi-task reinforcement learning, one that maps previous policies and encountered features to new reward functions and one that approximates value functions across tasks, as well as to standard model-based and model-free algorithms. Across three exploratory experiments and a large preregistered experiment, our results provide strong evidence for a strategy that maps previously learned policies to novel scenarios. These results enrich our understanding of human reinforcement learning in complex environments with changing task demands.
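
The winning account here resembles reuse of previously learned policies on new tasks. The following is a minimal, illustrative sketch of that idea (not the authors' model), assuming rewards are linear in task features so each stored policy's successor features can be re-evaluated under a new task's reward weights:

```python
import numpy as np

# Illustrative sketch of policy reuse via successor features and generalized
# policy improvement (GPI). Assumes rewards are linear in observed features,
# r(s, a) = phi(s, a) @ w_task; all names here are assumptions, not the
# authors' implementation.

def gpi_action(psi_per_policy, w_new, state_idx):
    """Pick an action by evaluating every stored policy on the new task.

    psi_per_policy: array (n_policies, n_states, n_actions, d) with successor
                    features psi_pi(s, a) learned on previous tasks.
    w_new:          reward weights (d,) describing the new task.
    state_idx:      index of the current state.
    """
    # Q-value of each old policy for each action under the new reward weights.
    q = psi_per_policy[:, state_idx] @ w_new          # (n_policies, n_actions)
    # GPI: act greedily with respect to the best old policy at this state.
    return int(np.argmax(q.max(axis=0)))

# Toy usage: two previously learned policies, 3 states, 2 actions, 4 features.
rng = np.random.default_rng(0)
psi = rng.normal(size=(2, 3, 2, 4))
w_new_task = rng.normal(size=4)
print(gpi_action(psi, w_new_task, state_idx=1))
```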

2022 ◽  
pp. 1-12
Author(s):  
Shuailong Li ◽  
Wei Zhang ◽  
Huiwen Zhang ◽  
Xin Zhang ◽  
Yuquan Leng

Model-free reinforcement learning methods have been successfully applied to practical problems such as decision making in Atari games. However, these methods have inherent shortcomings, such as high variance and low sample efficiency. To improve the policy performance and sample efficiency of model-free reinforcement learning, we propose proximal policy optimization with model-based methods (PPOMM), a fusion of model-based and model-free reinforcement learning. PPOMM considers not only information from past experience but also predictive information about the future state. PPOMM adds the information of the next state to the objective function of the proximal policy optimization (PPO) algorithm through a model-based method. The method uses two components to optimize the policy: the PPO error and the error of model-based reinforcement learning. We use the latter to optimize a latent transition model and predict the information of the next state. Evaluated across 49 Atari games in the Arcade Learning Environment (ALE), this method outperforms the state-of-the-art PPO algorithm for most games; the experimental results show that PPOMM performs as well as or better than the original algorithm in 33 of the games.
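
As a rough illustration of the fusion described above, the sketch below adds a latent transition-model prediction error to the standard PPO clipped surrogate. The weighting coefficient, tensor shapes and model interface are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a PPOMM-style objective: PPO clipped surrogate plus the
# prediction error of a latent transition model. Coefficients and interfaces
# are illustrative.

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def model_based_loss(transition_model, latent_s, action, latent_s_next):
    # Predict the next latent state and penalize the prediction error.
    pred_next = transition_model(torch.cat([latent_s, action], dim=-1))
    return F.mse_loss(pred_next, latent_s_next)

def ppomm_loss(new_logp, old_logp, advantages,
               transition_model, latent_s, action, latent_s_next,
               model_coef=0.5):
    # Combined objective: policy surrogate plus model prediction error.
    return (ppo_clip_loss(new_logp, old_logp, advantages)
            + model_coef * model_based_loss(transition_model, latent_s,
                                            action, latent_s_next))
```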


2020 ◽  
Vol 34 (04) ◽  
pp. 3316-3323
Author(s):  
Qingpeng Cai ◽  
Ling Pan ◽  
Pingzhong Tang

Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients with a finite horizon, which is myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off the variance of the value gradients against the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. We finally conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms other baselines.
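
A hedged sketch of the combination described above: a model-based value gradient obtained by backpropagating a short imagined rollout through a learned transition model, mixed with a model-free deterministic policy gradient. The rollout length, mixing weight and interfaces are illustrative assumptions, not the authors' code:

```python
import torch

# Sketch of mixing model-based deterministic value gradients with a
# model-free deterministic policy gradient. The model-free gradients
# (mf_grads) are assumed to be computed elsewhere, e.g. by a DDPG critic.

def model_based_value_gradient(policy, model, reward_fn, state, steps=3, gamma=0.99):
    # Differentiate the k-step imagined return with respect to policy params.
    ret, s = 0.0, state
    for t in range(steps):
        a = policy(s)
        ret = ret + (gamma ** t) * reward_fn(s, a)
        s = model(s, a)                      # learned transition model
    return torch.autograd.grad(ret.sum(), list(policy.parameters()),
                               allow_unused=True)

def mixed_update(mb_grads, mf_grads, params, lr=1e-3, mix=0.5):
    # Convex combination of model-based and model-free gradient estimates,
    # applied as gradient ascent on the expected return.
    with torch.no_grad():
        for p, g_mb, g_mf in zip(params, mb_grads, mf_grads):
            g = mix * (g_mb if g_mb is not None else 0.0) + (1 - mix) * g_mf
            p.add_(lr * g)
```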


2020 ◽  
Vol 34 (05) ◽  
pp. 7969-7976
Author(s):  
Junjie Hu ◽  
Yu Cheng ◽  
Zhe Gan ◽  
Jingjing Liu ◽  
Jianfeng Gao ◽  
...  

Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a natural and topically coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis to constitute a “high-quality” story to the human eye. We further propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluation demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.
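
As a rough illustration only, a composite scalar reward in the spirit described above might weight the three criteria as follows; the scoring functions and weights are placeholders, not the paper's reward definitions:

```python
# Hypothetical composite reward for policy-gradient training of a
# storytelling model: a weighted sum of relevance, coherence and
# expressiveness scores. All scoring functions are assumed to be provided.

def story_reward(sentence, image_feats, prev_sentences,
                 relevance_fn, coherence_fn, expressiveness_fn,
                 w_rel=1.0, w_coh=1.0, w_exp=1.0):
    r_rel = relevance_fn(sentence, image_feats)       # grounding in the image
    r_coh = coherence_fn(sentence, prev_sentences)    # fit with the story so far
    r_exp = expressiveness_fn(sentence)               # e.g., lexical diversity
    return w_rel * r_rel + w_coh * r_coh + w_exp * r_exp
```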


Author(s):  
Wenjie Shi ◽  
Shiji Song ◽  
Cheng Wu

Maximum entropy deep reinforcement learning (RL) methods have been demonstrated on a range of challenging continuous tasks. However, existing methods either suffer from severe instability when training on large off-policy data or cannot scale to tasks with very high state and action dimensionality, such as 3D humanoid locomotion. Moreover, the optimality of the desired Boltzmann policy set for a non-optimal soft value function is not convincingly established. In this paper, we first derive the soft policy gradient based on the entropy-regularized expected reward objective for RL with continuous actions. Then, we present an off-policy, actor-critic, model-free maximum entropy deep RL algorithm called deep soft policy gradient (DSPG), which combines the soft policy gradient with the soft Bellman equation. To ensure stable learning while eliminating the need for two separate critics for soft value functions, we leverage a double sampling approach to make the soft Bellman equation tractable. The experimental results demonstrate that our method outperforms prior off-policy methods.
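
For concreteness, a minimal sketch of the entropy-regularized (soft) Bellman backup that such maximum entropy methods build on is shown below; the temperature, network interfaces and sampling convention are assumptions, not the DSPG implementation:

```python
import torch

# Sketch of a one-step soft Bellman backup target, as used in
# maximum-entropy actor-critic methods. Interfaces are illustrative.

def soft_bellman_target(reward, next_state, done, critic, policy,
                        alpha=0.2, gamma=0.99):
    # Sample an action from the current policy at the next state.
    next_action, next_logp = policy.sample(next_state)
    # Soft value: Q minus the entropy penalty alpha * log pi(a|s).
    soft_v_next = critic(next_state, next_action) - alpha * next_logp
    # One-step soft Bellman backup target for the critic.
    return reward + gamma * (1.0 - done) * soft_v_next
```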


Author(s):  
Alberto Camacho ◽  
Rodrigo Toro Icarte ◽  
Toryn Q. Klassen ◽  
Richard Valenzano ◽  
Sheila A. McIlraith

In Reinforcement Learning (RL), an agent is guided by the rewards it receives from the reward function. Unfortunately, it may take many interactions with the environment to learn from sparse rewards, and it can be challenging to specify reward functions that reflect complex reward-worthy behavior. We propose using reward machines (RMs), which are automata-based representations that expose reward function structure, as a normal form representation for reward functions. We show how specifications of reward in various formal languages, including LTL and other regular languages, can be automatically translated into RMs, easing the burden of complex reward function specification. We then show how the exposed structure of the reward function can be exploited by tailored Q-learning algorithms and automated reward shaping techniques in order to improve the sample efficiency of reinforcement learning methods. Experiments show that these RM-tailored techniques significantly outperform state-of-the-art (deep) RL algorithms, solving problems that otherwise cannot reasonably be solved by existing approaches.
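
To make the idea concrete, here is a minimal, illustrative sketch (not the authors' code) of a reward machine as a transition table over high-level propositions, together with tabular Q-learning over the product of environment state and RM state:

```python
from collections import defaultdict

# Sketch of a reward machine (RM): a finite-state machine over high-level
# propositions that emits rewards on its transitions, paired with tabular
# Q-learning on the cross-product state (env_state, rm_state).

class RewardMachine:
    def __init__(self, transitions, initial_state=0):
        # transitions: {(rm_state, frozenset_of_true_props): (next_rm_state, reward)}
        self.transitions = transitions
        self.initial_state = initial_state

    def step(self, rm_state, true_props):
        return self.transitions.get((rm_state, frozenset(true_props)),
                                    (rm_state, 0.0))  # default: stay, no reward

def q_update(Q, env_s, rm_s, action, reward, env_s2, rm_s2, actions,
             alpha=0.1, gamma=0.9):
    # Standard Q-learning update over the product state.
    best_next = max(Q[(env_s2, rm_s2, a)] for a in actions)
    key = (env_s, rm_s, action)
    Q[key] += alpha * (reward + gamma * best_next - Q[key])

# Example RM: reward 1.0 after seeing "coffee" and then "office".
rm = RewardMachine({(0, frozenset({"coffee"})): (1, 0.0),
                    (1, frozenset({"office"})): (2, 1.0)})
Q = defaultdict(float)
```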


2020 ◽  
Vol 69 ◽  
pp. 1421-1471
Author(s):  
Aristotelis Lazaridis ◽  
Anestis Fachantidis ◽  
Ioannis Vlahavas

Deep Reinforcement Learning is a topic that has gained a lot of attention recently, due to the unprecedented achievements and remarkable performance of such algorithms in various benchmark tests and environment setups. The power of such methods comes from combining the already established and strong field of Deep Learning with the unique nature of Reinforcement Learning methods. It is, however, deemed necessary to provide a compact, accurate and comparable view of these methods and their results as a means of gaining valuable technical and practical insights. In this work we gather the essential methods related to Deep Reinforcement Learning, extracting common property structures for three complementary core categories: a) Model-Free, b) Model-Based and c) Modular algorithms. For each category, we present, analyze and compare state-of-the-art Deep Reinforcement Learning algorithms that achieve high performance in various environments and tackle challenging problems in complex and demanding tasks. In order to give a compact and practical overview of their differences, we present comprehensive comparison figures and tables, produced from the reported performances of the algorithms under two popular simulation platforms: the Arcade Learning Environment and the MuJoCo physics simulation platform. We discuss the key differences between the various kinds of algorithms, indicate their potential and limitations, and provide insights to researchers regarding future directions of the field.


Author(s):  
Zequn Wang ◽  
Narendra Patwardhan

Model-free reinforcement learning methods such as Proximal Policy Optimization or Q-learning typically require thousands of interactions with the environment to approximate the optimal controller, which may not always be feasible in robotics due to safety and time constraints. Model-based methods such as PILCO or BlackDrops, while data-efficient, provide solutions with limited robustness and complexity. To address this tradeoff, we introduce active uncertainty-reduction-based virtual environments, which are formed through limited trials conducted in the original environment. We provide an efficient method for uncertainty management, which is used as a metric for self-improvement by identifying the points with maximum expected improvement through adaptive sampling. Capturing the uncertainty also allows for better mimicking of the reward responses of the original system. Our approach enables the use of complex policy structures and reward functions through a unique combination of model-based and model-free methods, while still retaining the data efficiency. We demonstrate the validity of our method on several classic reinforcement learning problems in OpenAI Gym. We show that our approach offers a better modeling capacity for complex system dynamics than established methods.
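
As a rough sketch of the adaptive-sampling step described above, the snippet below selects the next trial point by maximum expected improvement under a surrogate that returns a predictive mean and standard deviation; the surrogate interface and acquisition details are assumptions, not the authors' method:

```python
import numpy as np
from scipy.stats import norm

# Expected improvement (EI) acquisition over a surrogate model of the
# environment's reward response. All interfaces are illustrative.

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    sigma = np.maximum(sigma, 1e-9)               # avoid division by zero
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def next_trial_point(surrogate_predict, candidates, best_so_far):
    # surrogate_predict(x) -> (mean, std) of the predicted reward at x.
    mu, sigma = zip(*(surrogate_predict(x) for x in candidates))
    ei = expected_improvement(np.array(mu), np.array(sigma), best_so_far)
    return candidates[int(np.argmax(ei))]
```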


2020 ◽  
Vol 34 (01) ◽  
pp. 1250-1257 ◽  
Author(s):  
Haoxi Zhong ◽  
Yuzhong Wang ◽  
Cunchao Tu ◽  
Tianyang Zhang ◽  
Zhiyuan Liu ◽  
...  

Legal Judgment Prediction (LJP) aims to predict judgment results according to the facts of cases. In recent years, LJP has rapidly drawn increasing attention from both academia and the legal industry, as it can provide references for legal practitioners and is expected to promote judicial justice. However, the research to date usually suffers from a lack of interpretability, which may lead to ethical issues like inconsistent judgments or gender bias. In this paper, we present QAjudge, a model based on reinforcement learning that visualizes the prediction process and gives interpretable judgments. QAjudge follows two essential principles in legal systems across the world: Presumption of Innocence and Elemental Trial. During inference, a Question Net selects questions from the given set and an Answer Net answers each question according to the fact description. Finally, a Predict Net produces judgment results based on the answers. Reward functions are designed to minimize the number of questions asked. We conduct extensive experiments on several real-world datasets. Experimental results show that QAjudge can provide interpretable judgments while maintaining comparable performance with other state-of-the-art LJP models. The code can be found at https://github.com/thunlp/QAjudge.
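
A minimal, illustrative sketch of the inference loop described above (not the released QAjudge code): a question-selection policy queries an answer model until it stops, each question incurring a small penalty, and a prediction head maps the collected answers to a judgment:

```python
# Hypothetical interfaces for the three components; the stop token,
# penalty and method names are assumptions for illustration only.

STOP = "<stop>"

def qa_judge_infer(fact_description, question_net, answer_net, predict_net,
                   question_set, max_questions=10, step_penalty=-0.1):
    answers, total_reward = {}, 0.0
    for _ in range(max_questions):
        q = question_net.select(fact_description, answers, question_set)
        if q == STOP:
            break
        answers[q] = answer_net.answer(fact_description, q)
        total_reward += step_penalty          # each question is penalized
    judgment = predict_net.predict(answers)   # e.g., applicable charges
    return judgment, answers, total_reward
```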


2020 ◽  
Vol 17 (1) ◽  
pp. 172988141989834
Author(s):  
Guoyu Zuo ◽  
Qishen Zhao ◽  
Jiahao Lu ◽  
Jiangeng Li

The goal of reinforcement learning is to enable an agent to learn by using rewards. However, some robotic tasks are naturally specified with sparse rewards, and manually shaping reward functions is a difficult undertaking. In this article, we propose a general and model-free approach for reinforcement learning to learn robotic tasks with sparse rewards. First, a variant of Hindsight Experience Replay, Curious and Aggressive Hindsight Experience Replay, is proposed to improve the sample efficiency of reinforcement learning methods and avoid the need for complicated reward engineering. Second, based on the Twin Delayed Deep Deterministic (TD3) policy gradient algorithm, demonstrations are leveraged to overcome the exploration problem and speed up the policy training process. Finally, an action loss is added to the loss function in order to suppress oscillation of the output actions while maximizing the value of the action. The experiments on simulated robotic tasks are performed with different hyperparameters to verify the effectiveness of our method. Results show that our method can effectively solve the sparse reward problem and obtain a high learning speed.
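
As an illustration of the final ingredient, the sketch below adds an action-magnitude penalty to a TD3-style actor objective; the penalty form and coefficient are assumptions rather than the article's exact loss:

```python
import torch

# Hypothetical actor objective: maximize the critic's value of the policy
# action while penalizing large actions to damp jittery outputs.

def actor_loss_with_action_penalty(actor, critic, states, action_coef=0.05):
    actions = actor(states)
    value_term = -critic(states, actions).mean()          # maximize Q
    action_term = action_coef * actions.pow(2).mean()     # penalize large actions
    return value_term + action_term
```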


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 5511-5511
Author(s):  
Kenneth H. Shain ◽  
Daniel Hart ◽  
Ariosto Siqueira Silva ◽  
Raghunandanreddy Alugubelli ◽  
Gabriel De Avila ◽  
...  

Over the last decade we have witnessed an explosion in the number of therapeutic options available to patients with multiple myeloma (MM). In spite of the marked improvements in patient outcomes paralleling these approvals, MM remains an incurable malignancy for the vast majority of patients, who follow a course of therapeutic successes and failures. As such, there remains a dire need to develop new tools to improve the management of MM patients. A number of groups are leading efforts to combine big data and artificial intelligence to better inform patient care via precision medicine. At Moffitt, in collaboration with M2Gen/ORIEN (Oncology Research Information Exchange Network), we have begun to accumulate big data in MM. Patients opt in (consent) to the collection of rich clinical data (demographics, staging, risk, complete disease-course treatment data) and, in the setting of bone marrow biopsy, to the allocation of CD138-selected cells for molecular analysis (whole exome sequencing (WES) and RNA sequencing), as well as peripheral blood mononuclear cells for WES. To date, we have collected over 1000 samples from over 800 individual patients with plasma cell disorders. In the setting of oncology, the ultimate goal of such a model will be the selection of ideal treatments. We expect that AI analysis may validate patient response to treatments and enable cohort selection, as real patient cohorts can be selected from those predicted by the model. One approach is to utilize reinforcement learning (RL). In RL, the algorithm attempts to learn which action to take in a defined state, weighing any tradeoffs, so as to maximize reward. Our initial utilization of RL involved a relatively small cohort of 402 patients with treatment medication data. This encompassed 1692 lines of treatment, with a mean of 4.21 lines of therapy per patient (median of 4 lines per patient), and included 132 combinations of 22 myeloma therapeutics. The heterogeneity in treatment is highlighted by the fact that no treatment pathways overlap after line 4. Each Q-value in the Q-table is the immediate reward for taking an action in a state plus the discounted anticipated future reward for taking that action. Iteration converges on the true values of the future reward (and can be done model-free). The end result is a policy, P(s), that specifies the ideal action in each state. There is a near-infinite number of possible states when considering treatment history, age, GEP, cytogenetics, comorbidities, staging and other factors. We presume that the action is most intuitively defined as the medication (treatment) alone and that the reward should be some form of treatment response. We have begun the iterative process of trying different state and reward functions. Median imputation shows a 5% improvement in response accuracy over listwise deletion, but median imputation throws off practical accuracy in the binary reward case. We found that the exercise has great potential, with possible improvements (e.g., multiple imputation), and that covariate analysis will need to be expanded. Combinatorics need to be considered when applying machine learning to medium-sized data sets, and model-free machine learning is limited on medium-sized data. As such, combined resources and/or utilization of large networks such as ORIEN will be critical for the successful integration of RL or other AI tools in MM. We also learned that adding variables to the model does not necessarily increase accuracy. Future work will involve continued application of alternate state/reward functions.
We plan to loosen the iQ-learning framework to allow for better covariate selection for state/reward functions, and to improve imputation techniques to include more covariates and provide more certainty in model accuracy. We may also refine the accuracy metric to allow for prediction of bucketed response and temporal disease burden (M-spike vs. time). Updated data on a larger cohort will be presented at the annual meeting. Disclosures Shain: Adaptive Biotechnologies: Consultancy; Celgene: Membership on an entity's Board of Directors or advisory committees; Bristol-Myers Squibb: Membership on an entity's Board of Directors or advisory committees; Amgen: Membership on an entity's Board of Directors or advisory committees; Takeda: Membership on an entity's Board of Directors or advisory committees; Sanofi Genzyme: Membership on an entity's Board of Directors or advisory committees; AbbVie: Research Funding; Janssen: Membership on an entity's Board of Directors or advisory committees. Dai: M2Gen: Employment. Nishihori: Novartis: Research Funding; Karyopharm: Research Funding. Brayer: Janssen: Consultancy, Speakers Bureau; BMS: Consultancy, Speakers Bureau. Alsina: Bristol-Myers Squibb: Research Funding; Janssen: Speakers Bureau; Amgen: Speakers Bureau. Baz: Celgene: Membership on an entity's Board of Directors or advisory committees, Research Funding; Karyopharm: Membership on an entity's Board of Directors or advisory committees, Research Funding; AbbVie: Research Funding; Merck: Research Funding; Sanofi: Research Funding; Bristol-Myers Squibb: Research Funding. Dalton: MILLENNIUM PHARMACEUTICALS, INC.: Honoraria.
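
For readers unfamiliar with the Q-learning machinery referenced in this abstract, here is a minimal, generic sketch of the tabular update and the resulting policy P(s); the state and action encodings are illustrative and unrelated to the actual patient data:

```python
from collections import defaultdict

# Generic tabular Q-learning: each Q-value is the immediate reward plus the
# discounted estimate of future reward, iterated until convergence; the
# resulting policy P(s) picks the highest-valued action in each state.

Q = defaultdict(float)   # Q[(state, action)] -> estimated long-term reward

def q_learning_update(state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.95):
    best_future = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_future
                                   - Q[(state, action)])

def policy(state, actions):
    # P(s): the action with the highest learned Q-value in this state.
    return max(actions, key=lambda a: Q[(state, a)])
```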

