Generation of Policy-Level Explanations for Reinforcement Learning

Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, $O(|F|^2 |tr\_samples|)$. By applying our method to a family of domains, we show that our method scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains.
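As an illustration of the representation described in this abstract (a Markov chain over abstract states), the following minimal sketch estimates transition probabilities between abstract states from observed transitions. It is not the paper's generation method, which relies on a learned value function to decide how states are grouped; here the grouping function `abstract` and the name `build_abstract_chain` are hypothetical and supplied only for illustration.

```python
from collections import Counter, defaultdict


def build_abstract_chain(transitions, abstract):
    """Estimate a Markov chain over abstract states.

    transitions: iterable of (state, next_state) pairs.
    abstract: hypothetical function mapping a concrete state to an
              abstract-state label (assumed given, not learned here).
    Returns {abstract_state: {next_abstract_state: probability}}.
    """
    counts = defaultdict(Counter)
    for state, next_state in transitions:
        counts[abstract(state)][abstract(next_state)] += 1

    chain = {}
    for source, row in counts.items():
        total = sum(row.values())
        # Normalize counts into empirical transition probabilities.
        chain[source] = {target: n / total for target, n in row.items()}
    return chain


# Toy data: integer states, grouped into two abstract states by a threshold.
transitions = [(0, 1), (1, 2), (2, 3), (3, 3), (0, 2)]
chain = build_abstract_chain(
    transitions, abstract=lambda s: "low" if s < 2 else "high"
)
```

In this toy run, `chain["low"]` gives the estimated probabilities of staying in "low" or moving to "high", which is the kind of expected-future-transition summary the abstract refers to.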


Solving flow-shop scheduling problem with a reinforcement learning algorithm that generalizes the value function with neural network

Alexandria Engineering Journal ◽

10.1016/j.aej.2021.01.030 ◽

2021 ◽

Vol 60 (3) ◽

pp. 2787-2800

Author(s):

Jianfeng Ren ◽

Chunming Ye ◽

Feng Yang

Keyword(s):

Neural Network ◽

Reinforcement Learning ◽

Value Function ◽

Flow Shop ◽

Learning Algorithm ◽

Flow Shop Scheduling ◽

Scheduling Problem ◽

Shop Scheduling ◽

The Value Function ◽

Reinforcement Learning Algorithm


Diversity oriented Deep Reinforcement Learning for targeted molecule generation

Journal of Cheminformatics ◽

10.1186/s13321-021-00498-z ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Tiago Pereira ◽

Maryam Abbasi ◽

Bernardete Ribeiro ◽

Joel P. Arrais

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Reinforcement Learning ◽

Deep Neural Networks ◽

Chemical Space ◽

Biological Properties ◽

Training Process ◽

Training Strategy ◽

Inhibitory Power ◽

Exploratory Strategy

In this work, we explore the potential of deep learning to streamline the process of identifying new potential drugs through the computational generation of molecules with interesting biological properties. Two deep neural networks compose our targeted generation framework: the Generator, which is trained to learn the building rules of valid molecules using SMILES string notation, and the Predictor, which evaluates the newly generated compounds by predicting their affinity for the desired target. The Generator is then optimized through Reinforcement Learning to produce molecules with bespoke properties. The innovation of this approach is the exploratory strategy applied during the reinforcement training process, which seeks to add novelty to the generated compounds. This training strategy employs two Generators interchangeably to sample new SMILES: the initially trained model, which remains fixed, and a copy of it that is updated during training to uncover the most promising molecules. The evolution of the reward assigned by the Predictor determines how often each one is employed to select the next token of the molecule. This strategy establishes a compromise between the need to acquire more information about the chemical space and the need to sample new molecules with the experience gained so far. To demonstrate the effectiveness of the method, the Generator is trained to design molecules with an optimized partition coefficient and high inhibitory power against the adenosine $A_{2A}$ and $\kappa$ opioid receptors. The results reveal that the model can effectively steer the newly generated molecules in the desired direction. More importantly, it was possible to find promising sets of unique and diverse molecules, which was the main purpose of the newly implemented strategy.


OBLIGATION BLACKWELL GAMES AND P-AUTOMATA

Journal of Symbolic Logic ◽

10.1017/jsl.2016.71 ◽

2017 ◽

Vol 82 (2) ◽

pp. 420-452

Author(s):

KRISHNENDU CHATTERJEE ◽

NIR PITERMAN

Keyword(s):

Markov Chains ◽

Decision Problem ◽

Value Function ◽

Acceptance Condition ◽

Parity Games ◽

The Value Function

We generalize winning conditions in two-player games by adding a structural acceptance condition called obligations. Obligations are orthogonal to the linear winning conditions that define whether a play is winning. Obligations are a declaration that player 0 can achieve a certain value from a configuration. If the obligation is met, the value of that configuration for player 0 is 1. We define the value in such games and show that obligation games are determined. For Markov chains with Borel objectives and obligations, and finite turn-based stochastic parity games with obligations, we give an alternative and simpler characterization of the value function. Based on this simpler definition we show that the decision problem of winning finite turn-based stochastic parity games with obligations is in NP∩co-NP. We also show that obligation games provide a game framework for reasoning about p-automata.


Deep Learning and Autoregressive Approach for Prediction of Time Series Data

Journal of Autonomous Intelligence ◽

10.32629/jai.v3i2.207 ◽

2021 ◽

Vol 3 (2) ◽

pp. 1

Author(s):

Akhter Mohiuddin Rather

Keyword(s):

Neural Networks ◽

Artificial Neural Networks ◽

Deep Learning ◽

Statistical Models ◽

Time Complexity ◽

Time Series Data ◽

Series Data ◽

Learning Approach ◽

Proposed Model ◽

Artificial Neural

This paper proposes a deep learning approach for prediction of non-stationary data. A new regression scheme has been used in the proposed model. Any non-stationary data can be used to test the efficiency of the proposed model; in this work, stock data has been used because it is nonlinear or non-stationary in nature. Besides the proposed model, predictions were also obtained using some statistical models and artificial neural networks. Traditional statistical models did not yield the expected results, and artificial neural networks resulted in high time complexity. Therefore, the deep learning approach seemed to be the best method as of today for such problems, where both time complexity and prediction quality are of concern.


Risk-Sensitive Reinforcement Learning Applied to Control under Constraints

Journal of Artificial Intelligence Research ◽

10.1613/jair.1666 ◽

2005 ◽

Vol 24 ◽

pp. 81-108 ◽

Cited By ~ 65

Author(s):

P. Geibel ◽

F. Wysotzki

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Optimal Solution ◽

Feed Tank ◽

Model Free ◽

Constrained Problem ◽

Risk Sensitive ◽

Markov Decision ◽

The Value Function

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
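The idea of weighting value against risk and adapting the weight until the risk constraint is satisfied can be illustrated with a small sketch. This is not the authors' learning algorithm: it assumes the value and risk of a few candidate policies are already known, uses a hypothetical combined score `xi * value - risk`, and simply lowers the weight `xi` until the best-scoring candidate is feasible. All names and numbers are illustrative assumptions.

```python
def select_policy(candidates, threshold, weights):
    """Pick a feasible policy by adapting the value/risk weight.

    candidates: list of (name, value, risk) tuples; risk is the assumed
                probability of entering an error state under that policy.
    threshold:  user-specified upper bound on acceptable risk.
    weights:    weights to try, ordered from most to least value emphasis.
    Returns (name, weight) for the first weight whose best-scoring
    candidate satisfies the risk constraint, or None if none does.
    """
    for xi in weights:
        # Hypothetical weighted criterion: reward value, penalize risk.
        name, _value, risk = max(candidates, key=lambda c: xi * c[1] - c[2])
        if risk <= threshold:
            return name, xi
    return None


# Made-up candidates for illustration only.
candidates = [
    ("bold", 10.0, 0.40),
    ("balanced", 8.0, 0.10),
    ("timid", 3.0, 0.01),
]
choice = select_policy(candidates, threshold=0.15, weights=[1.0, 0.5, 0.1])
```

Trying the weights from high to low favors value first and concedes to the risk term only as far as needed, so the first feasible choice tends to retain as much value as the constraint allows.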


Autonomous underwater vehicle path planning based on actor-multi-critic reinforcement learning

Proceedings of the Institution of Mechanical Engineers Part I Journal of Systems and Control Engineering ◽

10.1177/0959651820937085 ◽

2020 ◽

pp. 095965182093708

Author(s):

Zhuo Wang ◽

Shiwei Zhang ◽

Xiaoning Feng ◽

Yancheng Sui

Keyword(s):

Reinforcement Learning ◽

Path Planning ◽

Value Function ◽

Autonomous Underwater Vehicle ◽

Autonomous Underwater Vehicles ◽

Underwater Vehicle ◽

Learning Efficiency ◽

Environmental Adaptability ◽

Vehicle Path ◽

The Value Function

The environmental adaptability of autonomous underwater vehicles is a persistent problem for their path planning. Although reinforcement learning can improve environmental adaptability, its convergence is slowed by multi-behavior coupling, which makes it difficult for an autonomous underwater vehicle to avoid moving obstacles. This article proposes a multi-behavior critic reinforcement learning algorithm for autonomous underwater vehicle path planning to overcome the oscillating amplitudes and low learning efficiency in the early stages of training that are common in traditional actor–critic algorithms. Behavior critic reinforcement learning assesses the actions of the actor from perspectives such as energy saving and security, combining these aspects into an overall evaluation of the actor. In this article, the policy gradient method is selected as the actor part, and the value function method is selected as the critic part. The policy gradient and the value function are each approximated by a backpropagation neural network, the parameters of which are updated using gradient descent. The simulation results show that the method can optimize learning in the environment and improve learning efficiency, meeting the real-time and adaptability requirements of dynamic obstacle avoidance for autonomous underwater vehicles.


Surveys without Questions: A Reinforcement Learning Approach

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.3301257 ◽

2019 ◽

Vol 33 ◽

pp. 257-264 ◽

Cited By ~ 1

Author(s):

Atanu R Sinha ◽

Deepali Jain ◽

Nikhil Sheoran ◽

Sopan Khosla ◽

Reshmi Sasidharan

Keyword(s):

Reinforcement Learning ◽

Survey Data ◽

Value Function ◽

Specific Interactions ◽

Aggregate Level ◽

Clickstream Data ◽

Online Interactions ◽

Performance Results ◽

The Value Function ◽

Over Time

The ‘old world’ instrument, the survey, remains a tool of choice for firms to obtain ratings of the satisfaction and experience that customers realize while interacting online with firms. While avenues for surveys have evolved from emails and links to pop-ups while browsing, the deficiencies persist. These include reliance on ratings from very few respondents to draw inferences about all customers’ online interactions; failure to capture a customer’s interactions over time, since the rating is a one-time snapshot; and inability to tie customers’ ratings back to specific interactions, because the ratings provided relate to all interactions. To overcome these deficiencies we extract proxy ratings from clickstream data, typically collected for every customer’s online interactions, by developing an approach based on Reinforcement Learning (RL). We introduce a new way to interpret values generated by the value function of RL as proxy ratings. Our approach does not need any survey data for training. Yet, on validation against actual survey data, proxy ratings yield reasonable performance results. Additionally, we offer a new way to draw insights from values of the value function, which allows associating specific interactions with their proxy ratings. We introduce two new metrics to represent ratings: one at the customer level and the other at the aggregate level, for click actions across customers. Both are defined around the proportion of all pairwise, successive actions that show an increase in proxy ratings. The intuitive customer-level metric enables gauging the dynamics of ratings over time and is a better predictor of purchase than customer ratings from surveys. The aggregate-level metric allows pinpointing actions that help or hurt experience. In sum, proxy ratings computed unobtrusively from clickstream, for every action, for each customer, and for every session, can offer an interpretable and more insightful alternative to surveys.
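The customer-level metric is described as the proportion of successive action pairs that show an increase in proxy rating. A minimal sketch of that computation follows, assuming the proxy ratings for one customer are already available as a list ordered by time; the function name `increase_proportion` and the sample values are hypothetical, and the paper's exact definition may differ in details such as tie handling.

```python
def increase_proportion(proxy_ratings):
    """Share of successive pairs whose proxy rating strictly increases.

    proxy_ratings: time-ordered proxy ratings for one customer's actions
                   (assumed to come from a learned value function).
    Returns 0.0 when there are fewer than two actions.
    """
    pairs = list(zip(proxy_ratings, proxy_ratings[1:]))
    if not pairs:
        return 0.0
    increases = sum(1 for before, after in pairs if after > before)
    return increases / len(pairs)


# Illustrative session: ratings rise, dip once, then rise twice.
score = increase_proportion([0.2, 0.5, 0.4, 0.7, 0.9])
```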


Approximating the value function for continuous space reinforcement learning in robot control

IEEE/RSJ International Conference on Intelligent Robots and System ◽

10.1109/irds.2002.1041532 ◽

2003 ◽

Cited By ~ 2

Author(s):

S. Buck ◽

M. Beetz ◽

T. Schmitt

Keyword(s):

Reinforcement Learning ◽

Robot Control ◽

Value Function ◽

Continuous Space ◽

The Value Function


Reinforcement Learning for Hyperparameter Tuning in Deep Learning-based Side-channel Analysis

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2021.i3.677-707 ◽

2021 ◽

pp. 677-707

Author(s):

Jorai Rijsdijk ◽

Lichao Wu ◽

Guilherme Perin ◽

Stjepan Picek

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Learning ◽

Reinforcement Learning ◽

Convolutional Neural Networks ◽

Random Search ◽

High Price ◽

Side Channel ◽

Q Learning ◽

Reward Functions

Deep learning represents a powerful set of techniques for profiling side-channel analysis. The results of the last few years show that neural network architectures like multilayer perceptrons and convolutional neural networks give strong attack performance, making it possible to break targets protected with various countermeasures. Considering that deep learning techniques commonly have a plethora of hyperparameters to tune, it is clear that such top attack results can come with a high price in preparing the attack. This is especially problematic as the side-channel community commonly uses random search or grid search techniques to look for the best hyperparameters. In this paper, we propose to use reinforcement learning to tune the convolutional neural network hyperparameters. In our framework, we investigate the Q-Learning paradigm and develop two reward functions that use side-channel metrics. We mount an investigation on three commonly used datasets and two leakage models, where the results show that reinforcement learning can find convolutional neural networks exhibiting top performance while having small numbers of trainable parameters. We note that our approach is automated and can be easily adapted to different datasets. Several of our newly developed architectures outperform the current state-of-the-art results. Finally, we make our source code publicly available: https://github.com/AISyLab/Reinforcement-Learning-for-SCA


Statistical inference of the value function for reinforcement learning in infinite‐horizon settings

Journal of the Royal Statistical Society Series B (Statistical Methodology) ◽

10.1111/rssb.12465 ◽

2021 ◽

Author(s):

Chengchun Shi ◽

Sheng Zhang ◽

Wenbin Lu ◽

Rui Song

Keyword(s):

Reinforcement Learning ◽

Statistical Inference ◽

Value Function ◽

Infinite Horizon ◽

The Value Function
