Reinforcement Learning with Dynamic Boltzmann Softmax Updates

Value function estimation, i.e., prediction, is an important task in reinforcement learning. The Boltzmann softmax operator is a natural value estimator and can provide several benefits. However, it does not satisfy the non-expansion property, and its direct use may fail to converge even in value iteration. In this paper, we propose to update the value function with the dynamic Boltzmann softmax (DBS) operator, which has good convergence properties in both planning and learning settings. Experimental results on GridWorld show that the DBS operator enables better estimation of the value function and rectifies the convergence issue of the softmax operator. Finally, we propose the DBS-DQN algorithm by applying the DBS operator, which outperforms DQN substantially in 40 out of 49 Atari games.
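To make the operator concrete, here is a minimal Python sketch; it is not the paper's implementation. It assumes the standard Boltzmann softmax form (an exponentially weighted average of the action values) and an illustrative schedule beta_t = t^2 for the dynamic variant. The abstract does not state the schedule, so the schedule, the MDP encoding, and the names `boltzmann_softmax` and `dbs_value_iteration` are all assumptions made for illustration.

```python
import math


def boltzmann_softmax(values, beta):
    """Boltzmann-weighted average of `values` with inverse temperature `beta`.

    beta = 0 gives the plain mean; as beta grows the result approaches max(values).
    """
    m = max(values)  # subtract the max so exp() never overflows
    weights = [math.exp(beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def dbs_value_iteration(P, R, gamma, n_iters, beta_schedule=lambda t: float(t * t)):
    """Value iteration where the max over actions is replaced by a Boltzmann
    softmax whose beta changes with the iteration index (assumed schedule).

    P[s][a] is a list of (probability, next_state) pairs; R[s][a] is a reward.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for t in range(1, n_iters + 1):
        beta = beta_schedule(t)
        new_V = []
        for s in range(n_states):
            q_values = [
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            ]
            new_V.append(boltzmann_softmax(q_values, beta))
        V = new_V
    return V


if __name__ == "__main__":
    # Hypothetical one-state MDP: action 0 pays 0, action 1 pays 1, both self-loop.
    P = [[[(1.0, 0)], [(1.0, 0)]]]
    R = [[0.0, 1.0]]
    print(dbs_value_iteration(P, R, gamma=0.9, n_iters=200))
```

With a fixed small beta the backup stays strictly below the greedy value; letting beta grow across iterations is what allows this sketch to approach the optimal value 1 / (1 - 0.9) = 10 on the toy problem.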


Approximate Value Iteration with Temporally Extended Actions

Journal of Artificial Intelligence Research ◽

10.1613/jair.4676 ◽

2015 ◽

Vol 53 ◽

pp. 375-438 ◽

Cited By ~ 3

Author(s):

Timothy A. Mann ◽

Shie Mannor ◽

Doina Precup

Keyword(s):

Reinforcement Learning ◽

Theoretical Analysis ◽

Convergence Rate ◽

Value Function ◽

Approximation Error ◽

Experimental Results ◽

Value Iteration ◽

Efficient Planning ◽

Approximate Value Iteration ◽

The Value Function

Temporally extended actions have proven useful for reinforcement learning, but their duration also makes them valuable for efficient planning. The options framework provides a concrete way to implement and reason about temporally extended actions. Existing literature has demonstrated the value of planning with options empirically, but there is a lack of theoretical analysis formalizing when planning with options is more efficient than planning with primitive actions. We provide a general analysis of the convergence rate of a popular Approximate Value Iteration (AVI) algorithm called Fitted Value Iteration (FVI) with options. Our analysis reveals that longer duration options and a pessimistic estimate of the value function both lead to faster convergence. Furthermore, options can improve convergence even when they are suboptimal and sparsely distributed throughout the state-space. Next we consider the problem of generating useful options for planning based on a subset of landmark states. This suggests a new algorithm, Landmark-based AVI (LAVI), that represents the value function only at the landmark states. We analyze both FVI and LAVI using the proposed landmark-based options and compare the two algorithms. Our experimental results in three different domains demonstrate the key properties from the analysis. Our theoretical and experimental results demonstrate that options can play an important role in AVI by decreasing approximation error and inducing fast convergence.


Approximate Value Iteration with Temporally Extended Actions (Extended Abstract)

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/717 ◽

2017 ◽

Author(s):

Timothy A. Mann ◽

Shie Mannor ◽

Doina Precup

Keyword(s):

Theoretical Analysis ◽

Convergence Rate ◽

State Space ◽

Value Function ◽

Approximation Error ◽

General Analysis ◽

Experimental Results ◽

Value Iteration ◽

Approximate Value Iteration ◽

The Value Function

The options framework provides a concrete way to implement and reason about temporally extended actions. Existing literature has demonstrated the value of planning with options empirically, but there is a lack of theoretical analysis formalizing when planning with options is more efficient than planning with primitive actions. We provide a general analysis of the convergence rate of a popular Approximate Value Iteration (AVI) algorithm called Fitted Value Iteration (FVI) with options. Our analysis reveals that longer duration options and a pessimistic estimate of the value function both lead to faster convergence. Furthermore, options can improve convergence even when they are suboptimal and sparsely distributed throughout the state space. Next we consider generating useful options for planning based on a subset of landmark states. This suggests a new algorithm, Landmark-based AVI (LAVI), that represents the value function only at landmark states. We analyze FVI with options (OFVI) and LAVI using the proposed landmark-based options and compare the two algorithms. Our theoretical and experimental results demonstrate that options can play an important role in AVI by decreasing approximation error and inducing fast convergence.


Solving flow-shop scheduling problem with a reinforcement learning algorithm that generalizes the value function with neural network

Alexandria Engineering Journal ◽

10.1016/j.aej.2021.01.030 ◽

2021 ◽

Vol 60 (3) ◽

pp. 2787-2800

Author(s):

Jianfeng Ren ◽

Chunming Ye ◽

Feng Yang

Keyword(s):

Neural Network ◽

Reinforcement Learning ◽

Value Function ◽

Flow Shop ◽

Learning Algorithm ◽

Flow Shop Scheduling ◽

Scheduling Problem ◽

Shop Scheduling ◽

The Value Function ◽

Reinforcement Learning Algorithm


Incremental State Aggregation for Value Function Estimation in Reinforcement Learning

IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) ◽

10.1109/tsmcb.2011.2148710 ◽

2011 ◽

Vol 41 (5) ◽

pp. 1407-1416 ◽

Cited By ~ 10

Author(s):

T. Mori ◽

S. Ishii

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Function Estimation ◽

State Aggregation


Risk-Sensitive Reinforcement Learning Applied to Control under Constraints

Journal of Artificial Intelligence Research ◽

10.1613/jair.1666 ◽

2005 ◽

Vol 24 ◽

pp. 81-108 ◽

Cited By ~ 65

Author(s):

P. Geibel ◽

F. Wysotzki

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Optimal Solution ◽

Feed Tank ◽

Model Free ◽

Constrained Problem ◽

Risk Sensitive ◽

Markov Decision ◽

The Value Function

In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that performs well with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The strength of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
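The weighting idea can be illustrated with a deliberately simplified Python sketch. It is not the authors' algorithm: it assumes a small set of candidate policies whose value and risk are already known, scores each as value minus weight times risk (an assumed form of the weighting), and raises the weight in fixed steps until the best-scoring candidate satisfies the risk threshold. The names `pick_policy` and `adapt_weight`, the step size, and the candidate data are all hypothetical.

```python
def pick_policy(candidates, weight):
    """Return the candidate with the highest weighted score value - weight * risk."""
    return max(candidates, key=lambda c: c["value"] - weight * c["risk"])


def adapt_weight(candidates, risk_threshold, step=1.0, max_steps=1000):
    """Increase the risk weight until the selected candidate is feasible.

    Returns (weight, candidate), or None if no feasible candidate is
    selected within max_steps.
    """
    weight = 0.0
    for _ in range(max_steps):
        best = pick_policy(candidates, weight)
        if best["risk"] <= risk_threshold:
            return weight, best
        weight += step
    return None


if __name__ == "__main__":
    # Hypothetical candidates: higher value tends to come with higher risk.
    candidates = [
        {"name": "fast", "value": 10.0, "risk": 0.5},
        {"name": "careful", "value": 8.0, "risk": 0.1},
        {"name": "idle", "value": 1.0, "risk": 0.0},
    ]
    print(adapt_weight(candidates, risk_threshold=0.2))
```

Stopping at the first feasible weight is what keeps the selected policy as good as possible with respect to the value function; a larger weight would eventually favor the risk-free but low-value candidate.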


Implementation of English “Online and Offline” Hybrid Teaching Recommendation Platform Based on Reinforcement Learning

Security and Communication Networks ◽

10.1155/2021/4875330 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Danling Dong ◽

Libo Wu

Keyword(s):

Reinforcement Learning ◽

Online Teaching ◽

Value Function ◽

Cold Start ◽

The State ◽

User Evaluation ◽

Function Estimation ◽

Recommendation Algorithm ◽

Target User ◽

Hybrid Teaching

At present, there is a serious disconnect between online and offline teaching in large-scale English MOOC hybrid teaching recommendation platforms. This is mainly due to the cold start and matrix sparsity problems in the recommendation algorithm, and to the difficulty of fully capturing users' interest characteristics when only user ratings are considered and personalized user evaluations are neglected. To solve these problems, this paper proposes using ideas from reinforcement learning together with user evaluation factors to build an online and offline hybrid English teaching recommendation platform. First, value function estimation from reinforcement learning is introduced, and the difference between user state value functions replaces the previous similarity calculation method, thus alleviating the matrix sparsity problem. The learning rate is used to control the convergence speed of the weight vector in the user state value function to alleviate the cold start problem. Second, by adding the learning of the user evaluation vector to the estimation of the state value function, the user's state value function can be estimated approximately and the degree of discrimination of the target user can be reflected. Experimental results show that the proposed recommendation algorithm effectively alleviates the cold start and matrix sparsity problems of current collaborative filtering recommendation algorithms, captures the characteristics of users' interests in greater depth, and further improves the accuracy of rating prediction.


Autonomous underwater vehicle path planning based on actor-multi-critic reinforcement learning

Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering ◽

10.1177/0959651820937085 ◽

2020 ◽

pp. 095965182093708

Author(s):

Zhuo Wang ◽

Shiwei Zhang ◽

Xiaoning Feng ◽

Yancheng Sui

Keyword(s):

Reinforcement Learning ◽

Path Planning ◽

Value Function ◽

Autonomous Underwater Vehicle ◽

Autonomous Underwater Vehicles ◽

Underwater Vehicle ◽

Learning Efficiency ◽

Environmental Adaptability ◽

Vehicle Path ◽

The Value Function

Environmental adaptability is a persistent problem in path planning for autonomous underwater vehicles. Although reinforcement learning can improve environmental adaptability, multi-behavior coupling slows its convergence, making it difficult for an autonomous underwater vehicle to avoid moving obstacles. This article proposes a multi-behavior critic reinforcement learning algorithm for autonomous underwater vehicle path planning to overcome the oscillating amplitudes and low learning efficiency in the early stages of training that are common in traditional actor–critic algorithms. Behavior critic reinforcement learning assesses the actions of the actor from perspectives such as energy saving and security, and combines these aspects into an overall evaluation of the actor. In this article, the policy gradient method is selected as the actor part, and the value function method is selected as the critic part. The policy gradient and value function methods for the actor and critic, respectively, are approximated by a backpropagation neural network, whose parameters are updated using gradient descent. The simulation results show that the method can optimize learning in the environment and improve learning efficiency, meeting the real-time and adaptability requirements of dynamic obstacle avoidance for autonomous underwater vehicles.


Generation of Policy-Level Explanations for Reinforcement Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33012514 ◽

2019 ◽

Vol 33 ◽

pp. 2514-2521

Author(s):

Nicholay Topin ◽

Manuela Veloso

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Reinforcement Learning ◽

Markov Chains ◽

Time Complexity ◽

Value Function ◽

Worst Case ◽

Policy Level ◽

Individual Decisions ◽

The Value Function

Though reinforcement learning has greatly benefited from the incorporation of neural networks, the inability to verify the correctness of such systems limits their use. Current work in explainable deep learning focuses on explaining only a single decision in terms of input features, making it unsuitable for explaining a sequence of decisions. To address this need, we introduce Abstracted Policy Graphs, which are Markov chains of abstract states. This representation concisely summarizes a policy so that individual decisions can be explained in the context of expected future transitions. Additionally, we propose a method to generate these Abstracted Policy Graphs for deterministic policies given a learned value function and a set of observed transitions, potentially off-policy transitions used during training. Since no restrictions are placed on how the value function is generated, our method is compatible with many existing reinforcement learning methods. We prove that the worst-case time complexity of our method is quadratic in the number of features and linear in the number of provided transitions, $O(|F|^2 \, |tr\_samples|)$. By applying our method to a family of domains, we show that our method scales well in practice and produces Abstracted Policy Graphs which reliably capture relationships within these domains.


Surveys without Questions: A Reinforcement Learning Approach

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.3301257 ◽

2019 ◽

Vol 33 ◽

pp. 257-264 ◽

Cited By ~ 1

Author(s):

Atanu R Sinha ◽

Deepali Jain ◽

Nikhil Sheoran ◽

Sopan Khosla ◽

Reshmi Sasidharan

Keyword(s):

Reinforcement Learning ◽

Survey Data ◽

Value Function ◽

Specific Interactions ◽

Aggregate Level ◽

Clickstream Data ◽

Online Interactions ◽

Performance Results ◽

The Value Function ◽

Over Time

The ‘old world’ instrument, the survey, remains a tool of choice for firms to obtain ratings of the satisfaction and experience that customers realize while interacting online with firms. While avenues for surveys have evolved from emails and links to pop-ups while browsing, the deficiencies persist. These include reliance on the ratings of very few respondents to make inferences about all customers’ online interactions; failure to capture a customer’s interactions over time, since the rating is a one-time snapshot; and an inability to tie customers’ ratings back to specific interactions, because the ratings provided relate to all interactions. To overcome these deficiencies, we extract proxy ratings from clickstream data, typically collected for every customer’s online interactions, by developing an approach based on Reinforcement Learning (RL). We introduce a new way to interpret values generated by the value function of RL as proxy ratings. Our approach does not need any survey data for training. Yet, on validation against actual survey data, proxy ratings yield reasonable performance results. Additionally, we offer a new way to draw insights from values of the value function, which allows associating specific interactions with their proxy ratings. We introduce two new metrics to represent ratings: one customer-level, and the other aggregate-level for click actions across customers. Both are defined around the proportion of all pairwise, successive actions that show an increase in proxy ratings. The intuitive customer-level metric enables gauging the dynamics of ratings over time and is a better predictor of purchase than customer ratings from surveys. The aggregate-level metric allows pinpointing actions that help or hurt experience. In sum, proxy ratings computed unobtrusively from clickstream data, for every action, for each customer, and for every session, can offer an interpretable and more insightful alternative to surveys.
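The customer-level metric described above (the proportion of successive action pairs whose proxy rating increases) can be sketched in a few lines of Python. This is only an illustration under assumptions: the proxy ratings are taken to be a plain list of value-function outputs ordered by time within one customer's session, and the name `fraction_of_increases` and the sample numbers are hypothetical, not from the paper.

```python
def fraction_of_increases(proxy_ratings):
    """Share of successive action pairs whose proxy rating goes up.

    `proxy_ratings` is an ordered list of values, one per click action.
    Returns None when there are fewer than two actions (no pairs to compare).
    """
    if len(proxy_ratings) < 2:
        return None
    pairs = list(zip(proxy_ratings, proxy_ratings[1:]))
    increases = sum(1 for previous, current in pairs if current > previous)
    return increases / len(pairs)


if __name__ == "__main__":
    # Hypothetical session with four actions: up, down, up.
    print(fraction_of_increases([0.2, 0.5, 0.4, 0.9]))
```

A value near 1 would indicate a session in which the proxy rating rose after almost every action, while a value near 0 would indicate a steadily worsening experience.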


Approximating the value function for continuous space reinforcement learning in robot control

IEEE/RSJ International Conference on Intelligent Robots and Systems ◽

10.1109/irds.2002.1041532 ◽

2003 ◽

Cited By ~ 2

Author(s):

S. Buck ◽

M. Beetz ◽

T. Schmitt

Keyword(s):

Reinforcement Learning ◽

Robot Control ◽

Value Function ◽

Continuous Space ◽

The Value Function
