Instance-Dependent ℓ∞-Bounds for Policy Evaluation in Tabular Reinforcement Learning

Organizations need to constantly learn, develop, and evaluate new strategies and policies for their effective operation. Unsupervised reinforcement learning is becoming a highly useful tool, since rewards and punishments in different forms are pervasive and present in a wide variety of decision-making scenarios. By observing the outcome of a sufficient number of repeated trials, one would gradually learn the value and usefulness of a particular policy or strategy. However, in a given environment, the outcomes resulting from different trials are subject to external chance influence and variations. In learning about the usefulness of a given policy, significant costs are involved in systematically undertaking the sequential trials; therefore, in most learning episodes, one would wish to keep the cost within bounds by adopting learning efficient stopping rules. In this Chapter, we explain the deployment of different learning strategies in given environments for reinforcement learning policy evaluation and review, and we present suggestions for their practical use and applications.

Download Full-text

Error Bounds in Reinforcement Learning Policy Evaluation

Advances in Artificial Intelligence - Lecture Notes in Computer Science ◽

10.1007/11424918_48 ◽

2005 ◽

pp. 438-449

Author(s):

Fletcher Lu

Keyword(s):

Reinforcement Learning ◽

Error Bounds ◽

Policy Evaluation

Download Full-text

A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5779 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3701-3708

Author(s):

Gal Dalal ◽

Balazs Szorenyi ◽

Gugan Thoppe

Keyword(s):

Reinforcement Learning ◽

Convergence Rate ◽

Policy Evaluation ◽

Finite Time ◽

High Probability ◽

Temporal Difference ◽

Time Analysis ◽

Difference Methods ◽

Temporal Difference Methods ◽

Two Timescale Stochastic Approximation

Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, θn and wn, which are updated using two distinct stepsize sequences, αn and βn, respectively. Assuming αn = n−α and βn = n−β with 1 > α > β > 0, we show that, with high probability, the two iterates converge to their respective solutions θ* and w* at rates given by ∥θn - θ*∥ = Õ(n−α/2) and ∥wn - w*∥ = Õ(n−β/2); here, Õ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square summable ones.

Download Full-text

Fully asynchronous policy evaluation in distributed reinforcement learning over networks

Automatica ◽

10.1016/j.automatica.2021.110092 ◽

2022 ◽

Vol 136 ◽

pp. 110092

Author(s):

Xingyu Sha ◽

Jiaqi Zhang ◽

Keyou You ◽

Kaiqing Zhang ◽

Tamer Başar

Keyword(s):

Reinforcement Learning ◽

Policy Evaluation ◽

Distributed Reinforcement

Download Full-text

Learning Sparse Representations in Reinforcement Learning with Sparse Coding

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/287 ◽

2017 ◽

Author(s):

Lei Le ◽

Raksha Kumaraswamy ◽

Martha White

Keyword(s):

Reinforcement Learning ◽

Sparse Representation ◽

Policy Evaluation ◽

Sparse Coding ◽

Representation Learning ◽

Sparse Representations ◽

Learning Approaches ◽

Local Minima ◽

Global Minima ◽

Tile Coding

A variety of representation learning approaches have been investigated for reinforcement learning; much less attention, however, has been given to investigating the utility of sparse coding. Outside of reinforcement learning, sparse coding representations have been widely used, with non-convex objectives that result in discriminative representations. In this work, we develop a supervised sparse coding objective for policy evaluation. Despite the non-convexity of this objective, we prove that all local minima are global minima, making the approach amenable to simple optimization strategies. We empirically show that it is key to use a supervised objective, rather than the more straightforward unsupervised sparse coding approach. We then compare the learned representations to a canonical fixed sparse representation, called tile-coding, demonstrating that the sparse coding representation outperforms a wide variety of tile-coding representations.

Download Full-text

Distributed policy evaluation via inexact ADMM in multi-agent reinforcement learning

Control Theory and Technology ◽

10.1007/s11768-020-00007-x ◽

2020 ◽

Vol 18 (4) ◽

pp. 362-378

Author(s):

Xiaoxiao Zhao ◽

Peng Yi ◽

Li Li

Keyword(s):

Reinforcement Learning ◽

Policy Evaluation ◽

Multi Agent

Download Full-text

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Operations Research ◽

10.1287/opre.2021.2151 ◽

2021 ◽

Author(s):

Shicong Cen ◽

Chen Cheng ◽

Yuxin Chen ◽

Yuting Wei ◽

Yuejie Chi

Keyword(s):

Reinforcement Learning ◽

Global Convergence ◽

Policy Evaluation ◽

Gradient Methods ◽

Convergence Result ◽

Learning Rates ◽

Wide Range ◽

Policy Gradient ◽

Markov Decision ◽

Policy Optimization

Preconditioning and Regularization Enable Faster Reinforcement Learning Natural policy gradient (NPG) methods, in conjunction with entropy regularization to encourage exploration, are among the most popular policy optimization algorithms in contemporary reinforcement learning. Despite the empirical success, the theoretical underpinnings for NPG methods remain severely limited. In “Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization”, Cen, Cheng, Chen, Wei, and Chi develop nonasymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on tabular discounted Markov decision processes. Assuming access to exact policy evaluation, the authors demonstrate that the algorithm converges linearly at an astonishing rate that is independent of the dimension of the state-action space. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Accommodating a wide range of learning rates, this convergence result highlights the role of preconditioning and regularization in enabling fast convergence.

Download Full-text