Riemannian Proximal Policy Optimization

2020 ◽  
Vol 13 (3) ◽  
pp. 93
Author(s):  
Shijun Wang ◽  
Baocheng Zhu ◽  
Chen Li ◽  
Mingzhe Wu ◽  
James Zhang ◽  
...  

In this paper, we propose a general Riemannian proximal optimization algorithm with guaranteed convergence to solve Markov decision process (MDP) problems. To model policy functions in MDPs, we employ a Gaussian mixture model (GMM) and formulate the policy optimization as a non-convex problem in the Riemannian space of positive semidefinite matrices. For two given policy functions, we also provide a lower bound on policy improvement by using bounds derived from the Wasserstein distance between GMMs. Preliminary experiments show the efficacy of the proposed Riemannian proximal policy optimization algorithm.
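The policy-improvement bound above rests on the Wasserstein distance between GMMs, which is commonly built from the closed-form 2-Wasserstein distance between individual Gaussian components. A minimal NumPy/SciPy sketch of that component-level quantity is shown below; the function name and toy inputs are illustrative, not taken from the paper.

```python
# Closed-form squared 2-Wasserstein distance between two Gaussian components,
# the standard building block for Wasserstein-type bounds between GMMs.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, C1, m2, C2):
    """W2^2 between N(m1, C1) and N(m2, C2)."""
    sqrt_C2 = sqrtm(C2)
    cross = sqrtm(sqrt_C2 @ C1 @ sqrt_C2)
    # Discard tiny imaginary parts that sqrtm may introduce numerically.
    bures = np.trace(C1 + C2 - 2 * np.real(cross))
    return float(np.sum((m1 - m2) ** 2) + bures)

# Example: two 2-D Gaussian policy components.
m1, C1 = np.zeros(2), np.eye(2)
m2, C2 = np.ones(2), 2 * np.eye(2)
print(gaussian_w2_squared(m1, C1, m2, C2))
```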

2021 ◽  
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

Abstract We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of their treatment of patients diagnosed with sepsis.
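As context for the subgradient-based scheme mentioned in the abstract, the sketch below shows a generic projected-subgradient template for a non-differentiable convex objective. It is not the authors' algorithm; `objective_subgradient` and the toy example are hypothetical placeholders.

```python
# Generic subgradient descent for a convex, possibly non-differentiable objective.
import numpy as np

def subgradient_descent(objective_subgradient, w0, steps=500, step_size=0.5):
    """Minimize a convex objective given an oracle returning any valid subgradient."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, steps + 1):
        g = objective_subgradient(w)            # any subgradient at the current point
        w = w - (step_size / np.sqrt(t)) * g    # diminishing step size
    return w

# Example: minimize f(w) = ||w||_1, whose subgradient is sign(w).
print(subgradient_descent(np.sign, w0=np.array([3.0, -2.0, 0.5])))
```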


2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Wenkai Li ◽  
Chenyang Wang ◽  
Ding Li ◽  
Bin Hu ◽  
Xiaofei Wang ◽  
...  

Edge caching is a promising method to deal with the traffic explosion problem in future networks. To satisfy user requests, contents can be proactively cached in proximity to users (e.g., at base stations or user devices). Recently, several learning-based edge caching optimizations have been discussed. However, most previous studies explore the influence of dynamic and constantly expanding action and caching spaces, leading to impracticality and low efficiency. In this paper, we study the edge caching optimization problem by utilizing the Double Deep Q-network (Double DQN) learning framework to maximize the hit rate of user requests. Firstly, we obtain the Device-to-Device (D2D) sharing model by considering both online and offline factors, and then we formulate the optimization problem, which is proved to be NP-hard. The edge caching replacement problem is then modeled as a Markov decision process (MDP). Finally, an edge caching strategy based on Double DQN is proposed. Experimental results based on large-scale real-world traces show the effectiveness of the proposed framework.
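The core of a Double DQN agent is its target computation, which decouples action selection (online network) from action evaluation (target network). The NumPy sketch below shows that standard target; the caching-specific state/action encoding, reward, and network architecture of the paper are not reproduced here.

```python
# Standard Double DQN target: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Compute bootstrapped targets for a batch of transitions (terminal states excluded)."""
    best_actions = np.argmax(next_q_online, axis=1)                   # selection: online net
    chosen_q = next_q_target[np.arange(len(rewards)), best_actions]   # evaluation: target net
    return rewards + gamma * chosen_q * (1.0 - dones)

# Example with a batch of 2 transitions and 3 candidate caching actions.
r = np.array([1.0, 0.0])
q_online = np.array([[0.2, 0.9, 0.1], [0.5, 0.4, 0.6]])
q_target = np.array([[0.3, 0.8, 0.2], [0.7, 0.1, 0.5]])
print(double_dqn_targets(r, q_online, q_target, dones=np.array([0.0, 1.0])))
```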


2021 ◽  
Vol 73 (09) ◽  
pp. 46-47
Author(s):  
Chris Carpenter

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 201254, “Reinforcement Learning for Field-Development Policy Optimization,” by Giorgio De Paola, SPE, and Cristina Ibanez-Llano, Repsol, and Jesus Rios, IBM, et al., prepared for the 2020 SPE Annual Technical Conference and Exhibition, originally scheduled to be held in Denver, Colorado, 5–7 October. The paper has not been peer reviewed.

A field-development plan consists of a sequence of decisions. Each action taken affects the reservoir and conditions any future decision. The presence of uncertainty associated with this process, however, is undeniable. The novelty of the approach proposed by the authors in the complete paper is the consideration of the sequential nature of the decisions through the framework of dynamic programming (DP) and reinforcement learning (RL). This methodology shifts the focus from a static field-development-plan optimization to a more dynamic framework that the authors call field-development policy optimization. This synopsis focuses on the methodology, while the complete paper also contains a real-field application of the methodology.

Methodology

Deep RL (DRL). RL is considered an important learning paradigm in artificial intelligence (AI) but differs from supervised and unsupervised learning, the most commonly known types currently studied in the field of machine learning. During the last decade, RL has attracted greater attention because of successes in applications such as games and self-driving cars, obtained by combining RL with deep-learning architectures (a combination known as DRL), which has allowed RL to scale to previously unsolvable problems and, therefore, to solve much larger sequential decision problems.

RL, also referred to as stochastic approximate dynamic programming, is a goal-directed paradigm of sequential learning from interaction. The learner, or agent, is not told what to do but instead must learn which actions or decisions yield a maximum reward through interaction with an uncertain environment, without losing too much reward along the way. Learning from interaction to achieve a goal must be balanced with the exploration and exploitation of possible actions. Another key characteristic of this type of problem is its sequential nature: the actions taken by the agent affect the environment itself and, therefore, the subsequent data it receives and the subsequent actions to be taken. Mathematically, such problems are formulated in the framework of the Markov decision process (MDP), which arises primarily in the field of optimal control.

An RL problem consists of two principal parts: the agent, or decision-making engine, and the environment, the interactive world for the agent (in this case, the reservoir). Sequentially, at each timestep, the agent takes an action (e.g., changing control rates or deciding a well location) that makes the environment (reservoir) transition from one state to another. Next, the agent receives a reward (e.g., a cash flow) and an observation of the state of the environment (partial or total) before taking the next action. All relevant information informing the agent of the state of the system is assumed to be included in the last state observed by the agent (Markov property). If the agent observes the full environment state once it has acted, the MDP is said to be fully observable; otherwise, a partially observable Markov decision process (POMDP) results.

The agent’s objective is to learn a policy mapping from states (MDPs) or histories (POMDPs) to actions such that the agent’s cumulative (discounted) reward in the long run is maximized.
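The agent-environment interaction described above can be summarized by a simple rollout loop that accumulates discounted reward. The sketch below is a minimal, generic version with the reservoir simulator abstracted behind a hypothetical reset/step interface and a trivial toy environment; it is not the authors' implementation.

```python
# Generic MDP rollout: the agent acts, the environment transitions, rewards accumulate.
def run_episode(env, policy, gamma=0.95):
    """Roll out one episode and return the discounted cumulative reward."""
    state = env.reset()
    total, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # e.g., choose a well location or control rate
        state, reward, done = env.step(action)   # environment (reservoir) transitions
        total += discount * reward               # accumulate discounted reward (e.g., cash flow)
        discount *= gamma
    return total

class ToyEnv:
    """A 3-step toy environment standing in for the reservoir simulator."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward, self.t >= 3

print(run_episode(ToyEnv(), policy=lambda s: 1))
```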


2017 ◽  
Vol 30 (6) ◽  
pp. 1079-1102 ◽  
Author(s):  
Gabriel Peyré ◽  
Lénaïc Chizat ◽  
François-Xavier Vialard ◽  
Justin Solomon

This article introduces a new notion of optimal transport (OT) between tensor fields, which are measures whose values are positive semidefinite (PSD) matrices. This “quantum” formulation of optimal transport (Q-OT) corresponds to a relaxed version of the classical Kantorovich transport problem, where the fidelity between the input PSD-valued measures is captured using the geometry of the von Neumann quantum entropy. We propose a quantum-entropic regularization of the resulting convex optimization problem, which can be solved efficiently using an iterative scaling algorithm. This method is a generalization of the celebrated Sinkhorn algorithm to the quantum setting of PSD matrices. We extend this formulation and the quantum Sinkhorn algorithm to compute barycentres within a collection of input tensor fields. We illustrate the usefulness of the proposed approach on applications to procedural noise generation, anisotropic meshing, diffusion tensor imaging and spectral texture synthesis.
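For reference, the classical (scalar-valued) Sinkhorn iteration that the quantum Sinkhorn algorithm generalizes is shown below. This is the standard entropic-OT scaling loop between histograms, not the PSD-matrix-valued variant developed in the article; the toy grid and marginals are illustrative.

```python
# Classical Sinkhorn scaling for entropy-regularized optimal transport.
import numpy as np

def sinkhorn(a, b, cost, epsilon=0.05, iters=500):
    """Entropic OT between histograms a and b with ground-cost matrix `cost`."""
    K = np.exp(-cost / epsilon)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                 # scale columns to match marginal b
        u = a / (K @ v)                   # scale rows to match marginal a
    return u[:, None] * K * v[None, :]    # transport plan with marginals (a, b)

# Example: transport between two small histograms on a 1-D grid.
x = np.linspace(0, 1, 5)
cost = (x[:, None] - x[None, :]) ** 2
a = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
b = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
P = sinkhorn(a, b, cost)
print(P.sum(axis=1), P.sum(axis=0))       # approximately equal to a and b
```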


2016 ◽  
Vol 48 (2) ◽  
pp. 392-405 ◽  
Author(s):  
Iker Perez ◽  
David Hodge ◽  
Huiling Le

Abstract In this paper we are concerned with analysing optimal wealth-allocation techniques within a defaultable financial market similar to that of Bielecki and Jang (2007). We study a portfolio optimization problem combining a continuous-time jump market and a defaultable security, and present numerical solutions through conversion into a Markov decision process and characterization of its value function as the unique fixed point of a contracting operator. We analyse allocation strategies under several families of utility functions and highlight significant portfolio-selection differences with previously reported results.
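Characterizing the value function as the unique fixed point of a contracting operator is what makes iterative schemes converge; the sketch below illustrates the idea with plain value iteration on a toy finite MDP. The transition matrices and rewards are illustrative only and unrelated to the paper's financial-market model.

```python
# Value iteration: repeatedly apply the contracting Bellman operator until convergence.
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a] is the transition matrix and R[a] the reward vector for action a."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V             # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        V_new = Q.max(axis=0)             # Bellman operator applied to V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new                  # unique fixed point (by contraction)
        V = V_new

# Two-state, two-action toy MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 0.8]])
print(value_iteration(P, R))
```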

