Multi-objective Reinforcement Learning through Continuous Pareto Manifold Approximation

2016, Vol. 57, pp. 187-227
Author(s): Simone Parisi, Matteo Pirotta, Marcello Restelli

Many real-world control applications, from economics to robotics, are characterized by the presence of multiple conflicting objectives. In these problems, the standard concept of optimality is replaced by Pareto-optimality, and the goal is to find the Pareto frontier, a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, achieving an accurate representation of the Pareto frontier is still an important challenge. In this paper, we propose a reinforcement learning policy gradient approach to learn a continuous approximation of the Pareto frontier in multi-objective Markov Decision Problems (MOMDPs). Unlike previous policy gradient algorithms, which execute n optimization routines to obtain n solutions, our approach performs a single gradient-ascent run, generating at each step an improved continuous approximation of the Pareto frontier. The idea is to optimize the parameters of a function defining a manifold in the policy-parameter space, so that the corresponding image in the objective space gets as close as possible to the true Pareto frontier. Besides deriving how to compute and estimate such a gradient, we also discuss the non-trivial issue of defining a metric to assess the quality of the candidate Pareto frontiers. Finally, the properties of the proposed approach are empirically evaluated on two problems, a linear-quadratic Gaussian regulator and a water reservoir control task.
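
A minimal numerical sketch may help fix the manifold-ascent idea: a map phi(t; rho) from a scalar t in [0, 1] to policy parameters is adjusted in a single gradient-ascent run so that its image in objective space improves a frontier-quality metric. The toy objectives, the t-dependent scalarization used as the metric, the form of phi, and the finite-difference gradient are all illustrative assumptions, not the paper's derived gradient or its frontier metric.

import numpy as np

def phi(t, rho):
    """Hypothetical manifold: linear interpolation between two parameter
    vectors plus a learned bend (rho packs three vectors a, b, c)."""
    a, b, c = rho
    return (1 - t) * a + t * b + t * (1 - t) * c

def objectives(theta):
    """Stand-in for policy evaluation, returning two conflicting returns."""
    return np.array([-np.sum((theta - 1.0) ** 2),
                     -np.sum((theta + 1.0) ** 2)])

def frontier_quality(rho, ts):
    """Crude metric: score each manifold point with a t-dependent
    scalarization so different t's are pushed toward different trade-offs."""
    total = 0.0
    for t in ts:
        j = objectives(phi(t, rho))
        total += (1 - t) * j[0] + t * j[1]
    return total

def ascend(rho, ts, lr=1e-2, eps=1e-4, steps=200):
    """Single gradient-ascent run on the manifold parameters rho,
    using finite differences in place of the paper's analytic gradient."""
    rho = [p.copy() for p in rho]
    for _ in range(steps):
        base = frontier_quality(rho, ts)
        grads = []
        for i, p in enumerate(rho):
            g = np.zeros_like(p)
            for j in range(p.size):
                bumped = [q.copy() for q in rho]
                bumped[i][j] += eps
                g[j] = (frontier_quality(bumped, ts) - base) / eps
            grads.append(g)
        rho = [p + lr * g for p, g in zip(rho, grads)]
    return rho

rho0 = [np.zeros(2), np.zeros(2), np.zeros(2)]
ts = np.linspace(0.0, 1.0, 11)
rho_star = ascend(rho0, ts)
print([objectives(phi(t, rho_star)).round(2) for t in (0.0, 0.5, 1.0)])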

2019, Vol. 18 (03), pp. 1045-1082
Author(s): Javier García, Roberto Iglesias, Miguel A. Rodríguez, Carlos V. Regueiro

Real-world problems usually involve the optimization of multiple, possibly conflicting, objectives. These problems may be addressed by multi-objective reinforcement learning (MORL) techniques. MORL is a generalization of standard reinforcement learning (RL) in which the single reward signal is extended to multiple signals, one for each objective, and it is the process of learning policies that optimize multiple objectives simultaneously. In these problems, directional (gradient) information can be used to guide exploration toward increasingly better behaviors. However, traditional policy-gradient approaches have two main drawbacks: they require a batch of episodes to properly estimate the gradient information (which slows down learning), and they use stochastic policies, which can have a disastrous impact on the safety of the learning system. In this paper, we present a novel population-based MORL algorithm for problems in which the underlying objectives are reasonably smooth. It has two main characteristics: fast computation of the gradient information for each objective through the use of neighboring solutions, and the use of this information to carry out a geometric partition of the search space and thus direct the exploration toward promising areas. Finally, the algorithm is evaluated and compared to policy-gradient MORL algorithms on different multi-objective problems: the water reservoir and the biped walking problem (the latter both in simulation and on a real robot).
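
To illustrate the first characteristic, the sketch below estimates per-objective gradients from already-evaluated neighboring solutions via a local least-squares fit, rather than from a batch of episodes per solution; the toy deterministic objectives and the least-squares construction are assumptions for illustration, not the authors' exact procedure.

import numpy as np

def evaluate(theta):
    """Toy deterministic objectives standing in for episode returns."""
    return np.array([-np.sum(theta ** 2), -np.sum((theta - 2.0) ** 2)])

def neighbor_gradients(theta, neighbors, values, base):
    """Fit a local linear model per objective from evaluated neighbors."""
    D = neighbors - theta          # displacement of each neighbor
    Y = values - base              # change in each objective
    # One least-squares solve per objective yields its estimated gradient.
    return np.linalg.lstsq(D, Y, rcond=None)[0].T   # shape: (n_obj, n_params)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
neighbors = theta + 0.1 * rng.normal(size=(8, 3))
values = np.array([evaluate(n) for n in neighbors])
grads = neighbor_gradients(theta, neighbors, values, evaluate(theta))
print(grads)   # one gradient row per objective, reusable to steer exploration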


Author(s): Huizhuo Cao, Xuemei Li, Vikrant Vaze, Xueyan Li

Multi-objective pricing of high-speed rail (HSR) passenger fares becomes a challenge when the HSR operator needs to deal with multiple conflicting objectives. Although many studies have tackled the challenge of calculating optimal fares over railway networks, none of them has focused on characterizing the trade-offs between multiple objectives under multi-modal competition. We formulate the multi-objective HSR fare-optimization problem over a linear network by introducing the epsilon-constraint method within a bi-level programming model and develop an iterative algorithm to solve this model. This is the first HSR pricing study to use an epsilon-constraint methodology. We obtain two single-objective solutions and four multi-objective solutions and compare them on a variety of metrics. We also derive the Pareto frontier between the objectives of profit and passenger welfare to enable the operator to choose the best trade-off. Our results, based on computational experiments with the Beijing–Shanghai regional network, provide several new insights. First, we find that small changes in fares can lead to a significant improvement in passenger welfare with no reduction in profitability under multi-objective optimization. Second, multi-objective optimization solutions show considerable improvements over the single-objective optimization solutions. Third, the Pareto frontier enables decision-makers to make more informed decisions when choosing among trade-offs. Overall, the explicit modeling of multiple objectives leads to better pricing solutions, which have the potential to guide pricing decisions for HSR operators.
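
The epsilon-constraint idea itself is easy to sketch in isolation: maximize profit subject to a lower bound on passenger welfare, then sweep the bound to trace out a Pareto frontier. The single-fare linear demand model, the unit cost, and the consumer-surplus welfare measure below are toy assumptions, not the paper's bi-level network formulation.

import numpy as np

def demand(fare):
    return max(0.0, 1000.0 - 8.0 * fare)        # toy linear demand

def profit(fare):
    return (fare - 20.0) * demand(fare)          # assumed unit cost of 20

def welfare(fare):
    q = demand(fare)
    return 0.5 * q * (125.0 - fare)              # consumer surplus under linear demand

fares = np.linspace(20.0, 125.0, 500)
frontier = []
for eps in np.linspace(0.0, welfare(20.0), 25):          # sweep the welfare floor
    feasible = [f for f in fares if welfare(f) >= eps]   # the epsilon-constraint
    if feasible:
        best = max(feasible, key=profit)                  # maximize profit
        frontier.append((round(profit(best)), round(welfare(best))))
print(frontier[:3])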


1995, Vol. 4 (1), pp. 3-28
Author(s): Mance E. Harmon, Leemon C. Baird, A. Harry Klopf

An application of reinforcement learning to a linear-quadratic differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual-gradient form of advantage updating. The game is a Markov decision process with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. Although a missile-and-plane scenario was the chosen test bed, the reinforcement learning approach presented here is equally applicable to biologically based systems, such as a predator pursuing prey. The reinforcement learning algorithm for optimal control is modified for differential games to find the minimax point rather than the maximum. Simulation results are compared to the analytical solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual-gradient and non-residual-gradient forms of advantage updating and Q-learning is compared, demonstrating that advantage updating converges faster than Q-learning in all simulations. Advantage updating is also demonstrated to converge regardless of the time-step duration, whereas Q-learning is unable to converge as the time-step duration grows small.
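
The minimax modification is easiest to see in a plain tabular Q-learning form, sketched below; the paper applies the same idea within the residual-gradient form of advantage updating, in continuous time, states, and actions, so the discrete toy game, reward, and pure-strategy minimax backup here are purely illustrative.

import numpy as np

n_states, n_missile, n_plane = 5, 3, 3
Q = np.zeros((n_states, n_missile, n_plane))    # joint-action value table
gamma, alpha = 0.95, 0.1
rng = np.random.default_rng(0)

for _ in range(5000):
    s = rng.integers(n_states)
    u = rng.integers(n_missile)                  # missile action (maximizer)
    d = rng.integers(n_plane)                    # plane action (minimizer)
    r = 1.0 if u == d else 0.0                   # toy reward: missile matches the evasion
    s_next = (s + 1) % n_states
    # Minimax backup: missile maximizes, assuming the plane picks its best reply.
    backup = Q[s_next].min(axis=1).max()
    Q[s, u, d] += alpha * (r + gamma * backup - Q[s, u, d])

print(Q[0])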


2021, Vol. 5 (4), pp. 1-24
Author(s): Jianguo Chen, Kenli Li, Keqin Li, Philip S. Yu, Zeng Zeng

As a new generation of Public Bicycle-sharing Systems (PBS), the Dockless PBS (DL-PBS) is an important application of cyber-physical systems and intelligent transportation. Using artificial intelligence to provide efficient bicycle dispatching solutions based on dynamic bicycle rental demand is an essential issue for DL-PBS. In this article, we propose MORL-BD, a dynamic bicycle dispatching algorithm based on multi-objective reinforcement learning, to provide the optimal bicycle dispatching solution for DL-PBS. We model the DL-PBS system from the perspective of cyber-physical systems and use deep learning to predict the layout of bicycle parking spots and the dynamic demand for bicycle dispatching. We define the multi-route bicycle dispatching problem as a multi-objective optimization problem by considering the optimization objectives of dispatching costs, the dispatch trucks' initial loads, workload balance among the trucks, and the dynamic balance of bicycle supply and demand. On this basis, the collaborative multi-route bicycle dispatching problem among multiple dispatch trucks is modeled as a multi-agent, multi-objective reinforcement learning model. All dispatch paths between parking spots are defined as the state space, and the reciprocal of the dispatching cost is defined as the reward. Each dispatch truck is equipped with an agent to learn the optimal dispatch path in the dynamic DL-PBS network. We maintain an elite list that stores the Pareto-optimal bicycle dispatch paths found at each action, and finally obtain the Pareto frontier. Experimental results on an actual DL-PBS show that, compared with existing methods, MORL-BD can find a higher-quality Pareto frontier in less execution time.
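
The elite-list bookkeeping amounts to a standard non-dominated filter, sketched below for objective vectors that are to be minimized; the example cost vectors are placeholders, not real dispatch solutions.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_elite(elite, candidate):
    """Insert the candidate unless it is dominated; drop members it dominates."""
    if any(dominates(e, candidate) for e in elite):
        return elite
    return [e for e in elite if not dominates(candidate, e)] + [candidate]

elite = []
for solution in [(10.0, 4.0), (8.0, 5.0), (9.0, 3.0), (12.0, 6.0)]:
    elite = update_elite(elite, solution)
print(elite)   # the surviving set approximates the Pareto frontier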


2021, Vol. 70, pp. 319-349
Author(s): Yongcan Cao, Huixin Zhan

Solving multi-objective optimization problems is important in various applications where users are interested in obtaining optimal policies subject to multiple (and often conflicting) objectives. A typical approach is to first construct a loss function based on a scalarization of the individual objectives and then derive optimal policies that minimize the scalarized loss. Although simple and efficient, this approach provides no insight into, or mechanism for, optimizing multiple objectives, because it cannot quantify the inter-objective relationship. To address this issue, we develop a new, efficient gradient-based multi-objective reinforcement learning approach that iteratively uncovers the quantitative inter-objective relationship by finding a minimum-norm point in the convex hull of the set of policy gradients when the impact of one objective on the others is unknown a priori. In particular, we first propose PAOLS, a new algorithm that integrates pruning with the approximate optimistic linear support algorithm to efficiently discover the weight-vector sets of multiple gradients that quantify the inter-objective relationship. We then construct an actor and a multi-objective critic that co-learn the policy and the multi-objective vector value function. Finally, the weight-discovery process and the policy and vector-value-function learning process are executed iteratively to yield stable weight-vector sets and policies. To validate the effectiveness of the proposed approach, we present a quantitative evaluation based on three case studies.
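
For the two-objective case, the minimum-norm point in the convex hull of the gradients has a well-known closed form (the MGDA-style weight used below); the toy gradients are illustrative, and the sketch does not attempt the paper's PAOLS weight-discovery procedure.

import numpy as np

def min_norm_two(g1, g2):
    """Return the minimum-norm point in the convex hull of {g1, g2}."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        return g1
    alpha = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)   # weight on g1
    return alpha * g1 + (1.0 - alpha) * g2

g1 = np.array([1.0, 0.0])    # gradient of objective 1 (toy values)
g2 = np.array([0.0, 1.0])    # gradient of objective 2
print(min_norm_two(g1, g2))  # a common direction; here [0.5, 0.5]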


Author(s): Sven Gronauer, Martin Gottwald, Klaus Diepold

Despite their remarkable success in recent years, the underlying mechanisms powering the advances of reinforcement learning are still poorly understood. In this paper, we identify these mechanisms, which we call ingredients, in on-policy policy gradient methods and empirically determine their impact on learning. To allow an equitable assessment, we conduct our experiments on a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not require sophisticated algorithms but can also be accomplished by combining a few simple ingredients.

