Multi-objective Reinforcement Learning through Continuous Pareto Manifold Approximation

2016, Vol. 57, pp. 187-227
Author(s): Simone Parisi, Matteo Pirotta, Marcello Restelli

Many real-world control applications, from economics to robotics, are characterized by the presence of multiple conflicting objectives. In these problems, the standard concept of optimality is replaced by Pareto-optimality, and the goal is to find the Pareto frontier, a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, achieving an accurate representation of the Pareto frontier is still an important challenge. In this paper, we propose a reinforcement learning policy gradient approach to learn a continuous approximation of the Pareto frontier in multi-objective Markov Decision Problems (MOMDPs). Unlike previous policy gradient algorithms, which execute n optimization routines to obtain n solutions, our approach performs a single gradient-ascent run, generating at each step an improved continuous approximation of the Pareto frontier. The idea is to optimize the parameters of a function defining a manifold in the policy-parameter space, so that the corresponding image in the objective space gets as close as possible to the true Pareto frontier. Besides deriving how to compute and estimate such a gradient, we also discuss the non-trivial issue of defining a metric to assess the quality of the candidate Pareto frontiers. Finally, the properties of the proposed approach are empirically evaluated on two problems, a linear-quadratic Gaussian regulator and a water reservoir control task.
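
A minimal numerical sketch may help fix the manifold-ascent idea: a map phi(t; rho) from a scalar t in [0, 1] to policy parameters is adjusted in a single gradient-ascent run so that its image in objective space improves a frontier-quality metric. The toy objectives, the t-dependent scalarization used as the metric, the form of phi, and the finite-difference gradient are all illustrative assumptions, not the paper's derived gradient or its frontier metric.

import numpy as np

def phi(t, rho):
    """Hypothetical manifold: linear interpolation between two parameter
    vectors plus a learned bend (rho packs three vectors a, b, c)."""
    a, b, c = rho
    return (1 - t) * a + t * b + t * (1 - t) * c

def objectives(theta):
    """Stand-in for policy evaluation, returning two conflicting returns."""
    return np.array([-np.sum((theta - 1.0) ** 2),
                     -np.sum((theta + 1.0) ** 2)])

def frontier_quality(rho, ts):
    """Crude metric: score each manifold point with a t-dependent
    scalarization so different t's are pushed toward different trade-offs."""
    total = 0.0
    for t in ts:
        j = objectives(phi(t, rho))
        total += (1 - t) * j[0] + t * j[1]
    return total

def ascend(rho, ts, lr=1e-2, eps=1e-4, steps=200):
    """Single gradient-ascent run on the manifold parameters rho,
    using finite differences in place of the paper's analytic gradient."""
    rho = [p.copy() for p in rho]
    for _ in range(steps):
        base = frontier_quality(rho, ts)
        grads = []
        for i, p in enumerate(rho):
            g = np.zeros_like(p)
            for j in range(p.size):
                bumped = [q.copy() for q in rho]
                bumped[i][j] += eps
                g[j] = (frontier_quality(bumped, ts) - base) / eps
            grads.append(g)
        rho = [p + lr * g for p, g in zip(rho, grads)]
    return rho

rho0 = [np.zeros(2), np.zeros(2), np.zeros(2)]
ts = np.linspace(0.0, 1.0, 11)
rho_star = ascend(rho0, ts)
print([objectives(phi(t, rho_star)).round(2) for t in (0.0, 0.5, 1.0)])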

2019, Vol. 18 (03), pp. 1045-1082
Author(s): Javier García, Roberto Iglesias, Miguel A. Rodríguez, Carlos V. Regueiro

Real-world problems usually involve the optimization of multiple, possibly conflicting, objectives. These problems may be addressed by multi-objective reinforcement learning (MORL) techniques. MORL is a generalization of standard reinforcement learning (RL) in which the single reward signal is extended to multiple signals, one for each objective, and it is the process of learning policies that optimize multiple objectives simultaneously. In these problems, directional (gradient) information can be used to guide exploration toward increasingly better behaviors. However, traditional policy-gradient approaches have two main drawbacks: they require a batch of episodes to properly estimate the gradient information (which slows down learning), and they use stochastic policies, which can have a disastrous impact on the safety of the learning system. In this paper, we present a novel population-based MORL algorithm for problems in which the underlying objectives are reasonably smooth. It has two main characteristics: fast computation of the gradient information for each objective through the use of neighboring solutions, and the use of this information to carry out a geometric partition of the search space and thus direct the exploration toward promising areas. Finally, the algorithm is evaluated and compared to policy-gradient MORL algorithms on different multi-objective problems: the water reservoir and the biped walking problem (the latter both in simulation and on a real robot).
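
To illustrate the first characteristic, the sketch below estimates per-objective gradients from already-evaluated neighboring solutions via a local least-squares fit, rather than from a batch of episodes per solution; the toy deterministic objectives and the least-squares construction are assumptions for illustration, not the authors' exact procedure.

import numpy as np

def evaluate(theta):
    """Toy deterministic objectives standing in for episode returns."""
    return np.array([-np.sum(theta ** 2), -np.sum((theta - 2.0) ** 2)])

def neighbor_gradients(theta, neighbors, values, base):
    """Fit a local linear model per objective from evaluated neighbors."""
    D = neighbors - theta          # displacement of each neighbor
    Y = values - base              # change in each objective
    # One least-squares solve per objective yields its estimated gradient.
    return np.linalg.lstsq(D, Y, rcond=None)[0].T   # shape: (n_obj, n_params)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
neighbors = theta + 0.1 * rng.normal(size=(8, 3))
values = np.array([evaluate(n) for n in neighbors])
grads = neighbor_gradients(theta, neighbors, values, evaluate(theta))
print(grads)   # one gradient row per objective, reusable to steer exploration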


Author(s): Huizhuo Cao, Xuemei Li, Vikrant Vaze, Xueyan Li

Multi-objective pricing of high-speed rail (HSR) passenger fares becomes a challenge when the HSR operator needs to deal with multiple conflicting objectives. Although many studies have tackled the challenge of calculating optimal fares over railway networks, none of them has focused on characterizing the trade-offs between multiple objectives under multi-modal competition. We formulate the multi-objective HSR fare-optimization problem over a linear network by introducing the epsilon-constraint method within a bi-level programming model and develop an iterative algorithm to solve this model. This is the first HSR pricing study to use an epsilon-constraint methodology. We obtain two single-objective solutions and four multi-objective solutions and compare them on a variety of metrics. We also derive the Pareto frontier between the objectives of profit and passenger welfare to enable the operator to choose the best trade-off. Our results, based on computational experiments with the Beijing–Shanghai regional network, provide several new insights. First, we find that small changes in fares can lead to a significant improvement in passenger welfare with no reduction in profitability under multi-objective optimization. Second, multi-objective optimization solutions show considerable improvements over the single-objective optimization solutions. Third, the Pareto frontier enables decision-makers to make more informed decisions when choosing among trade-offs. Overall, the explicit modeling of multiple objectives leads to better pricing solutions, which have the potential to guide pricing decisions for HSR operators.
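
The epsilon-constraint idea itself is easy to sketch in isolation: maximize profit subject to a lower bound on passenger welfare, then sweep the bound to trace out a Pareto frontier. The single-fare linear demand model, the unit cost, and the consumer-surplus welfare measure below are toy assumptions, not the paper's bi-level network formulation.

import numpy as np

def demand(fare):
    return max(0.0, 1000.0 - 8.0 * fare)        # toy linear demand

def profit(fare):
    return (fare - 20.0) * demand(fare)          # assumed unit cost of 20

def welfare(fare):
    q = demand(fare)
    return 0.5 * q * (125.0 - fare)              # consumer surplus under linear demand

fares = np.linspace(20.0, 125.0, 500)
frontier = []
for eps in np.linspace(0.0, welfare(20.0), 25):          # sweep the welfare floor
    feasible = [f for f in fares if welfare(f) >= eps]   # the epsilon-constraint
    if feasible:
        best = max(feasible, key=profit)                  # maximize profit
        frontier.append((round(profit(best)), round(welfare(best))))
print(frontier[:3])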


1995, Vol. 4 (1), pp. 3-28
Author(s): Mance E. Harmon, Leemon C. Baird, A. Harry Klopf

An application of reinforcement learning to a linear-quadratic differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual-gradient form of advantage updating. The game is a Markov decision process with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. Although a missile-and-plane scenario was the chosen test bed, the reinforcement learning approach presented here is equally applicable to biologically based systems, such as a predator pursuing prey. The reinforcement learning algorithm for optimal control is modified for differential games to find the minimax point rather than the maximum. Simulation results are compared to the analytical solution, demonstrating that the simulated reinforcement learning system converges to the optimal answer. The performance of both the residual-gradient and non-residual-gradient forms of advantage updating and Q-learning is compared, demonstrating that advantage updating converges faster than Q-learning in all simulations. Advantage updating is also demonstrated to converge regardless of the time-step duration, whereas Q-learning is unable to converge as the time-step duration grows small.
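
The minimax modification is easiest to see in a plain tabular Q-learning form, sketched below; the paper applies the same idea within the residual-gradient form of advantage updating, in continuous time, states, and actions, so the discrete toy game, reward, and pure-strategy minimax backup here are purely illustrative.

import numpy as np

n_states, n_missile, n_plane = 5, 3, 3
Q = np.zeros((n_states, n_missile, n_plane))    # joint-action value table
gamma, alpha = 0.95, 0.1
rng = np.random.default_rng(0)

for _ in range(5000):
    s = rng.integers(n_states)
    u = rng.integers(n_missile)                  # missile action (maximizer)
    d = rng.integers(n_plane)                    # plane action (minimizer)
    r = 1.0 if u == d else 0.0                   # toy reward: missile matches the evasion
    s_next = (s + 1) % n_states
    # Minimax backup: missile maximizes, assuming the plane picks its best reply.
    backup = Q[s_next].min(axis=1).max()
    Q[s, u, d] += alpha * (r + gamma * backup - Q[s, u, d])

print(Q[0])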


2021, Vol. 5 (4), pp. 1-24
Author(s): Jianguo Chen, Kenli Li, Keqin Li, Philip S. Yu, Zeng Zeng

As a new generation of Public Bicycle-sharing Systems (PBS), the Dockless PBS (DL-PBS) is an important application of cyber-physical systems and intelligent transportation. Using artificial intelligence to provide efficient bicycle dispatching solutions based on dynamic bicycle rental demand is an essential issue for DL-PBS. In this article, we propose MORL-BD, a dynamic bicycle dispatching algorithm based on multi-objective reinforcement learning, to provide the optimal bicycle dispatching solution for DL-PBS. We model the DL-PBS system from the perspective of cyber-physical systems and use deep learning to predict the layout of bicycle parking spots and the dynamic demand for bicycle dispatching. We define the multi-route bicycle dispatching problem as a multi-objective optimization problem by considering the optimization objectives of dispatching costs, the dispatch trucks' initial loads, workload balance among the trucks, and the dynamic balance of bicycle supply and demand. On this basis, the collaborative multi-route bicycle dispatching problem among multiple dispatch trucks is modeled as a multi-agent, multi-objective reinforcement learning model. All dispatch paths between parking spots are defined as the state space, and the reciprocal of the dispatching cost is defined as the reward. Each dispatch truck is equipped with an agent to learn the optimal dispatch path in the dynamic DL-PBS network. We maintain an elite list that stores the Pareto-optimal bicycle dispatch paths found at each action, and finally obtain the Pareto frontier. Experimental results on an actual DL-PBS show that, compared with existing methods, MORL-BD can find a higher-quality Pareto frontier in less execution time.
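
The elite-list bookkeeping amounts to a standard non-dominated filter, sketched below for objective vectors that are to be minimized; the example cost vectors are placeholders, not real dispatch solutions.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_elite(elite, candidate):
    """Insert the candidate unless it is dominated; drop members it dominates."""
    if any(dominates(e, candidate) for e in elite):
        return elite
    return [e for e in elite if not dominates(candidate, e)] + [candidate]

elite = []
for solution in [(10.0, 4.0), (8.0, 5.0), (9.0, 3.0), (12.0, 6.0)]:
    elite = update_elite(elite, solution)
print(elite)   # the surviving set approximates the Pareto frontier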


2021, Vol. 70, pp. 319-349
Author(s): Yongcan Cao, Huixin Zhan

Solving multi-objective optimization problems is important in various applications where users are interested in obtaining optimal policies subject to multiple (and often conflicting) objectives. A typical approach is to first construct a loss function based on a scalarization of the individual objectives and then derive optimal policies that minimize the scalarized loss. Although simple and efficient, this approach provides no insight into, or mechanism for, optimizing multiple objectives, because it cannot quantify the inter-objective relationship. To address this issue, we develop a new, efficient gradient-based multi-objective reinforcement learning approach that iteratively uncovers the quantitative inter-objective relationship by finding a minimum-norm point in the convex hull of the set of policy gradients when the impact of one objective on the others is unknown a priori. In particular, we first propose PAOLS, a new algorithm that integrates pruning with the approximate optimistic linear support algorithm to efficiently discover the weight-vector sets of multiple gradients that quantify the inter-objective relationship. We then construct an actor and a multi-objective critic that co-learn the policy and the multi-objective vector value function. Finally, the weight-discovery process and the policy and vector-value-function learning process are executed iteratively to yield stable weight-vector sets and policies. To validate the effectiveness of the proposed approach, we present a quantitative evaluation based on three case studies.
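
For the two-objective case, the minimum-norm point in the convex hull of the gradients has a well-known closed form (the MGDA-style weight used below); the toy gradients are illustrative, and the sketch does not attempt the paper's PAOLS weight-discovery procedure.

import numpy as np

def min_norm_two(g1, g2):
    """Return the minimum-norm point in the convex hull of {g1, g2}."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        return g1
    alpha = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)   # weight on g1
    return alpha * g1 + (1.0 - alpha) * g2

g1 = np.array([1.0, 0.0])    # gradient of objective 1 (toy values)
g2 = np.array([0.0, 1.0])    # gradient of objective 2
print(min_norm_two(g1, g2))  # a common direction; here [0.5, 0.5]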


Author(s): Sven Gronauer, Martin Gottwald, Klaus Diepold

Despite their remarkable success in recent years, the underlying mechanisms powering the advances of reinforcement learning are still poorly understood. In this paper, we identify these mechanisms, which we call ingredients, in on-policy policy gradient methods and empirically determine their impact on learning. To allow an equitable assessment, we conduct our experiments on a unified and modular implementation. Our results underline the significance of recent algorithmic advances and demonstrate that reaching state-of-the-art performance may not require sophisticated algorithms but can also be accomplished by combining a few simple ingredients.

