stationary policy
Recently Published Documents


TOTAL DOCUMENTS: 47 (FIVE YEARS: 6)
H-INDEX: 10 (FIVE YEARS: 0)

Author(s): Irina A. Kochetkova, Anastasia S. Vlaskina, Dmitriy V. Efrosinin, Abdukodir A. Khakimov, Sofiya A. Burtseva

The concept of cloud computing was created to better preserve user privacy and data storage security. However, the resources allocated for processing these data must themselves be distributed optimally. The problem of optimal resource management in the cloud computing environment is described in many scientific publications. Problems of optimal resource distribution can be addressed by constructing and analyzing queuing systems (QS). Based on the literature reviewed in the article, we analyze a two-buffer queuing system with cross-type service and additional penalties. This allows us to assess how suitable the model presented in the article is for application to cloud computing. For a given system, different rules for selecting requests from the queues (queue numbers) are possible, and the choice changes the transition intensities between the states of the system. The system therefore has a selection policy that lets it decide how to behave depending on its state. Such selection-management models have four components, one of which is a stationary policy for choosing the queue number from which a request is served on a vacated virtual machine each time, immediately before a service completion. A simulation model was built for numerical analysis. The results obtained indicate that requests are practically not delayed in the queues of the presented QS, and therefore the policy for the given model can be considered optimal. Although the Poisson flow is the simplest to simulate, it is quite acceptable for performance evaluation. In the future, we plan to conduct several more experiments for different request intensities and various types of incoming flows.
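
As a rough illustration of the kind of model described above, the sketch below simulates a single virtual machine serving two queues with Poisson arrivals, applying a stationary queue-selection policy at every service completion. All rates, the specific policy, and the reported metric are illustrative assumptions, not parameters taken from the article.

```python
# Minimal sketch (assumed parameters, not the article's model): two queues with
# Poisson arrivals, one virtual machine, and a stationary policy that picks the
# queue to serve each time the machine becomes free.
import heapq
import random

LAM = (0.6, 0.4)   # assumed arrival rates for queues 0 and 1
MU = 1.5           # assumed service rate of the virtual machine
T_END = 10_000.0   # simulated time horizon


def select_queue(queues):
    """Stationary selection policy (illustrative): serve the longer queue."""
    return 0 if len(queues[0]) >= len(queues[1]) else 1


def simulate(seed=0):
    rng = random.Random(seed)
    events = [(rng.expovariate(LAM[q]), "arrival", q) for q in (0, 1)]
    heapq.heapify(events)
    queues = ([], [])      # arrival timestamps of waiting requests
    busy = False
    waits = []             # waiting times of served requests
    while events:
        t, kind, q = heapq.heappop(events)
        if t > T_END:
            break
        if kind == "arrival":
            queues[q].append(t)
            heapq.heappush(events, (t + rng.expovariate(LAM[q]), "arrival", q))
        else:              # a service completion frees the virtual machine
            busy = False
        if not busy and (queues[0] or queues[1]):
            k = select_queue(queues)
            if not queues[k]:          # chosen queue may be empty; take the other
                k = 1 - k
            waits.append(t - queues[k].pop(0))
            heapq.heappush(events, (t + rng.expovariate(MU), "departure", k))
            busy = True
    return sum(waits) / len(waits) if waits else 0.0


if __name__ == "__main__":
    print("mean waiting time:", simulate())
```

The rule used here (serve the longer queue) is only one possible stationary policy; the article's four-component selection model would take the place of select_queue.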


2021, Vol 58 (2), pp. 523-550
Author(s): Xin Guo, Yonghui Huang

This paper considers risk-sensitive average optimization for denumerable continuous-time Markov decision processes (CTMDPs), in which the transition and cost rates are allowed to be unbounded, and the policies are allowed to be randomized and history-dependent. We first derive the multiplicative dynamic programming principle and some new facts for risk-sensitive finite-horizon CTMDPs. Then, we establish the existence and uniqueness of a solution to the risk-sensitive average optimality equation (RS-AOE) through the results for risk-sensitive finite-horizon CTMDPs developed here, and also prove the existence of an optimal stationary policy via the RS-AOE. Furthermore, for the case of finite actions available at each state, we construct a sequence of models of finite-state CTMDPs with optimal stationary policies which can be obtained by a policy iteration algorithm in a finite number of iterations, and prove that an average optimal policy for the case of infinitely countable states can be approximated by those of the finite-state models. Finally, we illustrate the conditions and the iteration algorithm with an example.
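
For orientation, a risk-sensitive average optimality equation for CTMDPs is typically written in the following multiplicative form; the notation here is generic and simplified, and the paper's precise statement and conditions may differ:

```latex
% Generic form of a risk-sensitive average optimality equation for CTMDPs
% (illustrative notation; see the paper for the exact statement and assumptions).
g^{*}\, h(i) \;=\; \inf_{a \in A(i)} \Big\{ c(i,a)\, h(i) \;+\; \sum_{j \in S} q(j \mid i,a)\, h(j) \Big\},
\qquad i \in S,\ h > 0,
```

where c is the cost rate, q the transition rate, g* encodes the optimal risk-sensitive average cost, and a stationary policy choosing minimizing actions is average optimal.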


2021, Vol 229, pp. 01047
Author(s): Abdellatif Semmouri, Mostafa Jourhmane, Bahaa Eddine Elbaghazaoui

In this paper we consider constrained optimization of discrete-time Markov decision processes (MDPs) with finite state and action spaces, which accumulate both a reward and costs at each decision epoch. We study the problem of finding a policy that maximizes the expected total discounted reward subject to the constraints that the expected total discounted costs do not exceed given values. To compute an optimal or a nearly optimal stationary policy, we investigate a decomposition of the state space into strongly communicating classes. The discounted criterion has many applications in areas such as forest management, energy consumption management, finance, communication systems (mobile networks), and artificial intelligence.
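
In generic notation (the symbols here are standard textbook notation, not taken verbatim from the paper), the constrained problem described above reads:

```latex
% Constrained discounted MDP (generic textbook notation).
\max_{\pi} \; \mathbb{E}^{\pi}_{x}\!\Big[\sum_{t=0}^{\infty} \beta^{t}\, r(x_t,a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}^{\pi}_{x}\!\Big[\sum_{t=0}^{\infty} \beta^{t}\, c_k(x_t,a_t)\Big] \le d_k,
\qquad k = 1,\dots,K,
```

where β ∈ (0,1) is the discount factor, r the reward, c_k the costs accumulated at each decision epoch, and d_k the given bounds.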


Author(s): Nicole Bäuerle, Anna Jaśkiewicz, Andrzej S. Nowak

In this paper, we study a Markov decision process with a non-linear discount function and with a Borel state space. We define a recursive discounted utility, which resembles non-additive utility functions considered in a number of models in economics. Non-additivity here follows from non-linearity of the discount function. Our study is complementary to the work of Jaśkiewicz et al. (Math Oper Res 38:108–121, 2013), where non-linear discounting is also used in the stochastic setting, but the expectation of utilities aggregated on the space of all histories of the process is applied, leading to a non-stationary dynamic programming model. Our aim is to prove that in the recursive discounted utility case the Bellman equation has a solution and there exists an optimal stationary policy for the problem in the infinite time horizon. Our approach includes two cases: (a) when the one-stage utility is bounded on both sides by a weight function multiplied by some positive and negative constants, and (b) when the one-stage utility is unbounded from below.
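
In generic notation, the Bellman equation studied here replaces the usual linear factor βv by a non-linear discount function δ applied to the expected continuation value (conditions on δ, the one-stage utility u, and the transition kernel q are those of the paper):

```latex
% Bellman equation with a non-linear discount function (generic notation).
v(x) \;=\; \sup_{a \in A(x)} \Big\{ u(x,a) \;+\; \delta\!\Big( \int_{X} v(y)\, q(dy \mid x,a) \Big) \Big\},
\qquad x \in X,
```

with linear discounting recovered for δ(t) = βt.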


2020, Vol 34 (10), pp. 13771-13772
Author(s): Ian Davies, Zheng Tian, Jun Wang

Multi-Agent Reinforcement Learning (MARL) considers settings in which a set of coexisting agents interact with one another and their environment. The adaptation and learning of other agents induces non-stationarity in the environment dynamics. This poses a great challenge for value function-based algorithms whose convergence usually relies on the assumption of a stationary environment. Policy search algorithms also struggle in multi-agent settings as the partial observability resulting from an opponent's actions not being known introduces high variance to policy training. Modelling an agent's opponent(s) is often pursued as a means of resolving the issues arising from the coexistence of learning opponents. An opponent model provides an agent with some ability to reason about other agents to aid its own decision making. Most prior works learn an opponent model by assuming the opponent is employing a stationary policy or switching between a set of stationary policies. Such an approach can reduce the variance of training signals for policy search algorithms. However, in the multi-agent setting, agents have an incentive to continually adapt and learn. This means that the assumptions concerning opponent stationarity are unrealistic. In this work, we develop a novel approach to modelling an opponent's learning dynamics which we term Learning to Model Opponent Learning (LeMOL). We show our structured opponent model is more accurate and stable than naive behaviour cloning baselines. We further show that opponent modelling can improve the performance of algorithmic agents in multi-agent settings.
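
The sketch below is not LeMOL; it is a toy version of the naive behaviour-cloning baseline the abstract mentions, which assumes the opponent follows a stationary policy and simply estimates action frequencies per observed state. The class name and example data are hypothetical.

```python
# Minimal sketch of a naive behaviour-cloning opponent model (the kind of
# baseline the paper compares against). It assumes a *stationary* opponent
# policy and estimates the opponent's action distribution per observed state.
from collections import Counter, defaultdict


class BehaviourCloningOpponentModel:
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.counts = defaultdict(Counter)   # state -> Counter over opponent actions

    def update(self, state, opponent_action):
        """Record one observed (state, opponent action) pair."""
        self.counts[state][opponent_action] += 1

    def predict(self, state):
        """Estimate the probability of each opponent action in this state.

        With no data, fall back to a uniform distribution. A learning opponent
        violates the stationarity assumption, which is the issue the paper targets.
        """
        c = self.counts[state]
        total = sum(c.values())
        if total == 0:
            return [1.0 / self.n_actions] * self.n_actions
        return [c[a] / total for a in range(self.n_actions)]


# Example usage with hypothetical observations:
model = BehaviourCloningOpponentModel(n_actions=3)
for s, a in [("s0", 1), ("s0", 1), ("s0", 2)]:
    model.update(s, a)
print(model.predict("s0"))   # -> [0.0, 0.666..., 0.333...]
```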


2015, Vol 52 (2), pp. 419-440
Author(s): Rolando Cavazos-Cadena, Raúl Montes-De-Oca, Karel Sladký

This paper concerns discrete-time Markov decision chains with denumerable state and compact action sets. Besides standard continuity requirements, the main assumption on the model is that it admits a Lyapunov function ℓ. In this context the average reward criterion is analyzed from the sample-path point of view. The main conclusion is that if the expected average reward associated with ℓ² is finite under any policy, then a stationary policy obtained from the optimality equation in the standard way is sample-path average optimal in a strong sense.
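
In generic notation, the (risk-neutral) average reward optimality equation referred to above reads:

```latex
% Average reward optimality equation (generic notation).
g + h(x) \;=\; \sup_{a \in A(x)} \Big\{ r(x,a) + \sum_{y \in S} p(y \mid x,a)\, h(y) \Big\},
\qquad x \in S,
```

and the stationary policy selecting a maximizing action at each state is, under the Lyapunov condition on ℓ, the one shown to be sample-path average optimal in a strong sense.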


2014, Vol 01 (01), pp. 1450008
Author(s): William Solecki, Cynthia Rosenzweig

This paper illustrates and examines the development of a flexible climate adaptation approach and non-stationary climate policy in New York City in the post-Hurricane Sandy context. Extreme events such as Hurricane Sandy are presented as learning opportunities that create a policy window for outside-of-the-box solutions and experimentation. The research investigates the institutionalization of laws, standards, and codes that are required to reflect an increasingly dynamic set of local environmental stresses associated with climate change. The City of New York responded to Hurricane Sandy with a set of targeted adjustments to the existing infrastructure and building stock that make it both more resistant (i.e., strengthened) and more resilient (i.e., responsive to stress) in the face of future extreme events. Post-Sandy experience in New York shows that the conditions for a post-disaster flexible adaptation response exist, and that the beginnings of a non-stationary policy generation process have been put into place. More broadly, post-disaster policy processes have been configured in New York to enable continuous co-production of knowledge by scientists and the community of decision-makers and stakeholders.


2014, Vol 46 (1), pp. 121-138
Author(s): Ulrich Rieder, Marc Wittlinger

We consider an investment problem where observing and trading are only possible at random times. In addition, we introduce drawdown constraints which require that the investor's wealth does not fall below a previously fixed percentage of its running maximum. The financial market consists of a riskless bond and a stock which is driven by a Lévy process. Moreover, a general utility function is assumed. In this setting we solve the investment problem using a related limsup Markov decision process. We show that the value function can be characterized as the unique fixed point of the Bellman equation and verify the existence of an optimal stationary policy. Under some mild assumptions the value function can be approximated by the value function of a contracting Markov decision process. We are able to use Howard's policy improvement algorithm for computing the value function as well as an optimal policy. These results are illustrated in a numerical example.
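
To illustrate only the algorithmic pattern, the sketch below runs Howard's policy improvement (policy iteration) on a made-up finite discounted MDP; it is not the paper's Lévy-driven investment model, and the transition probabilities, rewards, and discount factor are arbitrary assumptions.

```python
# Illustrative sketch of Howard's policy improvement on a toy finite discounted
# MDP (made-up data, not the paper's investment problem).
import numpy as np

n_states, n_actions, beta = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.random((n_states, n_actions))                              # rewards r(s, a)


def policy_iteration(P, R, beta):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - beta * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(n_states), policy]
        r_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - beta * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        q = R + beta * P @ v                 # q[s, a]
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy


policy, v = policy_iteration(P, R, beta)
print("optimal policy:", policy, "values:", np.round(v, 3))
```

Each improvement step weakly increases the value function, so the loop terminates with an optimal stationary policy after finitely many iterations on a finite model.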

