Direct Expected Quadratic Utility Maximization for Mean-Variance Controlled Reinforcement Learning

Author(s):  
Masahiro Kato ◽  
Kei Nakagawa


Author(s):  
Xin Huang ◽  
Duan Li

Traditional models of mean-variance portfolio selection often assume full knowledge of the statistics of asset returns, which is not always the case in real financial markets. This paper deals with an ambiguous mean-variance portfolio selection problem in which the returns of the risky assets follow a mixture model whose component proportions are unknown to the investor but constant over time. Taking into account how future observations update these proportions is essential for finding an optimal policy with an active learning feature, but it makes the problem intractable for classical methods. Using reinforcement learning, we derive an investment policy with a learning feature in a two-level framework. At the lower level, a time-decomposed approach (dynamic programming) is adopted to solve a family of scenario subproblems, each of which specifies the sequence of component distributions over the multiple time periods. At the upper level, a scenario-decomposed approach (the progressive hedging algorithm) iteratively aggregates the scenario solutions from the lower level based on the current knowledge of the proportions, and this two-level solution framework is repeated in a rolling-horizon manner. We carry out experimental studies to illustrate the execution of our policy scheme.
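
To make the two-level structure concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation: the lower-level dynamic program is replaced by a single-period mean-variance subproblem with a closed-form solution, and all names (solve_scenario_dp, progressive_hedging, the penalty parameter rho, the toy scenario data) are illustrative.

```python
# Minimal sketch of the two-level scheme (illustrative; not the authors' code).
# Lower level: one quadratic subproblem per scenario, augmented with the
# progressive-hedging multiplier and proximal penalty.
# Upper level: probability-weighted aggregation and dual (multiplier) updates.

import numpy as np

def solve_scenario_dp(mean, cov, w_bar, multiplier, rho):
    """Stand-in for the lower-level dynamic program: minimize
    w' cov w - mean' w + multiplier' w + (rho/2) ||w - w_bar||^2."""
    n = len(mean)
    A = 2.0 * cov + rho * np.eye(n)
    b = mean - multiplier + rho * w_bar
    return np.linalg.solve(A, b)

def progressive_hedging(scenarios, proportions, rho=1.0, n_iter=50):
    """Upper level: aggregate scenario portfolios into one implementable
    portfolio, weighting scenarios by the current proportion estimates."""
    n = len(scenarios[0][0])
    w_bar = np.zeros(n)                              # aggregated portfolio
    multipliers = [np.zeros(n) for _ in scenarios]   # one dual vector per scenario
    for _ in range(n_iter):
        w_s = [solve_scenario_dp(m, c, w_bar, lam, rho)
               for (m, c), lam in zip(scenarios, multipliers)]
        w_bar = sum(p * w for p, w in zip(proportions, w_s))
        multipliers = [lam + rho * (w - w_bar)
                       for lam, w in zip(multipliers, w_s)]
    return w_bar

# Toy example: two component distributions, current proportion estimates 0.6/0.4.
rng = np.random.default_rng(0)
scenarios = []
for _ in range(2):
    mu = rng.normal(0.05, 0.02, size=3)
    A = rng.normal(size=(3, 3))
    scenarios.append((mu, 0.01 * A @ A.T + 0.01 * np.eye(3)))
print(progressive_hedging(scenarios, proportions=[0.6, 0.4]))
```

In the actual framework described above, each lower-level subproblem is a multi-period dynamic program and the proportion estimates are refreshed from new observations before every rolling-horizon repetition; the closed-form quadratic step here merely stands in for that subproblem solver.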


2011 ◽  
Author(s):  
Aleš Černý ◽  
Fabio Maccheroni ◽  
Massimo Marinacci ◽  
Aldo Rustichini

1984 ◽  
Vol 39 (1) ◽  
pp. 47-61 ◽  
Author(s):  
YORAM KROLL ◽  
HAIM LEVY ◽  
HARRY M. MARKOWITZ

1991 ◽  
Vol 13 (2) ◽  
pp. 289 ◽  
Author(s):  
Robert A. Collins ◽  
Edward E. Gbur

2018 ◽  
Vol 27 (08) ◽  
pp. 1850034 ◽  
Author(s):  
Erick Asiain ◽  
Julio B. Clempner ◽  
Alexander S. Poznyak

In problems involving the control of financial processes, it is usually difficult to quantify the state variables exactly. Acquiring the exact value of a given state can be expensive, even when it is physically possible to do so. In such cases it may be useful to base the decision-making process on inaccurate information about the system state. In addition, modeling a real-world application requires the values of the environment parameters (transition probabilities and observation probabilities) and the reward functions, which are typically hand-tuned by experts in the field until they produce satisfactory behavior, an undesirable process. To address these shortcomings, this paper provides a new Reinforcement Learning (RL) framework for computing the mean-variance customer portfolio with transaction costs in controllable Partially Observable Markov Decision Processes (POMDPs). The solution is restricted to finite state, action, and observation sets and to average-reward problems. For solving this problem, a controller/actor-critic architecture is proposed, which balances the difficult tasks of exploiting and exploring the environment. The architecture consists of three modules: a controller, fast-tracked portfolio learning, and an actor-critic module. Each module involves the design of a convergent Temporal Difference (TD) learning algorithm. We employ three different learning rules to estimate: (a) the transition matrices, (b) the rewards, and (c) the resources destined for carrying out a promotion. We present a proof that the estimated transition matrix rule converges as t → ∞. To solve the resulting optimization problem, we extend the c-variable method to partially observable Markov chains. The c-variable is conceptualized as a joint strategy given by the product of the control policy, the observation kernel Q(y|s), and the stationary distribution vector. A major advantage of this procedure is that it can be implemented efficiently in real settings with controllable POMDPs. A numerical example illustrates the results of the proposed method.
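
To give a flavor of these learning rules, the sketch below is a toy illustration under stated assumptions, not the paper's algorithm: it simulates a small tabular POMDP, maintains running estimates of the transition matrix and the rewards from observed transitions, and runs a simple average-reward actor-critic (TD) update on the observation process. The toy environment, the softmax actor, and all step sizes are assumptions for demonstration only.

```python
# Toy illustration (not the paper's algorithm) of TD-style estimation in a
# small simulated POMDP: running estimates of the transition matrix and the
# rewards, plus an average-reward actor-critic update on the observations.

import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n_obs = 4, 2, 3

# Hidden ground truth, used only to generate experience.
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
Q_obs = rng.dirichlet(np.ones(n_obs), size=n_states)        # observation kernel Q(y|s)
R_true = rng.normal(0.0, 1.0, size=(n_states, n_actions))

# Learner's estimates. For simplicity, this toy lets the estimators of P and R
# see the hidden state (e.g., as delayed labels); only the actor-critic below
# works purely from observations.
counts = np.ones((n_states, n_actions, n_states))            # transition counts (Laplace prior)
R_hat = np.zeros((n_states, n_actions))
n_sa = np.zeros((n_states, n_actions))

# Actor-critic quantities defined on observations (the state is hidden).
theta = np.zeros((n_obs, n_actions))                         # softmax policy parameters
V = np.zeros(n_obs)                                          # critic (differential values)
rho_hat = 0.0                                                # average-reward estimate
alpha, beta, eta = 0.05, 0.05, 0.01                          # step sizes (assumed)

def softmax_policy(y):
    z = np.exp(theta[y] - theta[y].max())
    return z / z.sum()

s = rng.integers(n_states)
y = rng.choice(n_obs, p=Q_obs[s])
for t in range(20000):
    pi = softmax_policy(y)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P_true[s, a])
    y_next = rng.choice(n_obs, p=Q_obs[s_next])
    r = R_true[s, a] + rng.normal(0.0, 0.1)

    # (a) transition estimate and (b) reward estimate as running averages.
    counts[s, a, s_next] += 1.0
    n_sa[s, a] += 1.0
    R_hat[s, a] += (r - R_hat[s, a]) / n_sa[s, a]

    # Average-reward TD error drives critic, actor, and average-reward updates.
    delta = r - rho_hat + V[y_next] - V[y]
    rho_hat += eta * delta
    V[y] += alpha * delta
    grad = -pi
    grad[a] += 1.0                                            # grad of log pi(a|y) wrt theta[y]
    theta[y] += beta * delta * grad

    s, y = s_next, y_next

P_hat = counts / counts.sum(axis=2, keepdims=True)
print("max |P_hat - P_true| =", np.abs(P_hat - P_true).max())
print("estimated average reward:", rho_hat)
```

In the paper the policy itself is obtained by solving the c-variable optimization over the joint strategy; the softmax actor here is only a stand-in showing how a single TD error can drive both the critic and the parameter updates.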


2012 ◽  
Vol 48 (6) ◽  
pp. 386-395 ◽  
Author(s):  
Aleš Černý ◽  
Fabio Maccheroni ◽  
Massimo Marinacci ◽  
Aldo Rustichini
