LSTM-DDPG for Trading with Variable Positions

Sensors, 2021, Vol 21 (19), pp. 6571
Author(s): Zhichao Jia, Qiang Gao, Xiaohong Peng

In recent years, machine learning for trading has been widely studied. In trading decisions, both the direction and the size of the position should be determined according to market conditions. However, no research to date has considered variable position sizes in models developed for trading. In this paper, we propose a deep reinforcement learning model named LSTM-DDPG to make trading decisions with variable positions. Specifically, we consider the trading process as a Partially Observable Markov Decision Process, in which a long short-term memory (LSTM) network is used to extract market state features and the deep deterministic policy gradient (DDPG) framework is used to make trading decisions concerning the direction and the variable size of the position. We test the LSTM-DDPG model on IF300 (index futures of the China stock market) data, and the results show that LSTM-DDPG with variable positions performs better in terms of return and risk than models with fixed or few-level positions. In addition, the investment potential of the model is better tapped by a differential Sharpe ratio reward function than by a profit reward function.
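As a rough sketch of how direction and size can share one continuous action, the following illustrative PyTorch actor (our reconstruction under assumed layer sizes, not the authors' code) maps a window of market features through an LSTM to a tanh output whose sign is the trade direction and whose magnitude is the position size.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Maps a window of market observations to a position in [-1, 1].

    Sign = long/short direction, magnitude = position size.
    Layer sizes are illustrative guesses, not taken from the paper.
    """
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, n_features); use the final hidden state
        _, (h_n, _) = self.lstm(obs_seq)
        return torch.tanh(self.head(h_n[-1]))  # position in [-1, 1]

# usage: a batch of 8 windows, 30 time steps, 5 market features each
actor = LSTMActor(n_features=5)
positions = actor(torch.randn(8, 30, 5))  # shape (8, 1)
```

In a full DDPG setup this actor would be paired with a critic and trained off-policy from replayed transitions; only the variable-position output head is sketched here.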

Algorithms, 2020, Vol 13 (11), pp. 307
Author(s): Luca Pasqualini, Maurizio Parton

A Pseudo-Random Number Generator (PRNG) is any algorithm generating a sequence of numbers approximating the properties of random numbers. Such numbers are widely employed in mid-level cryptography and in software applications. Test suites are used to evaluate the quality of PRNGs by checking statistical properties of the generated sequences, which are commonly represented bit by bit. This paper proposes a Reinforcement Learning (RL) approach to the task of generating PRNGs from scratch by learning a policy to solve a partially observable Markov Decision Process (MDP), where the full state is the period of the generated sequence and the observation at each time step is the last sequence of bits appended to that state. We use a Long Short-Term Memory (LSTM) architecture to model the temporal relationship between observations at different time steps, tasking the LSTM memory with the extraction of significant features of the hidden portion of the MDP's states. We show that modeling a PRNG with a partially observable MDP and an LSTM architecture largely improves the results of the fully observable feedforward RL approach introduced in previous work.
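To make the setup concrete, here is a minimal illustrative LSTM policy over bits, assuming the observation is a window of recently generated bits; the hyperparameters and two-logit action head are our guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BitPolicy(nn.Module):
    """LSTM policy: observation = window of previously generated bits,
    action = distribution over the next bit. Sizes are assumptions."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # logits for bit 0 / bit 1

    def forward(self, bits, state=None):
        # bits: (batch, window, 1) floats in {0., 1.}
        out, state = self.lstm(bits, state)
        return self.head(out[:, -1]), state  # next-bit logits, LSTM state

policy = BitPolicy()
window = torch.randint(0, 2, (1, 16, 1)).float()
logits, state = policy(window)
next_bit = torch.distributions.Categorical(logits=logits).sample()
```

Training would reward the policy according to statistical test results on the emitted sequence; that loop is omitted here.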


2019, pp. 105971231989164
Author(s): Viet-Hung Dang, Ngo Anh Vien, TaeChoong Chung

Learning to make decisions in partially observable environments is a notorious problem that requires a complex representation of controllers. In most work, the controllers are designed as a non-linear mapping from a sequence of temporal observations to actions. Such a problem can, in principle, be formulated as a partially observable Markov decision process whose policy can be parameterised through the use of recurrent neural networks. In this paper, we propose an alternative framework that (a) uses the Long Short-Term Memory (LSTM) encoder-decoder framework to learn an internal state representation for historical observations and (b) integrates it into existing recurrent policy models to improve task performance. The LSTM encoder encodes a history of observations into a representation of internal states. The LSTM decoder can perform two alternative decoding tasks: reconstructing the same input observation sequence or predicting future observation sequences. The first proposed decoder acts like an auto-encoder that guides and constrains the learning of an internal state useful for the policy-optimisation task. The second proposed decoder decodes the internal state learnt by the encoder to predict future observation sequences, which makes the network act like a non-linear predictive state representation model. Both decoding tasks introduce constraints on the policy representation that help guide both the policy optimisation problem and the latent state representation learning. The integration of representation learning and policy optimisation aims to help learn more complex policies and improve the performance of policy learning tasks.
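A minimal sketch of the first decoding task (the auto-encoder variant), assuming single-layer LSTMs of equal size and teacher forcing; all dimensions are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class ObsAutoencoder(nn.Module):
    """Encoder compresses an observation history into an internal state;
    the decoder reconstructs the same sequence (the first decoding task
    described above). Sizes are illustrative assumptions."""
    def __init__(self, obs_dim: int, latent: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(obs_dim, latent, batch_first=True)
        self.decoder = nn.LSTM(obs_dim, latent, batch_first=True)
        self.out = nn.Linear(latent, obs_dim)

    def forward(self, obs_seq):
        _, state = self.encoder(obs_seq)   # internal state, reusable by the policy
        # teacher forcing: decode conditioned on the encoder's final state
        dec, _ = self.decoder(obs_seq, state)
        return self.out(dec), state

model = ObsAutoencoder(obs_dim=4)
seq = torch.randn(2, 10, 4)                # batch of observation histories
recon, state = model(seq)
loss = nn.functional.mse_loss(recon, seq)  # auxiliary reconstruction loss
```

The predictive variant would instead decode toward future observations, and the reconstruction loss would be added to the policy-optimisation objective as a constraint.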


2001, Vol 15, pp. 351-381
Author(s): J. Baxter, P. L. Bartlett, L. Weaver

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter, β, which has a natural interpretation in terms of the bias-variance trade-off; it requires no knowledge of the underlying state; and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm and with a conjugate-gradient algorithm that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (this volume) on a toy problem and practical aspects of the algorithms on a number of more realistic problems.
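The following NumPy sketch shows the published form of the GPOMDP estimator that these algorithms build on: an eligibility trace discounted by β combined with a running average. It is a schematic reconstruction, not the authors' implementation.

```python
import numpy as np

def gpomdp_estimate(grad_log_pi, rewards, beta):
    """GPOMDP-style biased estimate of the average-reward gradient.

    grad_log_pi : array (T, d), scores grad_theta log mu(u_t | theta, y_t)
    rewards     : array (T,), reward received after each action
    beta in [0, 1) trades bias against variance (the single free parameter).
    """
    T, d = grad_log_pi.shape
    z = np.zeros(d)        # eligibility trace
    delta = np.zeros(d)    # running gradient estimate
    for t in range(T):
        z = beta * z + grad_log_pi[t]
        delta += (rewards[t] * z - delta) / (t + 1)  # incremental mean
    return delta

# usage with synthetic data: estimate, then take one ascent step
rng = np.random.default_rng(0)
g = gpomdp_estimate(rng.normal(size=(1000, 3)), rng.normal(size=1000), beta=0.9)
theta = np.zeros(3) + 0.01 * g  # stochastic-gradient ascent step
```

The conjugate-gradient variant described above would reuse such estimates inside line searches rather than taking fixed-size steps.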


2020, Vol 13 (4), pp. 78
Author(s): Nico Zengeler, Uwe Handmann

We present a deep reinforcement learning framework for the automatic trading of contracts for difference (CfD) on indices at high frequency. Our contribution shows that reinforcement learning agents with recurrent long short-term memory (LSTM) networks can learn from recent market history and outperform the market. Usually, such approaches depend on low latency; in a real-world example, we show that an increased model size may compensate for higher latency. As the noisy nature of economic trends complicates predictions, especially for speculative assets, our approach does not predict price movements but instead uses a reinforcement learning agent to learn an overall lucrative trading policy. To this end, we simulate a virtual market environment based on historical trading data. Our environment provides a partially observable Markov decision process (POMDP) to reinforcement learners and allows the training of various strategies.
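As an illustration of such a POMDP market environment, here is a toy replay simulator (our simplification, with an assumed window size and a discrete long/flat/short action set) in which the agent sees only a recent window of returns rather than the full market state.

```python
import numpy as np

class CfDMarketEnv:
    """Toy POMDP market simulator replaying historical prices.

    Observation = last `window` log-returns (a partial view of the market);
    action in {-1, 0, +1} = short / flat / long one contract.
    A simplified sketch, not the paper's environment.
    """
    def __init__(self, prices, window=32):
        self.returns = np.diff(np.log(prices))
        self.window = window
        self.t = window

    def reset(self):
        self.t = self.window
        return self.returns[self.t - self.window:self.t]

    def step(self, action):
        reward = action * self.returns[self.t]  # PnL of the held position
        self.t += 1
        done = self.t >= len(self.returns)
        obs = self.returns[self.t - self.window:self.t]
        return obs, reward, done

# usage with a synthetic positive price path
rng = np.random.default_rng(1)
prices = 100.0 * np.exp(0.01 * np.cumsum(rng.normal(size=500)))
env = CfDMarketEnv(prices)
obs = env.reset()
```

A recurrent LSTM agent interacting with this loop would accumulate the hidden market context that the truncated observation window leaves out.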


2016, Vol 2016, pp. 1-14
Author(s): Zuo-wei Wang

Identifying the hidden state is important for solving problems with hidden state. We prove that any deterministic partially observable Markov decision process (POMDP) can be represented by a minimal, looping hidden-state transition model, and we propose a heuristic algorithm for constructing such a model. A new spatiotemporal associative memory network (STAMN) is proposed to realize the minimal, looping hidden-state transition model. The STAMN utilizes neuroactivity decay to realize short-term memory, connection weights between different nodes to represent long-term memory, and presynaptic potentials together with a synchronized activation mechanism to perform identification and recall simultaneously. Finally, we give empirical illustrations of the STAMN and compare its performance with that of other methods.
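The short-term-memory mechanism can be illustrated in a few lines: activations decay each step, so recently activated nodes remain partially active. The decay rate below is an arbitrary assumption, and the long-term memory (connection weights) and activation machinery are omitted entirely.

```python
import numpy as np

class DecayingMemory:
    """Minimal illustration of neuroactivity decay as short-term memory:
    each node's activation fades over time, so the activity vector encodes
    how recently each observation occurred. Decay rate is an assumption."""
    def __init__(self, n_nodes, decay=0.8):
        self.act = np.zeros(n_nodes)
        self.decay = decay

    def observe(self, node):
        self.act *= self.decay   # all activations fade (short-term memory)
        self.act[node] = 1.0     # the currently observed node fires fully

mem = DecayingMemory(n_nodes=5)
for node in [0, 2, 2, 4]:
    mem.observe(node)
print(mem.act)  # more recent nodes retain higher residual activity
```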


Sensors, 2020, Vol 20 (9), pp. 2481
Author(s): Ady-Daniel Mezei, Levente Tamás, Lucian Buşoniu

We consider a robot that must sort objects transported by a conveyor belt into different classes. Multiple observations must be performed before deciding on the class of each object, because imperfect sensing sometimes detects the incorrect object class. The objective is to sort the sequence of objects in a minimal number of observation and decision steps. We describe this task in the framework of partially observable Markov decision processes, and we propose a reward function that explicitly takes into account the information gain of the viewpoint-selection actions applied. The DESPOT algorithm is applied to solve the problem, automatically obtaining a sequence of observation viewpoints and class-decision actions. Observations are made either only for the object in the first position of the conveyor belt or for multiple adjacent positions at once. The performance of the single- and multiple-position variants is compared, and the impact of including the information gain is analyzed. Real-life experiments with a Baxter robot and an industrial conveyor belt are provided.
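A sketch of the information-gain idea, assuming a known discrete sensor model: the shaping reward is the entropy drop of the class belief after a Bayes update. The numbers are illustrative, not the paper's exact reward.

```python
import numpy as np

def entropy(b):
    b = b[b > 0]
    return -np.sum(b * np.log(b))

def info_gain_reward(belief, likelihood, obs):
    """Information gain of one observation: entropy of the class belief
    before minus after a Bayes update. likelihood[c, o] = P(obs o | class c).
    An illustrative shaping term, not the paper's exact formulation."""
    posterior = belief * likelihood[:, obs]
    posterior /= posterior.sum()
    return entropy(belief) - entropy(posterior), posterior

belief = np.array([0.5, 0.5])                 # two object classes, uniform prior
lik = np.array([[0.8, 0.2], [0.3, 0.7]])      # imperfect sensor model
gain, belief = info_gain_reward(belief, lik, obs=0)
```

In the POMDP, such a term rewards viewpoint actions that shrink uncertainty fastest, steering the solver toward informative observations before committing to a class decision.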


F1000Research, 2019, Vol 8, pp. 1619
Author(s): Taha Abdelhalim Nakabi, Pekka Toivanen

In this paper, we consider the problem of controlling thermostatically controlled loads (TCLs) through dynamic electricity prices, under partial observability of the environment and uncertainty of the control response. The problem is formulated as a Markov decision process in which an agent must find a near-optimal pricing scheme using partial observations of the state and action. We propose a long short-term memory (LSTM) network to learn the individual behaviors of TCL units and use the aggregated information to predict the response of the TCL cluster to a pricing policy. We then use this prediction model in a genetic algorithm to find the prices that maximize profit in an energy arbitrage operation. Simulation results show that the proposed method achieves a profit equal to 96% of the theoretical optimal solution.
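To show how the pieces fit together, here is a toy genetic search over a day-ahead price vector scored by a profit model (a stand-in for the paper's LSTM response predictor); the GA operators and rates are illustrative assumptions.

```python
import numpy as np

def genetic_price_search(profit_fn, horizon, pop=50, gens=100, sigma=0.1):
    """Toy genetic search over a price vector of length `horizon`,
    scored by `profit_fn` (in the paper, an LSTM-based cluster-response
    model). Selection and mutation schemes are illustrative guesses."""
    rng = np.random.default_rng(0)
    population = rng.uniform(0.0, 1.0, size=(pop, horizon))
    for _ in range(gens):
        scores = np.array([profit_fn(p) for p in population])
        parents = population[np.argsort(scores)[-pop // 2:]]          # keep fittest half
        children = parents + rng.normal(0.0, sigma, parents.shape)    # Gaussian mutation
        population = np.vstack([parents, children])
    return population[np.argmax([profit_fn(p) for p in population])]

# stand-in profit model; the paper would plug in the LSTM response predictor
best_prices = genetic_price_search(lambda p: -np.var(p) + p.mean(), horizon=24)
```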


2016, Vol 28 (3), pp. 563-593
Author(s): Yao Ma, Tingting Zhao, Kohei Hatano, Masashi Sugiyama

We consider the learning problem under an online Markov decision process (MDP), aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret: the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the proposed online policy gradient method through experiments.
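A schematic of such an online policy gradient method, assuming a projected gradient step with a 1/√t step size on each round's changing reward; the letter's analysis, not this toy loop, is what yields the stated regret bounds.

```python
import numpy as np

def project(theta, radius=1.0):
    """Project back onto a ball of given radius (bounded parameter set)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def online_policy_gradient(grad_fn, T, dim, eta=0.1):
    """Projected online gradient ascent on a time-varying reward:
    grad_fn(theta, t) returns the gradient of round t's reward.
    Step size and radius are illustrative choices."""
    theta = np.zeros(dim)
    for t in range(1, T + 1):
        theta = project(theta + (eta / np.sqrt(t)) * grad_fn(theta, t))
    return theta

# usage with a synthetic drifting concave reward r_t(x) = -(x - sin(t/50))^2 / 2
theta = online_policy_gradient(lambda th, t: -(th - np.sin(t / 50.0)), T=500, dim=1)
```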


2021
Author(s): Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare their sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
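A sketch of the subgradient scheme under our reading of the abstract: at the current reward mapping, a subgradient of the convex loss is the gap between the feature expectations of the policy that is optimal under that reward and those of the expert. The `solve_mdp` helper below is a hypothetical stand-in for a full contextual MDP solver, and the dummy solver is purely for illustration.

```python
import numpy as np

def irl_subgradient(reward_map, context, expert_features, solve_mdp):
    """One subgradient of a convex IRL loss at the current reward mapping:
    feature expectations of the reward-optimal policy minus the expert's.
    `solve_mdp(context, reward)` is a hypothetical MDP-solving helper."""
    return solve_mdp(context, reward_map) - expert_features

# illustrative usage with a dummy solver (not the paper's algorithm)
solve = lambda ctx, r: np.tanh(r + ctx)        # stand-in "optimal policy features"
reward, ctx = np.zeros(3), np.array([0.2, -0.1, 0.4])
expert = np.array([0.5, 0.1, 0.3])
for _ in range(200):
    reward -= 0.05 * irl_subgradient(reward, ctx, expert, solve)  # subgradient step
```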


Author(s): Chaochao Lin, Matteo Pozzi

Optimal exploration of engineering systems can be guided by the principle of Value of Information (VoI), which accounts for the topological importance of components, their reliability, and the management costs. For series systems, in most cases higher inspection priority should be given to unreliable components. For redundant systems such as parallel systems, analysis of one-shot decision problems shows that higher inspection priority should be given to more reliable components. This paper investigates the optimal exploration of redundant systems in long-term decision making with sequential inspection and repair. When the expected cumulative discounted cost is considered, it may become more efficient to give higher inspection priority to less reliable components, in order to preserve system redundancy. To investigate this problem, we develop a Partially Observable Markov Decision Process (POMDP) framework for the sequential inspection and maintenance of redundant systems, in which VoI analysis is embedded in the optimal selection of exploratory actions. We investigate the use of alternative approximate POMDP solvers for parallel and more general systems, compare their computational complexity and performance, and show how the inspection priorities depend on the economic discount factor, the degradation rate, the inspection precision, and the repair cost.
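A one-shot, single-component VoI computation (a deliberately simplified sketch with a perfect-inspection assumption, not the paper's POMDP formulation) illustrates the quantity being embedded in the action selection.

```python
def value_of_information(p_fail, repair_cost, failure_cost):
    """One-shot VoI for a single component: expected cost of the best
    prior-only action minus the expected cost when a (here assumed
    perfect) inspection reveals the true state first. Costs illustrative."""
    # best action without inspecting: always repair vs. never repair
    cost_prior = min(repair_cost, p_fail * failure_cost)
    # with a perfect inspection, repair only if the component has failed
    cost_posterior = p_fail * min(repair_cost, failure_cost)
    return cost_prior - cost_posterior

voi = value_of_information(p_fail=0.2, repair_cost=1.0, failure_cost=10.0)  # 0.8
# inspect the component whose VoI most exceeds its inspection cost
```

In the sequential, discounted setting studied above, this one-shot comparison is replaced by the POMDP value function, which is what can reverse the priority toward less reliable components.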

