LSTM-DDPG for Trading with Variable Positions

Sensors, 2021, Vol 21 (19), pp. 6571
Author(s): Zhichao Jia, Qiang Gao, Xiaohong Peng

In recent years, machine learning for trading has been widely studied. In trading decisions, both the direction and the size of the position should be determined according to market conditions. However, no research to date has considered variable position sizes in models developed for trading. In this paper, we propose a deep reinforcement learning model named LSTM-DDPG to make trading decisions with variable positions. Specifically, we consider the trading process as a Partially Observable Markov Decision Process, in which a long short-term memory (LSTM) network is used to extract market state features and the deep deterministic policy gradient (DDPG) framework is used to make trading decisions concerning the direction and the variable size of the position. We test the LSTM-DDPG model on IF300 (index futures of the China stock market) data, and the results show that LSTM-DDPG with variable positions performs better in terms of return and risk than models with fixed or few-level positions. In addition, the investment potential of the model is better tapped by a differential Sharpe ratio reward function than by a profit reward function.
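As a rough sketch of how direction and size can share one continuous action, the following illustrative PyTorch actor (our reconstruction under assumed layer sizes, not the authors' code) maps a window of market features through an LSTM to a tanh output whose sign is the trade direction and whose magnitude is the position size.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Maps a window of market observations to a position in [-1, 1].

    Sign = long/short direction, magnitude = position size.
    Layer sizes are illustrative guesses, not taken from the paper.
    """
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, n_features); use the final hidden state
        _, (h_n, _) = self.lstm(obs_seq)
        return torch.tanh(self.head(h_n[-1]))  # position in [-1, 1]

# usage: a batch of 8 windows, 30 time steps, 5 market features each
actor = LSTMActor(n_features=5)
positions = actor(torch.randn(8, 30, 5))  # shape (8, 1)
```

In a full DDPG setup this actor would be paired with a critic and trained off-policy from replayed transitions; only the variable-position output head is sketched here.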

Algorithms, 2020, Vol 13 (11), pp. 307
Author(s): Luca Pasqualini, Maurizio Parton

A Pseudo-Random Number Generator (PRNG) is any algorithm generating a sequence of numbers approximating the properties of random numbers. Such numbers are widely employed in mid-level cryptography and in software applications. Test suites are used to evaluate the quality of PRNGs by checking statistical properties of the generated sequences, which are commonly represented bit by bit. This paper proposes a Reinforcement Learning (RL) approach to the task of generating PRNGs from scratch by learning a policy to solve a partially observable Markov Decision Process (MDP), where the full state is the period of the generated sequence and the observation at each time step is the last sequence of bits appended to that state. We use a Long Short-Term Memory (LSTM) architecture to model the temporal relationship between observations at different time steps, tasking the LSTM memory with the extraction of significant features of the hidden portion of the MDP's states. We show that modeling a PRNG with a partially observable MDP and an LSTM architecture largely improves the results of the fully observable feedforward RL approach introduced in previous work.
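To make the setup concrete, here is a minimal illustrative LSTM policy over bits, assuming the observation is a window of recently generated bits; the hyperparameters and two-logit action head are our guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BitPolicy(nn.Module):
    """LSTM policy: observation = window of previously generated bits,
    action = distribution over the next bit. Sizes are assumptions."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # logits for bit 0 / bit 1

    def forward(self, bits, state=None):
        # bits: (batch, window, 1) floats in {0., 1.}
        out, state = self.lstm(bits, state)
        return self.head(out[:, -1]), state  # next-bit logits, LSTM state

policy = BitPolicy()
window = torch.randint(0, 2, (1, 16, 1)).float()
logits, state = policy(window)
next_bit = torch.distributions.Categorical(logits=logits).sample()
```

Training would reward the policy according to statistical test results on the emitted sequence; that loop is omitted here.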


2019, pp. 105971231989164
Author(s): Viet-Hung Dang, Ngo Anh Vien, TaeChoong Chung

Learning to make decisions in partially observable environments is a notorious problem that requires a complex representation of controllers. In most work, the controllers are designed as a non-linear mapping from a sequence of temporal observations to actions. Such a problem can, in principle, be formulated as a partially observable Markov decision process whose policy can be parameterised through the use of recurrent neural networks. In this paper, we propose an alternative framework that (a) uses the Long Short-Term Memory (LSTM) encoder-decoder framework to learn an internal state representation for historical observations and (b) integrates it into existing recurrent policy models to improve task performance. The LSTM encoder encodes a history of observations into a representation of internal states. The LSTM decoder can perform two alternative decoding tasks: reconstructing the same input observation sequence or predicting future observation sequences. The first proposed decoder acts like an auto-encoder that guides and constrains the learning of an internal state useful for the policy-optimisation task. The second proposed decoder decodes the internal state learnt by the encoder to predict future observation sequences, which makes the network act like a non-linear predictive state representation model. Both decoding tasks introduce constraints on the policy representation that help guide both the policy optimisation problem and the latent state representation learning. The integration of representation learning and policy optimisation aims to help learn more complex policies and improve the performance of policy learning tasks.
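A minimal sketch of the first decoding task (the auto-encoder variant), assuming single-layer LSTMs of equal size and teacher forcing; all dimensions are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class ObsAutoencoder(nn.Module):
    """Encoder compresses an observation history into an internal state;
    the decoder reconstructs the same sequence (the first decoding task
    described above). Sizes are illustrative assumptions."""
    def __init__(self, obs_dim: int, latent: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(obs_dim, latent, batch_first=True)
        self.decoder = nn.LSTM(obs_dim, latent, batch_first=True)
        self.out = nn.Linear(latent, obs_dim)

    def forward(self, obs_seq):
        _, state = self.encoder(obs_seq)   # internal state, reusable by the policy
        # teacher forcing: decode conditioned on the encoder's final state
        dec, _ = self.decoder(obs_seq, state)
        return self.out(dec), state

model = ObsAutoencoder(obs_dim=4)
seq = torch.randn(2, 10, 4)                # batch of observation histories
recon, state = model(seq)
loss = nn.functional.mse_loss(recon, seq)  # auxiliary reconstruction loss
```

The predictive variant would instead decode toward future observations, and the reconstruction loss would be added to the policy-optimisation objective as a constraint.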


2001, Vol 15, pp. 351-381
Author(s): J. Baxter, P. L. Bartlett, L. Weaver

In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, this volume), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter, β, which has a natural interpretation in terms of the bias-variance trade-off; it requires no knowledge of the underlying state; and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm and with a conjugate-gradient algorithm that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (this volume) on a toy problem and practical aspects of the algorithms on a number of more realistic problems.
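The following NumPy sketch shows the published form of the GPOMDP estimator that these algorithms build on: an eligibility trace discounted by β combined with a running average. It is a schematic reconstruction, not the authors' implementation.

```python
import numpy as np

def gpomdp_estimate(grad_log_pi, rewards, beta):
    """GPOMDP-style biased estimate of the average-reward gradient.

    grad_log_pi : array (T, d), scores grad_theta log mu(u_t | theta, y_t)
    rewards     : array (T,), reward received after each action
    beta in [0, 1) trades bias against variance (the single free parameter).
    """
    T, d = grad_log_pi.shape
    z = np.zeros(d)        # eligibility trace
    delta = np.zeros(d)    # running gradient estimate
    for t in range(T):
        z = beta * z + grad_log_pi[t]
        delta += (rewards[t] * z - delta) / (t + 1)  # incremental mean
    return delta

# usage with synthetic data: estimate, then take one ascent step
rng = np.random.default_rng(0)
g = gpomdp_estimate(rng.normal(size=(1000, 3)), rng.normal(size=1000), beta=0.9)
theta = np.zeros(3) + 0.01 * g  # stochastic-gradient ascent step
```

The conjugate-gradient variant described above would reuse such estimates inside line searches rather than taking fixed-size steps.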


2020, Vol 13 (4), pp. 78
Author(s): Nico Zengeler, Uwe Handmann

We present a deep reinforcement learning framework for the automatic trading of contracts for difference (CfD) on indices at high frequency. Our contribution shows that reinforcement learning agents with recurrent long short-term memory (LSTM) networks can learn from recent market history and outperform the market. Usually, such approaches depend on low latency; in a real-world example, we show that an increased model size may compensate for higher latency. As the noisy nature of economic trends complicates predictions, especially for speculative assets, our approach does not predict price movements but instead uses a reinforcement learning agent to learn an overall lucrative trading policy. To this end, we simulate a virtual market environment based on historical trading data. Our environment provides a partially observable Markov decision process (POMDP) to reinforcement learners and allows the training of various strategies.
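As an illustration of such a POMDP market environment, here is a toy replay simulator (our simplification, with an assumed window size and a discrete long/flat/short action set) in which the agent sees only a recent window of returns rather than the full market state.

```python
import numpy as np

class CfDMarketEnv:
    """Toy POMDP market simulator replaying historical prices.

    Observation = last `window` log-returns (a partial view of the market);
    action in {-1, 0, +1} = short / flat / long one contract.
    A simplified sketch, not the paper's environment.
    """
    def __init__(self, prices, window=32):
        self.returns = np.diff(np.log(prices))
        self.window = window
        self.t = window

    def reset(self):
        self.t = self.window
        return self.returns[self.t - self.window:self.t]

    def step(self, action):
        reward = action * self.returns[self.t]  # PnL of the held position
        self.t += 1
        done = self.t >= len(self.returns)
        obs = self.returns[self.t - self.window:self.t]
        return obs, reward, done

# usage with a synthetic positive price path
rng = np.random.default_rng(1)
prices = 100.0 * np.exp(0.01 * np.cumsum(rng.normal(size=500)))
env = CfDMarketEnv(prices)
obs = env.reset()
```

A recurrent LSTM agent interacting with this loop would accumulate the hidden market context that the truncated observation window leaves out.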


2016, Vol 2016, pp. 1-14
Author(s): Zuo-wei Wang

Identifying the hidden state is important for solving problems with hidden state. We prove that any deterministic partially observable Markov decision process (POMDP) can be represented by a minimal, looping hidden-state transition model, and we propose a heuristic algorithm for constructing such a model. A new spatiotemporal associative memory network (STAMN) is proposed to realize the minimal, looping hidden-state transition model. The STAMN utilizes neuroactivity decay to realize short-term memory, connection weights between different nodes to represent long-term memory, and presynaptic potentials together with a synchronized activation mechanism to perform identification and recall simultaneously. Finally, we give empirical illustrations of the STAMN and compare its performance with that of other methods.
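The short-term-memory mechanism can be illustrated in a few lines: activations decay each step, so recently activated nodes remain partially active. The decay rate below is an arbitrary assumption, and the long-term memory (connection weights) and activation machinery are omitted entirely.

```python
import numpy as np

class DecayingMemory:
    """Minimal illustration of neuroactivity decay as short-term memory:
    each node's activation fades over time, so the activity vector encodes
    how recently each observation occurred. Decay rate is an assumption."""
    def __init__(self, n_nodes, decay=0.8):
        self.act = np.zeros(n_nodes)
        self.decay = decay

    def observe(self, node):
        self.act *= self.decay   # all activations fade (short-term memory)
        self.act[node] = 1.0     # the currently observed node fires fully

mem = DecayingMemory(n_nodes=5)
for node in [0, 2, 2, 4]:
    mem.observe(node)
print(mem.act)  # more recent nodes retain higher residual activity
```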


Sensors, 2020, Vol 20 (9), pp. 2481
Author(s): Ady-Daniel Mezei, Levente Tamás, Lucian Buşoniu

We consider a robot that must sort objects transported by a conveyor belt into different classes. Multiple observations must be performed before deciding on the class of each object, because imperfect sensing sometimes detects the incorrect object class. The objective is to sort the sequence of objects in a minimal number of observation and decision steps. We describe this task in the framework of partially observable Markov decision processes, and we propose a reward function that explicitly takes into account the information gain of the viewpoint-selection actions applied. The DESPOT algorithm is applied to solve the problem, automatically obtaining a sequence of observation viewpoints and class-decision actions. Observations are made either only for the object in the first position of the conveyor belt or for multiple adjacent positions at once. The performance of the single- and multiple-position variants is compared, and the impact of including the information gain is analyzed. Real-life experiments with a Baxter robot and an industrial conveyor belt are provided.
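A sketch of the information-gain idea, assuming a known discrete sensor model: the shaping reward is the entropy drop of the class belief after a Bayes update. The numbers are illustrative, not the paper's exact reward.

```python
import numpy as np

def entropy(b):
    b = b[b > 0]
    return -np.sum(b * np.log(b))

def info_gain_reward(belief, likelihood, obs):
    """Information gain of one observation: entropy of the class belief
    before minus after a Bayes update. likelihood[c, o] = P(obs o | class c).
    An illustrative shaping term, not the paper's exact formulation."""
    posterior = belief * likelihood[:, obs]
    posterior /= posterior.sum()
    return entropy(belief) - entropy(posterior), posterior

belief = np.array([0.5, 0.5])                 # two object classes, uniform prior
lik = np.array([[0.8, 0.2], [0.3, 0.7]])      # imperfect sensor model
gain, belief = info_gain_reward(belief, lik, obs=0)
```

In the POMDP, such a term rewards viewpoint actions that shrink uncertainty fastest, steering the solver toward informative observations before committing to a class decision.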


F1000Research, 2019, Vol 8, pp. 1619
Author(s): Taha Abdelhalim Nakabi, Pekka Toivanen

In this paper, we consider the problem of controlling thermostatically controlled loads (TCLs) through dynamic electricity prices, under partial observability of the environment and uncertainty of the control response. The problem is formulated as a Markov decision process in which an agent must find a near-optimal pricing scheme using partial observations of the state and action. We propose a long short-term memory (LSTM) network to learn the individual behaviors of TCL units and use the aggregated information to predict the response of the TCL cluster to a pricing policy. We then use this prediction model in a genetic algorithm to find the prices that maximize profit in an energy arbitrage operation. Simulation results show that the proposed method achieves a profit equal to 96% of the theoretical optimal solution.
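To show how the pieces fit together, here is a toy genetic search over a day-ahead price vector scored by a profit model (a stand-in for the paper's LSTM response predictor); the GA operators and rates are illustrative assumptions.

```python
import numpy as np

def genetic_price_search(profit_fn, horizon, pop=50, gens=100, sigma=0.1):
    """Toy genetic search over a price vector of length `horizon`,
    scored by `profit_fn` (in the paper, an LSTM-based cluster-response
    model). Selection and mutation schemes are illustrative guesses."""
    rng = np.random.default_rng(0)
    population = rng.uniform(0.0, 1.0, size=(pop, horizon))
    for _ in range(gens):
        scores = np.array([profit_fn(p) for p in population])
        parents = population[np.argsort(scores)[-pop // 2:]]          # keep fittest half
        children = parents + rng.normal(0.0, sigma, parents.shape)    # Gaussian mutation
        population = np.vstack([parents, children])
    return population[np.argmax([profit_fn(p) for p in population])]

# stand-in profit model; the paper would plug in the LSTM response predictor
best_prices = genetic_price_search(lambda p: -np.var(p) + p.mean(), horizon=24)
```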


2016, Vol 28 (3), pp. 563-593
Author(s): Yao Ma, Tingting Zhao, Kohei Hatano, Masashi Sugiyama

We consider the learning problem under an online Markov decision process (MDP), aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret: the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the proposed online policy gradient method through experiments.
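A schematic of such an online policy gradient method, assuming a projected gradient step with a 1/√t step size on each round's changing reward; the letter's analysis, not this toy loop, is what yields the stated regret bounds.

```python
import numpy as np

def project(theta, radius=1.0):
    """Project back onto a ball of given radius (bounded parameter set)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def online_policy_gradient(grad_fn, T, dim, eta=0.1):
    """Projected online gradient ascent on a time-varying reward:
    grad_fn(theta, t) returns the gradient of round t's reward.
    Step size and radius are illustrative choices."""
    theta = np.zeros(dim)
    for t in range(1, T + 1):
        theta = project(theta + (eta / np.sqrt(t)) * grad_fn(theta, t))
    return theta

# usage with a synthetic drifting concave reward r_t(x) = -(x - sin(t/50))^2 / 2
theta = online_policy_gradient(lambda th, t: -(th - np.sin(t / 50.0)), T=500, dim=1)
```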


2021
Author(s): Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare their sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
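A sketch of the subgradient scheme under our reading of the abstract: at the current reward mapping, a subgradient of the convex loss is the gap between the feature expectations of the policy that is optimal under that reward and those of the expert. The `solve_mdp` helper below is a hypothetical stand-in for a full contextual MDP solver, and the dummy solver is purely for illustration.

```python
import numpy as np

def irl_subgradient(reward_map, context, expert_features, solve_mdp):
    """One subgradient of a convex IRL loss at the current reward mapping:
    feature expectations of the reward-optimal policy minus the expert's.
    `solve_mdp(context, reward)` is a hypothetical MDP-solving helper."""
    return solve_mdp(context, reward_map) - expert_features

# illustrative usage with a dummy solver (not the paper's algorithm)
solve = lambda ctx, r: np.tanh(r + ctx)        # stand-in "optimal policy features"
reward, ctx = np.zeros(3), np.array([0.2, -0.1, 0.4])
expert = np.array([0.5, 0.1, 0.3])
for _ in range(200):
    reward -= 0.05 * irl_subgradient(reward, ctx, expert, solve)  # subgradient step
```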


Author(s): Chaochao Lin, Matteo Pozzi

Optimal exploration of engineering systems can be guided by the principle of Value of Information (VoI), which accounts for the topological importance of components, their reliability, and the management costs. For series systems, in most cases higher inspection priority should be given to unreliable components. For redundant systems such as parallel systems, analysis of one-shot decision problems shows that higher inspection priority should be given to more reliable components. This paper investigates the optimal exploration of redundant systems in long-term decision making with sequential inspection and repair. When the expected cumulative discounted cost is considered, it may become more efficient to give higher inspection priority to less reliable components, in order to preserve system redundancy. To investigate this problem, we develop a Partially Observable Markov Decision Process (POMDP) framework for the sequential inspection and maintenance of redundant systems, in which VoI analysis is embedded in the optimal selection of exploratory actions. We investigate the use of alternative approximate POMDP solvers for parallel and more general systems, compare their computational complexity and performance, and show how the inspection priorities depend on the economic discount factor, the degradation rate, the inspection precision, and the repair cost.
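A one-shot, single-component VoI computation (a deliberately simplified sketch with a perfect-inspection assumption, not the paper's POMDP formulation) illustrates the quantity being embedded in the action selection.

```python
def value_of_information(p_fail, repair_cost, failure_cost):
    """One-shot VoI for a single component: expected cost of the best
    prior-only action minus the expected cost when a (here assumed
    perfect) inspection reveals the true state first. Costs illustrative."""
    # best action without inspecting: always repair vs. never repair
    cost_prior = min(repair_cost, p_fail * failure_cost)
    # with a perfect inspection, repair only if the component has failed
    cost_posterior = p_fail * min(repair_cost, failure_cost)
    return cost_prior - cost_posterior

voi = value_of_information(p_fail=0.2, repair_cost=1.0, failure_cost=10.0)  # 0.8
# inspect the component whose VoI most exceeds its inspection cost
```

In the sequential, discounted setting studied above, this one-shot comparison is replaced by the POMDP value function, which is what can reverse the priority toward less reliable components.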

