Automatic Melody Harmonization via Reinforcement Learning by Exploring Structured Representations for Melody Sequences

We present a novel reinforcement learning architecture that learns a structured representation for use in symbolic melody harmonization. Probabilistic models are predominant in melody harmonization tasks, most of which only treat melody notes as independent observations and do not take note of substructures in the melodic sequence. To fill this gap, we add substructure discovery as a crucial step in automatic chord generation. The proposed method consists of a structured representation module that generates hierarchical structures for the symbolic melodies, a policy module that learns to break a melody into segments (whose boundaries concur with chord changes) and phrases (the subunits in segments), and a harmonization module that generates chord sequences for each segment. We formulate the structure discovery process as a sequential decision problem with a policy gradient RL method selecting the boundary of each segment or phrase to obtain an optimized structure. We conduct experiments on our preprocessed HookTheory Lead Sheet Dataset, which has 17,979 melody/chord pairs. The results demonstrate that our proposed method can learn task-specific representations and, thus, yield competitive results compared with state-of-the-art baselines.
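The boundary-selection step described in this abstract can be illustrated with a minimal REINFORCE sketch. This is not the authors' architecture: the logistic policy, the per-note feature vectors, and the scalar episode return (standing in for some harmonization quality score) are all assumptions made for illustration, and the names `boundary_prob` and `reinforce_step` are hypothetical.

```python
import math
from typing import List, Sequence


def boundary_prob(theta: Sequence[float], x: Sequence[float]) -> float:
    """Probability of placing a boundary after a note with features x
    (assumed logistic policy, not taken from the paper)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_step(
    theta: Sequence[float],
    features: Sequence[Sequence[float]],
    actions: Sequence[int],
    episode_return: float,
    lr: float = 0.1,
) -> List[float]:
    """One REINFORCE update for a Bernoulli boundary policy.

    features[k] is the (hypothetical) feature vector of note k,
    actions[k] is 1 if a boundary was placed after note k, else 0,
    episode_return is the scalar score of the whole segmentation.
    """
    grad = [0.0] * len(theta)
    for x, a in zip(features, actions):
        p = boundary_prob(theta, x)
        # For a logistic Bernoulli policy, grad of log pi(a|x) is (a - p) * x.
        for i, xi in enumerate(x):
            grad[i] += (a - p) * xi
    # Gradient ascent on the expected return.
    return [t + lr * episode_return * g for t, g in zip(theta, grad)]
```

A segmentation that earns a positive return makes the boundary decisions it contained more likely; a negative return makes them less likely.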

On the use of the policy gradient and Hessian in inverse reinforcement learning

Intelligenza Artificiale ◽

10.3233/ia-180011 ◽

2020 ◽

Vol 14 (1) ◽

pp. 117-150

Author(s):

Alberto Maria Metelli ◽

Matteo Pirotta ◽

Marcello Restelli

Keyword(s):

Reinforcement Learning ◽

Sequential Decision ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Model Free ◽

Learning Speed ◽

Policy Gradient ◽

Continuous Domains ◽

Learning Policies ◽

Finite Domains

Reinforcement Learning (RL) is an effective approach to solving sequential decision-making problems when the environment is equipped with a reward function to evaluate the agent’s actions. However, there are several domains in which a reward function is not available and is difficult to estimate. When samples from expert agents are available, Inverse Reinforcement Learning (IRL) allows recovering a reward function that explains the demonstrated behavior. Most classic IRL methods, in addition to the expert’s demonstrations, require sampling the environment to evaluate each reward function, which, in turn, is built from a set of engineered features. This paper presents a novel model-free IRL approach that does not require specifying a function space in which to search for the expert’s reward function. Leveraging the fact that the policy gradient must be zero for an optimal policy, the algorithm generates an approximation space for the reward function, in which a reward is singled out employing a second-order criterion. After introducing our approach for finite domains, we extend it to continuous ones. The empirical results, on both finite and continuous domains, show that the reward function recovered by our algorithm allows learning policies that outperform, in terms of learning speed, those obtained with the true reward function.
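The optimality conditions this abstract relies on can be written out in their standard form. The block below is a sketch under assumptions: it takes the expert policy to be a differentiable parametric policy \pi_\theta, uses the policy gradient theorem for the first-order condition, and uses the usual negative-semidefinite Hessian test as one plausible reading of the "second-order criterion". The exact formulation in the paper may differ.

```latex
% First-order condition: the policy gradient vanishes at an optimal policy.
% Q^{\pi_\theta} depends on the unknown reward, so this constrains the reward.
\nabla_\theta J(\theta)
  \;\propto\;
  \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}
  \big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \big]
  = 0,
\qquad
% Second-order condition for a (local) maximum of the expected return.
\nabla^2_\theta J(\theta) \preceq 0 .
```

Rewards that satisfy the first condition for the expert's parameters form the candidate set; a second-order test is then one way to prefer rewards under which the expert sits at a maximum rather than at a saddle point or minimum.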

Modelling Sentence Pairs via Reinforcement Learning: An Actor-Critic Approach to Learn the Irrelevant Words

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6230 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7358-7366

Author(s):

Mahtab Ahmed ◽

Robert E. Mercer

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Language Processing ◽

Learning Task ◽

Strong Impact ◽

Sequential Decision ◽

Sentence Pair ◽

Current State ◽

Policy Gradient ◽

The Rich

Learning sentence representations is a fundamental task in Natural Language Processing. Most of the existing sentence pair modelling architectures focus only on extracting and using the rich sentence pair features. The drawback is that utilizing all of these features makes the learning process much harder. In this study, we propose a reinforcement learning (RL) method to learn a sentence pair representation when performing tasks like semantic similarity, paraphrase identification, and question-answer pair modelling. We formulate this learning problem as a sequential decision-making task where the decision made in the current state has a strong impact on the following decisions. We address this decision making with a policy gradient RL method which chooses the irrelevant words to delete by looking at the sub-optimal representation of the sentences being compared. With this policy, extensive experiments show that our model achieves on-par performance when learning task-specific representations of sentence pairs without needing any further knowledge like parse trees. We suggest that the simplicity of each task inference provided by our RL model makes it easier to explain.

Reinforcement Learning based on MPC and the Stochastic Policy Gradient Method

2021 American Control Conference (ACC) ◽

10.23919/acc50511.2021.9482765 ◽

2021 ◽

Author(s):

Sebastien Gros ◽

Mario Zanon

Keyword(s):

Reinforcement Learning ◽

Gradient Method ◽

Policy Gradient

Selective network discovery via deep reinforcement learning on embedded spaces

Applied Network Science ◽

10.1007/s41109-021-00365-8 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Peter Morales ◽

Rajmonda Sulo Caceres ◽

Tina Eliassi-Rad

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Sequential Decision ◽

Network Discovery ◽

Learning Tasks ◽

Partially Observed ◽

Decision Making Problem ◽

Resource Collection ◽

Improved Performance ◽

Discovery Algorithms

Complex networks are often either too large for full exploration, partially accessible, or partially observed. Downstream learning tasks on these incomplete networks can produce low quality results. In addition, reducing the incompleteness of the network can be costly and nontrivial. As a result, network discovery algorithms optimized for specific downstream learning tasks given resource collection constraints are of great interest. In this paper, we formulate the task-specific network discovery problem as a sequential decision-making problem. Our downstream task is selective harvesting, the optimal collection of vertices with a particular attribute. We propose a framework, called network actor critic (NAC), which learns a policy and notion of future reward in an offline setting via a deep reinforcement learning algorithm. The NAC paradigm utilizes a task-specific network embedding to reduce the state space complexity. A detailed comparative analysis of popular network embeddings is presented with respect to their role in supporting offline planning. Furthermore, a quantitative study is presented on various synthetic and real benchmarks using NAC and several baselines. We show that offline models of reward and network discovery policies lead to significantly improved performance when compared to competitive online discovery algorithms. Finally, we outline learning regimes where planning is critical in addressing sparse and changing reward signals.

Strategy Generation Based on Reinforcement Learning with Deep Deterministic Policy Gradient for UCAV

2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV) ◽

10.1109/icarcv50220.2020.9305446 ◽

2020 ◽

Author(s):

Yunhong Ma ◽

Shuyao Bai ◽

Yifei Zhao ◽

Chao Song ◽

Jie Yang

Keyword(s):

Reinforcement Learning ◽

Policy Gradient

Optimal Policies for Quantum Markov Decision Processes

International Journal of Automation and Computing ◽

10.1007/s11633-021-1278-z ◽

2021 ◽

Author(s):

Ming-Sheng Ying ◽

Yuan Feng ◽

Sheng-Gang Ying

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Quantum Systems ◽

Sequential Decision Making ◽

Mathematical Framework ◽

Sequential Decision ◽

Learning Techniques ◽

Optimal Policies ◽

Markov Decision ◽

Programming Algorithms

Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of MDP, namely quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and finding optimal policies for qMDPs in the case of finite-horizon. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
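For orientation, the classical counterpart of the finite-horizon dynamic programming mentioned in this abstract is backward induction. The sketch below covers only the ordinary (non-quantum) MDP case; the quantum extension is not described in the abstract and is not attempted here. The table layout `P[a][s][t]`, `R[s][a]` and the function name `backward_induction` are assumptions chosen for illustration.

```python
from typing import List, Tuple


def backward_induction(
    P: List[List[List[float]]],
    R: List[List[float]],
    horizon: int,
) -> Tuple[List[float], List[List[int]]]:
    """Finite-horizon dynamic programming for a classical MDP.

    P[a][s][t] is the probability of moving from state s to state t
    under action a; R[s][a] is the immediate reward.
    Returns the optimal values at time 0 and one decision rule per step.
    """
    n_states = len(R)
    n_actions = len(R[0])
    values = [0.0] * n_states  # value when no steps remain
    policy: List[List[int]] = []
    for _ in range(horizon):
        # Q-values with one more step to go than `values` accounts for.
        q = [
            [
                R[s][a] + sum(P[a][s][t] * values[t] for t in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        rule = [max(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
        values = [q[s][rule[s]] for s in range(n_states)]
        policy.append(rule)
    policy.reverse()  # policy[k] is the decision rule used at time step k
    return values, policy
```

Because the horizon is finite, the optimal decision rule generally depends on the time step, which is why one rule per step is returned rather than a single stationary policy.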

Formula-E race strategy development using distributed policy gradient reinforcement learning

Knowledge-Based Systems ◽

10.1016/j.knosys.2021.106781 ◽

2021 ◽

Vol 216 ◽

pp. 106781

Author(s):

Xuze Liu ◽

Abbas Fotouhi ◽

Daniel J. Auger

Keyword(s):

Reinforcement Learning ◽

Strategy Development ◽

Policy Gradient

Bias Correction in Reinforcement Learning via the Deterministic Policy Gradient Method for MPC-Based Policies

2021 American Control Conference (ACC) ◽

10.23919/acc50511.2021.9483016 ◽

2021 ◽

Author(s):

Sebastien Gros ◽

Mario Zanon

Keyword(s):

Reinforcement Learning ◽

Gradient Method ◽

Bias Correction ◽

Policy Gradient

Reinforced Knowledge Distillation: Multi-class Imbalanced Classifier Based on Policy Gradient Reinforcement Learning

Neurocomputing ◽

10.1016/j.neucom.2021.08.040 ◽

2021 ◽

Author(s):

Saite Fan ◽

Xinmin Zhang ◽

Zhihuan Song

Keyword(s):

Reinforcement Learning ◽

Policy Gradient ◽

Knowledge Distillation

Preceding vehicle following algorithm with human driving characteristics

Proceedings of the Institution of Mechanical Engineers Part D Journal of Automobile Engineering ◽

10.1177/0954407020981546 ◽

2021 ◽

pp. 095440702098154

Author(s):

Feng Pan ◽

Hong Bao

Keyword(s):

Reinforcement Learning ◽

Weight Vector ◽

Gradient Algorithm ◽

Inner Product ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Human Driver ◽

Policy Gradient ◽

Preceding Vehicle ◽

Action Spaces

This paper proposes a new approach that uses reinforcement learning (RL) to train an agent to perform the task of vehicle following with human driving characteristics. We draw on the idea of inverse reinforcement learning to design the reward function of the RL model. The factors that need to be weighed in vehicle following were vectorized into reward vectors, and the reward function was defined as the inner product of the reward vector and weights. Driving data from human drivers was collected and analyzed to obtain the true reward function. The RL model was trained with the deterministic policy gradient algorithm because the state and action spaces are continuous. We adjusted the weight vector of the reward function so that the value vector of the RL model could continuously approach that of a human driver. After dozens of rounds of training, we selected the policy whose value vector was nearest to that of a human driver and tested it in the PanoSim simulation environment. The results showed the desired performance for the task of an agent following the preceding vehicle safely and smoothly.
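The reward definition in this abstract, an inner product of a reward vector and a weight vector, can be sketched in a few lines. The three example components used in the usage note (gap error, relative speed, jerk) are hypothetical; the abstract does not list the factors the authors actually used, and `linear_reward` is an illustrative name.

```python
from typing import Sequence


def linear_reward(reward_vector: Sequence[float], weights: Sequence[float]) -> float:
    """Scalar reward as the inner product of a reward vector and weights."""
    if len(reward_vector) != len(weights):
        raise ValueError("reward_vector and weights must have the same length")
    return sum(r * w for r, w in zip(reward_vector, weights))
```

With hypothetical components `[gap_error, relative_speed, jerk] = [1.0, 2.0, 0.5]` and weights `[-1.0, -0.5, -2.0]`, each term contributes -1.0 and the reward is -3.0. Adjusting the weights changes how the factors trade off without changing how they are measured.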
