PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network

Author(s):  
Bryan Wang ◽  
Yi-Hsuan Yang

Music creation typically consists of two parts: composing the musical score, and then performing the score with instruments to make sound. While recent work has made much progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Directly synthesizing audio with sound sample libraries often leads to mechanical and deadpan results, since musical scores do not contain performance-level information, such as subtle changes in timing and dynamics. Moreover, while the task may sound like a text-to-speech synthesis problem, there are fundamental differences since music audio has rich polyphonic sounds. To build such an AI performer, we propose in this paper a deep convolutional model that learns in an end-to-end manner the score-to-audio mapping between a symbolic representation of music called the pianoroll and an audio representation of music called the spectrogram. The model consists of two subnets: the ContourNet, which uses a U-Net structure to learn the correspondence between pianorolls and spectrograms and to give an initial result; and the TextureNet, which further uses a multi-band residual network to refine the result by adding the spectral texture of overtones and timbre. We train the model to generate music clips of the violin, cello, and flute, with a dataset of moderate size. We also present the results of a user study showing that our model achieves a higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two off-the-shelf synthesizers. Our source code is available at https://github.com/bwang514/PerformanceNet.
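For readers who want a concrete picture of the two-subnet design, a minimal PyTorch-style sketch is given below. The layer sizes, kernel sizes, and band splitting are illustrative assumptions, not the authors' exact PerformanceNet architecture; the intent is only to show a pianoroll passing through a coarse score-to-spectrogram network followed by a per-band residual refiner.

```python
import torch
import torch.nn as nn

class ContourNet(nn.Module):
    """Toy encoder-decoder (stand-in for the paper's U-Net) mapping a pianoroll
    (128 pitch channels) to a coarse spectrogram (here 513 frequency bins)."""
    def __init__(self, n_pitch=128, n_freq=513):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_pitch, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 512, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(512, n_freq, kernel_size=5, padding=2)

    def forward(self, pianoroll):                       # (batch, 128, time)
        return self.decoder(self.encoder(pianoroll))    # (batch, 513, time)

class TextureNet(nn.Module):
    """Toy multi-band residual refiner: the frequency axis is split into bands
    and each band gets its own small residual convolution stack."""
    def __init__(self, n_freq=513, n_bands=3):
        super().__init__()
        # band boundaries along the frequency axis
        self.edges = [round(b * n_freq / n_bands) for b in range(n_bands + 1)]
        self.refiners = nn.ModuleList()
        for b in range(n_bands):
            width = self.edges[b + 1] - self.edges[b]
            self.refiners.append(nn.Sequential(
                nn.Conv1d(width, width, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(width, width, kernel_size=3, padding=1),
            ))

    def forward(self, coarse):                          # (batch, n_freq, time)
        bands = []
        for b, refine in enumerate(self.refiners):
            chunk = coarse[:, self.edges[b]:self.edges[b + 1], :]
            bands.append(chunk + refine(chunk))         # residual refinement per band
        return torch.cat(bands, dim=1)

if __name__ == "__main__":
    pianoroll = torch.rand(1, 128, 200)                 # 1 clip, 128 pitches, 200 time frames
    spectrogram = TextureNet()(ContourNet()(pianoroll))
    print(spectrogram.shape)                            # torch.Size([1, 513, 200])
```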

2019 ◽  
Vol 7 (3) ◽  
pp. 80-82
Author(s):  
Lawakesh Patel ◽  
Nidhi Singh ◽  
Rizwan Khan

Author(s):  
Robert J Marks II

The Fourier transform is not particularly conducive to illustrating the evolution of frequency with respect to time. A representation of the temporal evolution of the spectral content of a signal is referred to as a time-frequency representation (TFR). The TFR, in essence, attempts to measure the instantaneous spectrum of a dynamic signal at each point in time. Musical scores, in their most fundamental interpretation, are TFRs. The fundamental frequency of a note is represented by its vertical location on the staff. Time progresses as we read notes from left to right. The musical score shown in Figure 9.1 is an example. Temporal assignment is given by the note types. The 120 next to the quarter note indicates the piece should be played at 120 beats per minute. Thus, the duration of a quarter note is one half second. The frequency of the A above middle C is, by international standards, 440 Hertz. Adjacent notes have a frequency ratio of 2^(1/12). The note A#, for example, has a frequency of 440 × 2^(1/12) ≈ 466.1637615 Hertz. Middle C, nine half tones (a.k.a. semitones or chromatic steps) below A, has a frequency of 440 × 2^(-9/12) ≈ 261.6255653 Hertz. The interval of an octave doubles the frequency. The note an octave above A lies twelve half tones higher, at 440 × 2^(12/12) = 880 Hertz. The frequency spacings in the time-frequency representation of musical scores such as Figure 9.1 are thus logarithmic. This is made clearer in the alternate representation of the musical score in Figure 9.2, where time is on the horizontal axis and frequency on the vertical. At every point in time where there is no rest, a frequency is assigned. To form chords, numerous frequencies can be assigned to a point in time. Further discussion of the technical theory of western harmony is in Section 13.1.
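The equal-temperament arithmetic in this passage is easy to verify with a few lines of Python; the helper below assumes the A4 = 440 Hz reference stated above.

```python
# Equal-tempered frequencies relative to the A above middle C (A4 = 440 Hz):
# adjacent semitones differ by a factor of 2**(1/12), as stated in the passage.
A4 = 440.0

def note_freq(semitones_from_a4: int) -> float:
    """Frequency of the note this many half tones above (+) or below (-) A4."""
    return A4 * 2 ** (semitones_from_a4 / 12)

print(note_freq(1))     # A#4: ~466.1637615 Hz
print(note_freq(-9))    # middle C: ~261.6255653 Hz
print(note_freq(12))    # A5, one octave up: 880.0 Hz
print(60 / 120)         # quarter-note duration in seconds at 120 beats per minute
```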


Author(s):  
Abigail Wiafe ◽  
Pasi Fränti

Affective algorithmic composition systems are emotionally intelligent automatic music generation systems that explore the current emotions or mood of a listener and compose affective music to alter the person's mood to a predetermined one. The fusion of affective algorithmic composition systems and smart spaces has been identified as beneficial. For instance, studies have shown that they can be used for therapeutic purposes. Amidst these benefits, research on the related security and ethical issues is lacking. This chapter therefore seeks to provoke discussion on the security and ethical implications of using affective algorithmic composition systems in smart spaces. It presents issues such as impersonation, eavesdropping, data tampering, malicious code, and denial-of-service attacks associated with affective algorithmic composition systems. It also discusses some ethical implications relating to intentions, harm, and possible conflicts that users of such systems may experience.


Author(s):  
YUNG-SHENG CHEN ◽  
FENG-SHENG CHEN ◽  
CHIN-HUNG TENG

Optical Music Recognition (OMR) is a technique for converting printed musical documents into computer-readable formats. In this paper, we present a simple OMR system that performs well for ordinary musical documents such as ballads and pop music. The system is built on fundamental image processing and pattern recognition techniques, and is thus easy to implement. Moreover, it has a strong capability in skew restoration and inverted musical score detection. In a series of experiments, the error of our skew restoration is below 0.2° for any possible document rotation, and the accuracy of inverted musical score detection is up to 98.89%. The overall recognition accuracy of our OMR system reaches nearly 97%, a figure comparable with current commercial OMR software. However, when image skew is taken into consideration, our system is superior to commercial software in terms of recognition accuracy.
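The abstract does not detail how skew restoration is performed, so the sketch below is only a generic, hypothetical skew estimator (a projection-profile search, not the authors' method): it rotates a binarized score page over candidate angles and keeps the angle at which the row-wise ink profile of the staff lines is sharpest.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_page: np.ndarray, max_angle: float = 10.0,
                  step: float = 0.1) -> float:
    """Hypothetical skew estimator for a binarized score page (ink = 1).

    Staff lines give sharp peaks in the row-wise ink profile when the page is
    upright, so we search for the candidate angle whose horizontal projection
    has the largest variance. Generic illustration only, not the paper's method.
    """
    best_angle, best_score = 0.0, -np.inf
    for angle in np.arange(-max_angle, max_angle + step, step):
        candidate = rotate(binary_page, angle, reshape=False, order=0)
        profile = candidate.sum(axis=1)        # ink per row
        score = profile.var()                  # peaky profile => well-aligned staves
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle                          # rotating the page by this angle deskews it
```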


2018 ◽  
Vol 11 (3) ◽  
pp. 50 ◽  
Author(s):  
Yongjie Huang ◽  
Xiaofeng Huang ◽  
Qiakai Cai

In this paper, we propose a model that combines a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) for music generation. We first convert a MIDI-format music file into a musical score matrix, and then establish convolution layers to extract features from the musical score matrix. Finally, the output of the convolution layers is split along the time axis and fed into the LSTM, so as to achieve the purpose of music generation. The model was evaluated through comparisons of accuracy, time-domain analysis, frequency-domain analysis and human auditory evaluation. The results show that the Convolution-LSTM performs better in music generation than the LSTM alone, with more pronounced undulations and a clearer melody.
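As a rough illustration of the described pipeline, the following PyTorch sketch applies 2-D convolutions to a score matrix, splits the resulting feature map along the time axis, and feeds the per-step feature vectors into an LSTM. All layer sizes are assumptions; the paper's exact configuration is not given in the abstract.

```python
import torch
import torch.nn as nn

class ConvLSTM(nn.Module):
    """Toy version of the described pipeline: 2-D convolutions over a pianoroll-style
    score matrix, then the feature map is sliced along the time axis and fed to an
    LSTM that predicts the next time step. Layer sizes are illustrative guesses."""
    def __init__(self, n_pitch=128, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=32 * n_pitch, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_pitch)       # next-step pitch activations

    def forward(self, score):                        # score: (batch, 1, time, n_pitch)
        feat = self.conv(score)                      # (batch, 32, time, n_pitch)
        batch, ch, time, pitch = feat.shape
        # split along the time axis: one feature vector per time step
        seq = feat.permute(0, 2, 1, 3).reshape(batch, time, ch * pitch)
        out, _ = self.lstm(seq)
        return torch.sigmoid(self.head(out))         # (batch, time, n_pitch)

if __name__ == "__main__":
    roll = torch.rand(2, 1, 64, 128)                 # 64 time steps, 128 MIDI pitches
    print(ConvLSTM()(roll).shape)                    # torch.Size([2, 64, 128])
```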


Mathematics ◽  
2021 ◽  
Vol 9 (4) ◽  
pp. 387
Author(s):  
Shuyu Li ◽  
Yunsick Sung

Deep learning has made significant progress in the field of automatic music generation. At present, research on music generation via deep learning can be divided into two categories: predictive models and generative models. However, both categories share problems that need to be resolved. First, the length of the music must be determined artificially prior to generation. Second, although the convolutional neural network (CNN) is unexpectedly superior to the recurrent neural network (RNN), the CNN still has several disadvantages. This paper proposes a conditional generative adversarial network approach using an inception model (INCO-GAN), which enables complete, variable-length music to be generated automatically. By adding a time-distribution layer that handles sequential data, the CNN considers temporal relationships in a manner similar to the RNN. In addition, the inception model obtains richer features, which improves the quality of the generated music. In the experiments conducted, music generated by the proposed method was compared with music created by human composers. A high cosine similarity of up to 0.987 was achieved between their frequency vectors, indicating that the music generated by the proposed method is very similar to that created by a human composer.
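The abstract does not specify how the frequency vectors are constructed, so the snippet below assumes a simple pitch-usage histogram purely to illustrate the cosine-similarity comparison between generated and human-composed music.

```python
import numpy as np

def pitch_histogram(midi_pitches):
    """Assumed 'frequency vector': how often each of the 128 MIDI pitches occurs."""
    hist = np.zeros(128)
    for p in midi_pitches:
        hist[p] += 1
    return hist

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

generated = pitch_histogram([60, 62, 64, 65, 67, 67, 69])   # toy generated melody
human = pitch_histogram([60, 62, 64, 65, 67, 69, 71])       # toy human-composed melody
print(round(cosine_similarity(generated, human), 3))        # similarity in [0, 1]
```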


2020 ◽  
Author(s):  
Jiyanbo Cao ◽  
Jinan Fiaidhi ◽  
Maolin Qi

This paper reviews the deep learning techniques used in music generation. The research is based on Sageev Oore's LSTM-based recurrent neural network, Performance RNN. We study the history of automatic music generation and apply state-of-the-art techniques to this task. We describe the process of converting a MIDI file into the input structure of Performance RNN, as well as the structure of the network itself.
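As a reading aid, the sketch below shows a simplified Performance RNN-style event encoding of a single MIDI note (note-on/note-off events, quantized time shifts, and velocity bins, following Oore et al.'s description); the bin sizes and helper function are illustrative assumptions, not the paper's code.

```python
# Simplified Performance RNN-style event vocabulary: note-on/note-off events,
# quantized time shifts, and velocity bins. Details here are illustrative.
NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY = range(4)

def encode_note(pitch: int, velocity: int, start: float, end: float, time_step=0.01):
    """Turn one MIDI note into a list of (event_type, value) tokens."""
    events = [(VELOCITY, velocity // 4),          # 128 MIDI velocities -> 32 bins
              (NOTE_ON, pitch)]
    # quantized time shifts covering the note duration (10 ms steps, max 1 s each)
    remaining = round((end - start) / time_step)
    while remaining > 0:
        shift = min(remaining, 100)
        events.append((TIME_SHIFT, shift))
        remaining -= shift
    events.append((NOTE_OFF, pitch))
    return events

print(encode_note(pitch=60, velocity=80, start=0.0, end=0.55))
```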

