Monaural Singing Voice Separation Using Fusion-Net with Time-Frequency Masking

Author(s):  
Feng Li ◽  
Kaizhi Qian ◽  
Mark Hasegawa-Johnson ◽  
Masato Akagi
2020 ◽  
Vol 10 (5) ◽  
pp. 1727
Author(s):  
Woon-Haeng Heo ◽  
Hyemi Kim ◽  
Oh-Wook Kwon

We propose a source separation architecture using a dilated time-frequency DenseNet for background music identification in broadcast content. We apply source separation techniques to mixed signals of music and speech. For source separation, we propose a new architecture that adds a time-frequency dilated convolution to the conventional DenseNet in order to effectively increase the receptive field. In addition, we apply different convolutions to each frequency band of the spectrogram in order to reflect the different characteristics of the low- and high-frequency bands. To verify the performance of the proposed architecture, we perform singing-voice separation and music-identification experiments. The results confirm that the proposed architecture produces the best performance in both experiments because it uses the dilated convolution to reflect wide contextual information.
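The claim that dilation widens the receptive field can be checked with simple arithmetic. The sketch below is not the authors' network; it only computes the receptive field along one axis (time or frequency) of a stack of stride-1 convolutions. The kernel size of 3 and the dilation schedule `[1, 2, 4, 8]` are assumptions chosen for illustration, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field along one axis of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Four ordinary 3-tap layers versus four layers with doubling dilation
# (illustrative values, not taken from the paper).
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(plain, dilated)  # 9 31
```

With the same number of layers and weights per layer, the dilated stack covers 31 spectrogram bins along the axis instead of 9, which is the kind of wider context the abstract refers to.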


2020 ◽  
Vol 10 (7) ◽  
pp. 2465
Author(s):  
Seungtae Kang ◽  
Jeong-Sik Park  ◽  
Gil-Jin Jang 

Single-channel singing voice separation has been considered a difficult task, as it requires predicting two different audio sources independently from mixed vocal and instrument sounds recorded by a single microphone. We propose a new singing voice separation approach based on the curriculum learning framework, in which learning starts with only easy examples and task difficulty is then gradually increased. In this study, we regard data showing obviously dominant characteristics of a single source as an easy case and all other data as a difficult case. To quantify the dominance between the two sources, we define a dominance factor that determines a difficulty level according to the relative intensity of the vocal sound and the instrument sound. If a given sample is determined to show obviously dominant characteristics of a single source according to this factor, it is regarded as an easy case; otherwise, it is regarded as a difficult case. Early stages of learning focus on easy cases, allowing the model to rapidly learn the overall characteristics of each source. Later stages handle difficult cases, allowing more careful and sophisticated learning. In experiments conducted on three song datasets, the proposed approach demonstrated superior performance compared to conventional approaches.
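The easy-versus-difficult split described above can be sketched as follows. The abstract does not give the exact definition of the dominance factor, so the normalized energy difference used here is an assumption, as are the threshold of 0.5, the `easy_epochs` cutoff, and all function names. It is a minimal illustration of the curriculum idea, not the authors' implementation.

```python
import numpy as np


def dominance_factor(vocal, accomp, eps=1e-12):
    """Assumed dominance measure: normalized energy difference in [0, 1].

    Close to 1 when one source dominates, 0 when both are equally strong.
    """
    ev = float(np.sum(vocal ** 2))
    ea = float(np.sum(accomp ** 2))
    return abs(ev - ea) / (ev + ea + eps)


def is_easy(vocal, accomp, threshold=0.5):
    """A sample is 'easy' when one source clearly dominates (assumed threshold)."""
    return dominance_factor(vocal, accomp) >= threshold


def curriculum_subset(examples, epoch, easy_epochs=5, threshold=0.5):
    """Return the training examples to use at a given epoch.

    Early epochs see only easy examples; later epochs see everything.
    """
    if epoch >= easy_epochs:
        return list(examples)
    return [(v, a) for v, a in examples if is_easy(v, a, threshold)]


# Toy (vocal, accompaniment) pairs: one vocal-dominant, one balanced.
dominant = (np.ones(4), 0.1 * np.ones(4))
balanced = (np.ones(4), np.ones(4))
examples = [dominant, balanced]

early = curriculum_subset(examples, epoch=0)
late = curriculum_subset(examples, epoch=10)
print(len(early), len(late))  # 1 2
```

A hard cutoff at `easy_epochs` is the simplest possible schedule; a gradual version could instead lower the threshold a little each epoch so that difficult samples are phased in.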

