Multi-layer Attention Mechanism Based Speech Separation Model

Speech information is the most important means of human communication, and it is crucial to separate the target voice from the mixed sound signals. This paper proposes a speech separation model based on convolutional neural networks and attention mechanism. The magnitude spectrum of the mixed speech signals, as the input, has its high dimensionality. By analyzing the characteristics of the convolutional neural network and attention mechanism, it can be found that the convolutional neural network can effectively extract low-dimensional features and mine the spatiotemporal structure information in the speech signals, and the attention mechanism can reduce the loss of sequence information. The accuracy of speech separation can be improved effectively by combining two mechanisms. Compared to the typical speech separation model DRNN-2 + discrim, this method achieves 0.27 dB GNSDR gain and 0.51 dB GSIR gain, which illustrates that the speech separation model proposed in this paper has achieved an ideal separation effect.

Download Full-text

Ultra Fast Speech Separation Model with Teacher Student Learning

10.21437/interspeech.2021-142 ◽

2021 ◽

Author(s):

Sanyuan Chen ◽

Yu Wu ◽

Zhuo Chen ◽

Jian Wu ◽

Takuya Yoshioka ◽

...

Keyword(s):

Student Learning ◽

Speech Separation ◽

Teacher Student ◽

Separation Model

Download Full-text

Implementation of Real-Time Speech Separation Model Using Time-Domain Audio Separation Network (TasNet) and Dual-Path Recurrent Neural Network (DPRNN)

Procedia Computer Science ◽

10.1016/j.procs.2021.01.065 ◽

2021 ◽

Vol 179 ◽

pp. 762-772

Author(s):

Alfian Wijayakusuma ◽

Davin Reinaldo Gozali ◽

Anthony Widjaja ◽

Hanry Ham

Keyword(s):

Neural Network ◽

Real Time ◽

Recurrent Neural Network ◽

Time Domain ◽

Speech Separation ◽

Separation Model

Download Full-text

Enabling an anechoic U-Net based speech separation model for online and offline applications in reverberant conditions

Applied Acoustics ◽

10.1016/j.apacoust.2021.108039 ◽

2021 ◽

Vol 179 ◽

pp. 108039

Author(s):

Sania Gul ◽

Muhammad Salman Khan ◽

Ata ur rehman ◽

Syed WaqarShah

Keyword(s):

Speech Separation ◽

Separation Model

Download Full-text

TransMask: A Compact and Fast Speech Separation Model Based on Transformer

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp39728.2021.9413670 ◽

2021 ◽

Author(s):

Zining Zhang ◽

Bingsheng He ◽

Zhenjie Zhang

Keyword(s):

Speech Separation ◽

Model Based ◽

Separation Model

Download Full-text

Deep Audio-Visual Speech Separation with Attention Mechanism

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp40776.2020.9054180 ◽

2020 ◽

Author(s):

Chenda Li ◽

Yanmin Qian

Keyword(s):

Attention Mechanism ◽

Visual Speech ◽

Speech Separation

Download Full-text

A convolutional recurrent neural network with attention framework for speech separation in monaural recordings

Scientific Reports ◽

10.1038/s41598-020-80713-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Chao Sun ◽

Min Zhang ◽

Ruijuan Wu ◽

Junhong Lu ◽

Guo Xian ◽

...

Keyword(s):

Neural Network ◽

Recurrent Neural Network ◽

Attention Mechanism ◽

Single Type ◽

Translation Invariance ◽

Separation Performance ◽

Speech Separation ◽

Feature Maps ◽

Front End ◽

The Time Domain

AbstractMost speech separation studies in monaural channel use only a single type of network, and the separation effect is typically not satisfactory, posing difficulties for high quality speech separation. In this study, we propose a convolutional recurrent neural network with an attention (CRNN-A) framework for speech separation, fusing advantages of two networks together. The proposed separation framework uses a convolutional neural network (CNN) as the front-end of a recurrent neural network (RNN), alleviating the problem that a sole RNN cannot effectively learn the necessary features. This framework makes use of the translation invariance provided by CNN to extract information without modifying the original signals. Within the supplemented CNN, two different convolution kernels are designed to capture information in both the time and frequency domains of the input spectrogram. After concatenating the time-domain and the frequency-domain feature maps, the feature information of speech is exploited through consecutive convolutional layers. Finally, the feature map learned from the front-end CNN is combined with the original spectrogram and is sent to the back-end RNN. Further, the attention mechanism is further incorporated, focusing on the relationship among different feature maps. The effectiveness of the proposed method is evaluated on the standard dataset MIR-1K and the results prove that the proposed method outperforms the baseline RNN and other popular speech separation methods, in terms of GNSDR (gloabl normalised source-to-distortion ratio), GSIR (global source-to-interferences ratio), and GSAR (gloabl source-to-artifacts ratio). In summary, the proposed CRNN-A framework can effectively combine the advantages of CNN and RNN, and further optimise the separation performance via the attention mechanism. The proposed framework can shed a new light on speech separation, speech enhancement, and other related fields.

Download Full-text