Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation

Author(s):  
Yucheng Zhao ◽  
Chong Luo ◽  
Zheng-Jun Zha ◽  
Wenjun Zeng

In this paper, we introduce the Transformer to time-domain methods for single-channel speech separation. The Transformer has the potential to boost speech separation performance because of its strong sequence modeling capability. However, its computational complexity, which grows quadratically with the sequence length, has made it largely inapplicable to speech applications. To tackle this issue, we propose a novel variation of the Transformer, named multi-scale group Transformer (MSGT). The key ideas are group self-attention, which significantly reduces the complexity, and multi-scale fusion, which retains the Transformer's ability to capture long-term dependencies. We implement two versions of MSGT with different complexities and apply them to a well-known time-domain speech separation method called Conv-TasNet. By simply replacing the original temporal convolutional network (TCN) with MSGT, our approach, called MSGT-TasNet, achieves a large gain over Conv-TasNet on both the WSJ0-2mix and WHAM! benchmarks. Without bells and whistles, the performance of MSGT-TasNet is already on par with state-of-the-art (SOTA) methods.
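The abstract does not include implementation details, but the complexity argument is easy to illustrate: if self-attention is restricted to non-overlapping groups of size g, the cost drops from O(L²) to O(L·g). The sketch below shows this generic grouping idea in PyTorch; the module name, head count, and group size are illustrative assumptions, not the authors' MSGT code.

```python
import torch
import torch.nn as nn

class GroupSelfAttention(nn.Module):
    """Illustrative sketch: self-attention restricted to fixed-size groups.

    Splitting a length-L sequence into groups of size g reduces the
    attention cost from O(L^2) to O(L * g). A generic sketch of the
    grouping idea, not the authors' MSGT implementation.
    """

    def __init__(self, dim, num_heads=4, group_size=64):
        super().__init__()
        self.group_size = group_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (batch, length, dim); length divisible by group_size
        b, l, d = x.shape
        g = self.group_size
        x = x.reshape(b * (l // g), g, d)   # fold groups into the batch axis
        out, _ = self.attn(x, x, x)         # attention within each group only
        return out.reshape(b, l, d)

# Usage: a larger group_size covers longer-range context at a coarser scale;
# fusing several such scales is what restores long-term dependency modeling.
x = torch.randn(2, 512, 128)
y = GroupSelfAttention(dim=128, group_size=64)(x)
```

This sketch covers only the single-scale grouping step; in MSGT, branches at different scales are fused so that long-range context is not lost.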

Entropy ◽  
2021 ◽  
Vol 23 (1) ◽  
pp. 116
Author(s):  
Xiangfa Zhao ◽  
Guobing Sun

Automatic sleep staging with only one channel is a challenging problem in sleep-related research. In this paper, a simple and efficient method named PPG-based multi-class automatic sleep staging (PMSS) is proposed, using only a photoplethysmography (PPG) signal. Single-channel PPG data were obtained from four categories of subjects in the CAP sleep database. After preprocessing the PPG data, features were extracted from the time domain, frequency domain, and nonlinear domain, for a total of 21 features. Finally, the Light Gradient Boosting Machine (LightGBM) classifier was used for multi-class sleep staging. The accuracy of the multi-class automatic sleep staging was over 70%, and Cohen's kappa statistic κ was over 0.6. The results also showed that the PMSS method can be applied to stage sleep in patients with sleep disorders.
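As a concrete illustration of the classification stage, the sketch below trains a LightGBM multi-class model on a placeholder 21-feature matrix and reports the two metrics quoted above (accuracy and Cohen's kappa). The synthetic data, class labels, and default hyperparameters are assumptions for demonstration; the paper's exact feature set and tuning are not reproduced here.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical feature matrix: one row per sleep epoch, 21 PPG-derived
# features (time-domain, frequency-domain, and nonlinear), as in PMSS.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 21))
y = rng.integers(0, 4, size=1000)  # e.g., four stage labels (wake/light/deep/REM)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LGBMClassifier()   # multi-class objective inferred from the labels
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print("Cohen's kappa:", cohen_kappa_score(y_te, pred))
```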


Electronics ◽  
2020 ◽  
Vol 9 (9) ◽  
pp. 1458
Author(s):  
Xulong Zhang ◽  
Yi Yu ◽  
Yongwei Gao ◽  
Xi Chen ◽  
Wei Li

Singing voice detection, or vocal detection, is a classification task that determines whether a given audio segment contains singing voices. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and non-singing parts, it is still very difficult for machines to do so. Most existing methods focus on audio feature engineering with classifiers, which relies on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer hearing. To extract essential features that reflect the audio content and characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) for vocal detection. The convolutional layers in the LRCN perform feature extraction, and the long short-term memory (LSTM) layer learns the temporal relationships. Preprocessing (singing voice and accompaniment separation) and postprocessing (time-domain smoothing) were combined to form a complete system. Experiments on five public datasets investigated the impact of different fused features, frame sizes, and block sizes on the LRCN's temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reaches the state-of-the-art level on public datasets.
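The LRCN pairing described here (a convolutional front end for per-frame feature extraction, followed by an LSTM over frames) can be sketched as below. The layer sizes, mel-spectrogram input, and per-frame sigmoid head are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LRCN(nn.Module):
    """Sketch of a long-term recurrent convolutional network (LRCN):
    convolutions extract per-frame features, an LSTM models the temporal
    context. A generic illustration, not the paper's exact architecture.
    """

    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # per-frame vocal / non-vocal logit

    def forward(self, spec):                 # spec: (batch, n_mels, frames)
        h = self.conv(spec).transpose(1, 2)  # -> (batch, frames, 64)
        h, _ = self.lstm(h)
        return self.head(h).squeeze(-1)      # (batch, frames) logits

# Postprocessing such as median filtering of the frame-wise decisions would
# implement the time-domain smoothing described above.
logits = LRCN()(torch.randn(4, 80, 300))
```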


1987 ◽  
Vol 96 (1 Suppl) ◽  
pp. 62-64 ◽  
Author(s):  
J. B. Millar ◽  
L. F. A. Martin ◽  
Y. C. Tong ◽  
G. M. Clark

A modified speech-processing strategy incorporating the temporal coding of information strongly correlated with the first formant of speech was evaluated in a long-term clinical experiment with a single patient. The aim was to assess whether the patient could learn to extract additional information from the time domain beyond the time-domain cues for voice excitation frequency already provided by the initial strategy. It was found that the patient gained no significant advantage from the modified strategy, but suffered no disadvantage either, and the patient expressed a preference for the modified strategy for everyday use.


2020 ◽  
Vol 34 (04) ◽  
pp. 5061-5068
Author(s):  
Qianli Ma ◽  
Zhenxi Lin ◽  
Enhuan Chen ◽  
Garrison Cottrell

Learning long-term and multi-scale dependencies in sequential data is a challenging task for recurrent neural networks (RNNs). In this paper, a novel RNN structure called the temporal pyramid RNN (TP-RNN) is proposed to achieve both goals. TP-RNN has a pyramid-like structure and generally has multiple layers. In each layer of the network, several sub-pyramids are connected by a shortcut path to the output, which can efficiently aggregate historical information from hidden states and provide many short gradient feedback paths. This avoids back-propagating through many hidden states, as in usual RNNs. In particular, in the multi-layer structure of TP-RNN, the input sequence of a higher layer is a large-scale aggregated state sequence produced by the sub-pyramids of the previous layer, rather than the usual sequence of hidden states. In this way, TP-RNN can explicitly learn multi-scale dependencies from the multi-scale input sequences of different layers, while shortening the input sequence and gradient feedback paths of each layer. This alleviates the vanishing gradient problem in deep RNNs and allows the network to efficiently learn long-term dependencies. We evaluate TP-RNN on several sequence modeling tasks, including the masked addition problem, pixel-by-pixel image classification, signal recognition, and speaker identification. Experimental results demonstrate that TP-RNN consistently outperforms existing RNNs for learning long-term and multi-scale dependencies in sequential data.
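A minimal sketch of the pyramid idea, assuming mean pooling as the window aggregator: each layer runs an RNN and then compresses its hidden-state sequence over non-overlapping windows, so deeper layers operate on shorter, coarser-scale sequences. The actual TP-RNN's sub-pyramid wiring and shortcut paths are more elaborate than this.

```python
import torch
import torch.nn as nn

class TemporalPyramidRNN(nn.Module):
    """Simplified sketch of a temporal pyramid: each layer runs a GRU, then
    aggregates its hidden states over non-overlapping windows so the next
    layer sees a shorter, coarser-scale sequence. Mean pooling stands in
    for the sub-pyramid aggregation of the actual TP-RNN.
    """

    def __init__(self, input_size, hidden=64, layers=3, window=4):
        super().__init__()
        self.window = window
        sizes = [input_size] + [hidden] * (layers - 1)
        self.rnns = nn.ModuleList(nn.GRU(s, hidden, batch_first=True) for s in sizes)

    def forward(self, x):                          # x: (batch, length, input_size)
        for rnn in self.rnns:
            h, _ = rnn(x)                          # per-step hidden states
            b, l, d = h.shape
            l = (l // self.window) * self.window   # drop any ragged tail
            # compress each window of hidden states into one coarse step
            x = h[:, :l].reshape(b, l // self.window, self.window, d).mean(dim=2)
        return x                                   # coarse multi-scale summary

out = TemporalPyramidRNN(input_size=1)(torch.randn(2, 128, 1))
```

Note how each layer's input is 4x shorter than the previous one, which is what shortens the gradient feedback paths.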


Author(s):  
Yidan Gao ◽  
Ying Min Low

A floating production system is exposed to many different environmental conditions over its service life. Consequently, the long-term fatigue analysis of deepwater risers is computationally demanding due to the need to evaluate the fatigue damage from a multitude of sea states. Because of the nonlinearities in the system, the dynamic analysis is often performed in the time domain. This further compounds the computational difficulty, owing to the time-consuming nature of time-domain analysis and the need to simulate a sufficient duration for each sea state to minimize sampling variability. This paper presents a new and efficient simulation technique for long-term fatigue analysis. The results based on this new technique are compared against those obtained from the direct simulation of numerous sea states.
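For context, the bookkeeping that direct simulation entails can be sketched as follows: each sea state contributes a short-term Miner damage estimate computed from its simulated stress cycles, weighted by its long-term probability of occurrence. The S-N parameters, scatter diagram, and synthetic cycle ranges below are illustrative assumptions, not the paper's data or its proposed technique.

```python
import numpy as np

# Illustrative S-N curve: cycles to failure N(S) = A * S**(-M).
A, M = 1.0e12, 3.0

def short_term_damage(stress_ranges, n_cycles_scale):
    """Miner damage from one sea state's stress-cycle ranges
    (in practice obtained by rainflow counting a time-domain simulation)."""
    cycles_to_failure = A * stress_ranges ** (-M)
    return n_cycles_scale * np.sum(1.0 / cycles_to_failure)

rng = np.random.default_rng(1)
sea_state_probs = np.array([0.5, 0.3, 0.2])      # long-term scatter diagram
damage = 0.0
for p in sea_state_probs:
    ranges = rng.weibull(1.5, size=2000) * 50.0  # stand-in for simulated cycles
    damage += p * short_term_damage(ranges, n_cycles_scale=365 * 24)
print("estimated long-term Miner damage:", damage)
```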

