End-to-End Monaural Speech Separation with a Deep Complex U-Shaped Network

Author(s):  
Wen Zhang ◽  
Xiaoyong Li ◽  
Aolong Zhou ◽  
Kefeng Deng ◽  
Kaijun Ren ◽  
...  

Conventional time–frequency (TF) domain source separation methods mainly focus on predicting TF masks or speech spectra, where the complex ideal ratio mask (cIRM) is an effective target for speech enhancement and separation. However, some recent studies employ a real-valued network, such as a generic convolutional neural network (CNN) or recurrent neural network (RNN), to predict a complex-valued mask or spectrogram target, which leads to unbalanced training of the real and imaginary parts. In this paper, to estimate the complex-valued target more accurately, we propose a novel U-shaped complex network for complex signal approximation (uCSA). uCSA is a time-domain separation method with an adaptive front end that tackles the monaural source separation problem in three ways. First, we design and implement a complex U-shaped network architecture comprising well-defined complex-valued encoder and decoder blocks, as well as complex-valued bidirectional Long Short-Term Memory (BLSTM) layers, to carry out complex-valued operations. Second, the cIRM is the training target of our uCSA method, optimized via signal approximation (SA), which exploits both the real and imaginary components of the complex-valued spectrum. Third, we reformulate the STFT and inverse STFT as differentiable operations and train the model with the scale-invariant source-to-noise ratio (SI-SNR) loss, achieving end-to-end training of the speech source separation task. The proposed uCSA models are evaluated on the WSJ0-2mix dataset, a corpus commonly used by supervised speech separation methods. Extensive experimental results indicate that our method obtains state-of-the-art performance in terms of the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics.
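Since the STFT and inverse STFT are reformulated as differentiable operations, a waveform-level loss can drive the whole network. As a rough illustration, here is a minimal sketch of the scale-invariant source-to-noise ratio (SI-SNR) objective in PyTorch; the tensor shapes and the zero-mean step are assumptions, not the authors' exact implementation.

```python
import torch

def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SNR between estimated and reference waveforms of shape (batch, samples)."""
    # Remove the mean so the measure is invariant to a constant offset.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)

    # Project the estimate onto the target (the scale-invariant "signal" part).
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target

    # Everything orthogonal to the target counts as noise.
    e_noise = estimate - s_target

    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps
    )
    return -si_snr.mean()  # minimize the negative SI-SNR
```

Because `torch.stft` and `torch.istft` are themselves differentiable, gradients from this time-domain loss can flow back through the mask estimation stage, which is what makes the end-to-end training described above possible.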

Author(s):  
Ratish Puduppully ◽  
Li Dong ◽  
Mirella Lapata

Recent advances in data-to-text generation have led to the use of large-scale datasets and neural network models which are trained end-to-end, without explicitly modeling what to say and in what order. In this work, we present a neural network architecture which incorporates content selection and planning without sacrificing end-to-end training. We decompose the generation task into two stages: given a corpus of data records (paired with descriptive documents), we first generate a content plan highlighting which information should be mentioned and in which order, and then generate the document while taking the content plan into account. Automatic and human-based evaluation experiments show that our model outperforms strong baselines, improving the state of the art on the recently released RotoWIRE dataset.
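The two-stage decomposition described above could be sketched, very roughly, as below; the module layout, the score-based record ranking, and all layer sizes are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class PlanThenGenerate(nn.Module):
    """Toy plan-then-generate pipeline: rank data records, then decode conditioned on the plan."""

    def __init__(self, record_dim: int, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.record_enc = nn.Linear(record_dim, hidden)
        self.select = nn.Linear(hidden, 1)                 # content-selection score per record
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, records: torch.Tensor) -> torch.Tensor:
        # records: (batch, num_records, record_dim)
        h = torch.tanh(self.record_enc(records))           # encode each record
        scores = self.select(h).squeeze(-1)                 # (batch, num_records)
        order = scores.argsort(dim=-1, descending=True)     # crude "content plan": records ranked by score
        planned = torch.gather(h, 1, order.unsqueeze(-1).expand_as(h))
        dec_out, _ = self.decoder(planned)                  # decode while following the plan order
        return self.out(dec_out)                            # per-step vocabulary logits
```

Note that a hard argsort is not differentiable; it stands in here only to convey the select-order-then-generate structure, whereas the paper trains its planning mechanism jointly with the generator.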


Sensors ◽  
2021 ◽  
Vol 21 (18) ◽  
pp. 6302
Author(s):  
Xupei Zhang ◽  
Zhanzhuang He ◽  
Zhong Ma ◽  
Peng Jun ◽  
Kun Yang

Altitude estimation is one of the fundamental tasks of unmanned aerial vehicle (UAV) automatic navigation, where the aim is to accurately and robustly estimate the relative altitude between the UAV and specific areas. However, most methods rely on auxiliary signal reception or expensive equipment, which are not always available or applicable owing to signal interference, cost, or power-consumption limits in real application scenarios. In addition, fixed-wing UAVs have more complex kinematic models than vertical take-off and landing UAVs. Therefore, an altitude estimation method that can be applied robustly to fixed-wing UAVs in a GPS-denied environment must be considered. In this paper, we present a method for high-precision altitude estimation that combines vision information from a monocular camera and pose information from the inertial measurement unit (IMU) through a novel end-to-end deep neural network architecture. Our method has several advantages over existing approaches. First, we use visual-inertial information and physics-based reasoning to build an ideal altitude model that provides general applicability and data efficiency for neural network learning. A further advantage is a novel feature fusion module that simplifies the tedious manual calibration and synchronization of the camera and IMU required by standard visual or visual-inertial methods to obtain the data association for altitude estimation. Finally, the proposed method was evaluated and validated using real flight data obtained during the landing phase of a fixed-wing UAV. The results show that the average estimation error of our method is less than 3% of the actual altitude, which substantially improves altitude estimation accuracy compared with other visual and visual-inertial methods.
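The abstract does not specify the internals of the fusion module, so the following is only a minimal sketch of the general visual-inertial fusion pattern it describes: learned image and IMU embeddings concatenated ahead of a regression head. All module names, dimensions, and the concatenation choice are assumptions.

```python
import torch
import torch.nn as nn

class VisualInertialAltitude(nn.Module):
    """Toy fusion of CNN image features and MLP IMU features into an altitude estimate."""

    def __init__(self, imu_dim: int = 6, hidden: int = 128):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden),
        )
        self.imu_branch = nn.Sequential(
            nn.Linear(imu_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                      # relative altitude estimate
        )

    def forward(self, image: torch.Tensor, imu: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W); imu: (batch, imu_dim) pose/acceleration features
        fused = torch.cat([self.image_branch(image), self.imu_branch(imu)], dim=-1)
        return self.head(fused)
```

Learning the association between the two streams in this way is what lets the network sidestep explicit camera-IMU calibration and synchronization.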


2018 ◽  
Vol 105 ◽  
pp. 175-181 ◽  
Author(s):  
Jon Ander Gómez ◽  
Juan Arévalo ◽  
Roberto Paredes ◽  
Jordi Nin

2020 ◽  
Vol 10 (1) ◽  
pp. 338 ◽  
Author(s):  
Paulo Lapa ◽  
Mauro Castelli ◽  
Ivo Gonçalves ◽  
Evis Sala ◽  
Leonardo Rundo

Prostate Cancer (PCa) is the most common oncological disease in Western men. Despite a growing effort by the scientific community in recent years, accurate and reliable automated PCa detection on multiparametric Magnetic Resonance Imaging (mpMRI) remains an open issue. In this work, a Deep Neural Network architecture is developed for the task of classifying clinically significant PCa on non-contrast-enhanced MR images. In particular, we propose the use of Conditional Random Fields as a Recurrent Neural Network (CRF-RNN) to enhance the classification performance of XmasNet, a Convolutional Neural Network (CNN) architecture specifically tailored to the PROSTATEx17 Challenge. The devised approach builds a hybrid end-to-end trainable network, CRF-XmasNet, composed of an initial CNN component performing feature extraction and a CRF-based probabilistic graphical model component for structured prediction, without the need for two separate training procedures. Experimental results show the suitability of this method in terms of classification accuracy and training time, although the high variability of the observed results must be reduced before transferring the resulting architecture to a clinical environment. Interestingly, using CRFs as a separate post-processing step achieves significantly lower performance than the proposed hybrid end-to-end approach. The hybrid end-to-end CRF-RNN approach yields excellent peak performance for all the CNN architectures taken into account, but it shows high variability, thus requiring further investigation into the integration of CRFs into CNNs.
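For orientation only, the sketch below shows the kind of unrolled mean-field update that a CRF-as-RNN layer stacks on top of CNN outputs so that both parts train jointly. The message-passing step here is a plain learnable convolution rather than the bilateral filtering of the original CRF-RNN, and all shapes are assumptions, so this is a schematic rather than the CRF-XmasNet implementation.

```python
import torch
import torch.nn as nn

class MeanFieldCRF(nn.Module):
    """Unrolled mean-field inference over CNN unary scores (schematic)."""

    def __init__(self, num_classes: int, num_iters: int = 5):
        super().__init__()
        self.num_iters = num_iters
        # Learnable label-compatibility transform (Potts-like interaction).
        self.compat = nn.Linear(num_classes, num_classes, bias=False)
        # Simplified spatial message passing; the original CRF-RNN uses bilateral filtering instead.
        self.message = nn.Conv2d(num_classes, num_classes, 3, padding=1, groups=num_classes)

    def forward(self, unary: torch.Tensor) -> torch.Tensor:
        # unary: (batch, num_classes, H, W) raw scores from the CNN backbone
        q = torch.softmax(unary, dim=1)
        for _ in range(self.num_iters):
            msg = self.message(q)                                    # aggregate neighbouring beliefs
            pairwise = self.compat(msg.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
            q = torch.softmax(unary - pairwise, dim=1)               # update and renormalize
        return q
```

Because every step is differentiable, the CRF iterations can be trained end-to-end with the feature extractor, which is the property the abstract contrasts against CRF post-processing.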


Sensors ◽  
2020 ◽  
Vol 20 (17) ◽  
pp. 4965 ◽  
Author(s):  
Shoucong Xiong ◽  
Hongdi Zhou ◽  
Shuai He ◽  
Leilei Zhang ◽  
Qi Xia ◽  
...  

Accidental failures of rotating machinery components such as rolling bearings may trigger the sudden breakdown of a whole manufacturing system, so fault diagnosis is vital in industry to avoid massive economic costs and casualties. Since convolutional neural networks (CNNs) struggle to extract reliable features from raw signal data, time-frequency analysis is usually employed to transform the 1D signal into a 2D time-frequency coefficient matrix in which richer information is exposed more easily. However, realistic fault diagnosis applications face a dilemma: the time-frequency analysis and the fault classification cannot be implemented together, so manual signal conversion is still required, which reduces the integrity and robustness of the fault diagnosis method. In this paper, a novel network named WPT-CNN is proposed for end-to-end intelligent fault diagnosis of rolling bearings. WPT-CNN uses the standard deep neural network structure to realize the wavelet packet transform (WPT) time-frequency analysis, seamlessly integrating fault diagnosis domain knowledge into the deep learning algorithm. The overall network architecture can be trained with gradient-descent backpropagation, meaning that the time-frequency analysis module of WPT-CNN can also learn the dataset characteristics and adaptively represent the signal in the most suitable way. Two experimental rolling bearing fault datasets were used to validate the proposed method. Testing results showed that WPT-CNN obtained testing accuracies of 99.73% and 99.89% on the two datasets, respectively, exhibiting better and more reliable diagnostic performance than the other deep learning and machine learning methods compared against.
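The key idea, realizing the WPT with standard network layers so that the whole pipeline is trainable by backpropagation, can be sketched as below. The Haar filter pair, the number of decomposition levels, and the decision to make the filters trainable are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletPacketLayer(nn.Module):
    """One WPT level as a strided 1D convolution: every band splits into a low and a high band."""

    def __init__(self, trainable: bool = True):
        super().__init__()
        # Haar analysis filters as a simple starting point (assumption).
        low = torch.tensor([[0.7071, 0.7071]])
        high = torch.tensor([[0.7071, -0.7071]])
        kernel = torch.stack([low, high], dim=0)            # (2, 1, 2): two output bands per input band
        self.kernel = nn.Parameter(kernel, requires_grad=trainable)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, bands, length); apply the same filter pair to every band independently.
        b, c, n = x.shape
        y = F.conv1d(x.reshape(b * c, 1, n), self.kernel, stride=2)
        return y.reshape(b, 2 * c, -1)                      # each band now split into two half-length bands


class WPTFrontEnd(nn.Module):
    """Stack of WPT levels producing a (bands x frames) time-frequency map for the CNN classifier."""

    def __init__(self, levels: int = 4, trainable: bool = True):
        super().__init__()
        self.levels = nn.ModuleList(WaveletPacketLayer(trainable) for _ in range(levels))

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        x = signal.unsqueeze(1)                             # (batch, 1, length) raw vibration signal
        for level in self.levels:
            x = level(x)
        return x                                            # (batch, 2**levels, length // 2**levels)
```

Because the filter bank is made of ordinary convolution layers, gradients from the downstream classifier reach the time-frequency front end, which is what allows it to adapt to the dataset as the abstract describes.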


2021 ◽  
pp. 1-33
Author(s):  
Chihiro Watanabe ◽  
Hirokazu Kameoka

Deep neural networks (DNNs) have achieved substantial predictive performance in various speech processing tasks. In particular, it has been shown that a monaural speech separation task can be successfully solved with a DNN-based method called deep clustering (DC), which uses a DNN to assign a continuous embedding vector to each time-frequency (TF) bin and to measure how likely each pair of TF bins is to be dominated by the same speaker. In DC, the DNN is trained so that the embedding vectors for TF bins dominated by the same speaker are pulled close to each other. One concern regarding DC is that the embedding process described by the DNN is a black box that is usually very hard to interpret. A potential weakness of this noninterpretable structure is that it lacks the flexibility to address mismatches between training and test conditions (caused by reverberation, for instance). To overcome this limitation, in this letter we propose explainable deep clustering (X-DC), whose network architecture can be interpreted as fitting learnable spectrogram templates to an input spectrogram followed by Wiener filtering. During training, the elements of the spectrogram templates and their activations are constrained to be nonnegative, which promotes sparsity and thus improves interpretability. The main advantage of this framework is that its physically interpretable structure naturally allows us to incorporate a model adaptation mechanism into the network. We experimentally show that the proposed X-DC enables us to visualize and understand the clues the model uses to determine the embedding vectors, while achieving speech separation performance comparable to that of the original DC models.
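For context, the standard deep clustering objective referred to above is typically written as an affinity-matching loss, ||VV^T - YY^T||_F^2, between pairwise embedding similarities and speaker-dominance labels. The sketch below follows that common formulation and is not specific to X-DC; the tensor shapes are assumptions.

```python
import torch

def deep_clustering_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Affinity loss ||VV^T - YY^T||_F^2, expanded so the (T*F) x (T*F) matrices are never formed.

    embeddings: (batch, T*F, D) unit-normalized embedding per TF bin
    labels:     (batch, T*F, C) one-hot speaker-dominance indicator per TF bin
    """
    v, y = embeddings, labels
    # ||VV^T - YY^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2
    vtv = torch.matmul(v.transpose(1, 2), v)
    vty = torch.matmul(v.transpose(1, 2), y)
    yty = torch.matmul(y.transpose(1, 2), y)
    loss = (vtv ** 2).sum(dim=(1, 2)) - 2 * (vty ** 2).sum(dim=(1, 2)) + (yty ** 2).sum(dim=(1, 2))
    return loss.mean()
```

Minimizing this loss pulls embeddings of bins dominated by the same speaker together and pushes the rest apart, which is the behaviour X-DC makes interpretable through its template-fitting view.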


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Zhen Huang ◽  
Chengkang Li ◽  
Qiang Lv ◽  
Rijian Su ◽  
Kaibo Zhou

This paper implements a deep learning-based modulation pattern recognition algorithm for communication signals, using a convolutional neural network architecture as the modulation recognizer. A multiple-parallel complex convolutional neural network architecture is proposed to meet the demands of complex baseband processing of all-digital communication signals. The architecture learns structured features of the real and imaginary parts of the baseband signal through parallel branches and fuses them at the output according to fixed rules, thereby fitting the underlying complex-valued mapping. After comparing several commonly used time-frequency analysis methods, the one that best highlights the differences between modulation patterns is selected, and the resulting time-frequency map is converted into a digital image that the deep network can process. To fully extract both the spatial and temporal characteristics of the signal, a parallel CNN-LSTM scheme (CLP) is proposed: the CNN extracts spatial features, the LSTM extracts temporal features, and the two feature sets are fused for classification. Finally, the optimal model and parameters are obtained through the design of the CNN-based modulation recognizer and the performance analysis of the network. Simulation results show that the improved convolutional neural network yields measurable performance gains in radio signal modulation recognition, promoting the application of machine learning algorithms in this field.
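A rough sketch of the parallel CNN/LSTM feature-fusion idea is given below; the layer sizes, the use of the real and imaginary parts as two input channels, and the fusion-by-concatenation choice are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ParallelCNNLSTM(nn.Module):
    """Parallel spatial (CNN) and temporal (LSTM) branches fused for modulation classification."""

    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        # Spatial branch over the 2-channel (real/imaginary) baseband sequence.
        self.cnn = nn.Sequential(
            nn.Conv1d(2, 32, 7, padding=3), nn.ReLU(),
            nn.Conv1d(32, hidden, 7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # Temporal branch over the same sequence, read as (time, features).
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, iq: torch.Tensor) -> torch.Tensor:
        # iq: (batch, 2, length) with real and imaginary parts as channels
        spatial = self.cnn(iq)                                # (batch, hidden)
        _, (h_n, _) = self.lstm(iq.transpose(1, 2))           # final hidden state: (1, batch, hidden)
        temporal = h_n.squeeze(0)
        return self.classifier(torch.cat([spatial, temporal], dim=-1))
```

Concatenating the two feature vectors before the classifier is the simplest realization of the fusion step the abstract describes; attention-based or weighted fusion are equally plausible readings.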


Author(s):  
Houda Abouzid ◽  
Otman Chakkor

Most of the sounds we hear are mixtures of several audio sources. Human beings have the ability to concentrate on a single source of interest and ignore the other sources as disturbing background noise. To give a machine this ability, the signal must first pass through a source separation process. When there is not enough information about how the sources were mixed, or about their nature, the problem is known as blind source separation (BSS). This thesis studies BSS as a solution for human-machine interaction. The objective is to recover one or several source signals from a given mixture signal. Research is currently moving towards artificial intelligence and machine learning applications, and the proposed approach applies a deep neural network method based on Keras: features are extracted from the audio with signal processing techniques, and machine learning is used to learn a representation of the audio for compression and noise suppression, with the aim of improving on the state of the art.
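The abstract does not detail the network, so the following is only a minimal illustrative Keras sketch of a frame-wise mask-estimation DNN for pulling one source out of a mixture spectrogram; the layer sizes, the sigmoid mask formulation, and the MSE target are all assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_mask_estimator(num_bins: int = 513, hidden: int = 256) -> keras.Model:
    """Feed-forward DNN predicting a [0, 1] mask for one source, frame by frame."""
    inputs = keras.Input(shape=(num_bins,))                # magnitude spectrum of one frame
    x = layers.Dense(hidden, activation="relu")(inputs)
    x = layers.Dense(hidden, activation="relu")(x)
    mask = layers.Dense(num_bins, activation="sigmoid")(x)
    return keras.Model(inputs, mask)

model = build_mask_estimator()
model.compile(optimizer="adam", loss="mse")                # trained against an ideal ratio mask

# Toy usage: apply the predicted mask to the mixture magnitude to isolate the source.
mixture = np.abs(np.random.randn(4, 513)).astype("float32")
estimated_source = mixture * model.predict(mixture)
```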

