On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation

10.36227/techrxiv.13012760 ◽

2020 ◽

Author(s):

Aaron Nicolson ◽

Kuldip K. Paliwal

Keyword(s):

Deep Learning ◽

Minimum Mean Square Error ◽

Auditory Scene Analysis ◽

Spectrum Estimation ◽

Learning Approaches ◽

Computational Auditory Scene Analysis ◽

Convolutional Network ◽

Magnitude Spectrum ◽

Front End ◽

ASR System

The estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three main categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) training targets. In this study, we aim to determine which training target produces enhanced/separated speech at the highest quality and intelligibility, and which is most suitable as a front-end for robust ASR. The training targets were evaluated using a temporal convolutional network (TCN) on the DEMAND Voice Bank and Deep Xi datasets, which include real-world non-stationary and coloured noise sources at multiple SNR levels. Seven objective measures were used, including the word error rate (WER) of the Deep Speech ASR system. We find that MMSE training targets produce the highest objective quality scores. We also find that CASA training targets, in particular the ideal ratio mask (IRM), produce the highest intelligibility scores and perform best as a front-end for robust ASR.
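As an illustration of the ideal ratio mask (IRM) named in this abstract, the sketch below computes a mask from clean speech and noise magnitude spectra using the commonly cited square-root form, sqrt(|S|^2 / (|S|^2 + |D|^2)). This is a minimal sketch, not the authors' implementation: the function name `ideal_ratio_mask`, the `eps` guard, and the toy arrays are assumptions made for illustration. In a typical pipeline the mask would be multiplied with the noisy magnitude spectrum to estimate the clean one.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """Per-bin IRM: sqrt(|S|^2 / (|S|^2 + |D|^2)).

    clean_mag, noise_mag: arrays of shape (frames, frequency bins).
    eps is a hypothetical guard against division by zero.
    """
    clean_pow = np.square(clean_mag)
    noise_pow = np.square(noise_mag)
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

# Toy magnitude spectra; the values are illustrative only.
clean_mag = np.array([[3.0, 0.0], [1.0, 2.0]])
noise_mag = np.array([[4.0, 1.0], [1.0, 0.0]])

irm = ideal_ratio_mask(clean_mag, noise_mag)
print(irm)  # values lie in [0, 1]; near 1 where speech dominates
```

The mask is bounded between 0 and 1, which is one reason it is a convenient training target for a network with a sigmoid output layer.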

On training targets for deep learning approaches to clean speech magnitude spectrum estimation

The Journal of the Acoustical Society of America ◽

10.1121/10.0004823 ◽

2021 ◽

Vol 149 (5) ◽

pp. 3273-3293

Author(s):

Aaron Nicolson ◽

Kuldip K. Paliwal

Keyword(s):

Deep Learning ◽

Spectrum Estimation ◽

Learning Approaches ◽

Magnitude Spectrum

Speech Enhancement Using Deep Learning Methods: A Review

Jurnal Elektronika dan Telekomunikasi ◽

10.14203/jet.v21.19-26 ◽

2021 ◽

Vol 21 (1) ◽

pp. 19

Author(s):

Asri Rizki Yuliani ◽

M. Faizal Amri ◽

Endang Suryawati ◽

Ade Ramdan ◽

Hilman Ferdinandus Pardede

Keyword(s):

Neural Network ◽

Deep Learning ◽

Speech Enhancement ◽

Speech Signal ◽

Research Field ◽

Learning Technologies ◽

Learning Approaches ◽

Speech Signal Processing ◽

Generative Adversarial Network ◽

Advantages And Disadvantages

Speech enhancement, which aims to recover clean speech from a corrupted signal, plays an important role in digital speech signal processing. Approaches to speech enhancement vary according to the type of degradation and noise in the speech signal. The research topic therefore remains challenging in practice, particularly when dealing with highly non-stationary noise and reverberation. Recent advances in deep learning technologies have provided great support for progress in the speech enhancement research field. Deep learning is known to outperform the statistical models used in conventional speech enhancement. Hence, it deserves a dedicated survey. In this review, we describe the advantages and disadvantages of recent deep learning approaches. We also discuss the challenges and trends of this field. From the reviewed works, we conclude that the trend in deep learning architectures has shifted from the standard deep neural network (DNN) to the convolutional neural network (CNN), which can efficiently learn temporal information of the speech signal, and the generative adversarial network (GAN), which uses two networks in training.

A Higher Intelligibility Speech-Enhancement Algorithm

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.321-324.1075 ◽

2013 ◽

Vol 321-324 ◽

pp. 1075-1079

Author(s):

Peng Liu ◽

Jian Fen Ma

Keyword(s):

Speech Enhancement ◽

Speech Intelligibility ◽

Minimum Mean Square Error ◽

A Priori ◽

Objective Evaluation ◽

Mean Square ◽

Magnitude Spectrum ◽

Gain Matrix ◽

Speech Distortion ◽

Enhancement Algorithm

A higher intelligibility speech-enhancement algorithm based on the subspace approach is proposed. The majority of existing speech-enhancement algorithms cannot effectively improve the intelligibility of enhanced speech. One important reason is that they only use the minimum mean square error (MMSE) to constrain speech distortion, but ignore that differences between speech distortion regions have a significant effect on intelligibility. The a priori signal-to-noise ratio (SNR) and gain matrix were used to determine the distortion region. The gain matrix was then modified to constrain magnitude spectrum amplification distortion in excess of 6.02 dB, which substantially damages intelligibility. Both objective evaluation and subjective listening show that the proposed algorithm improves the intelligibility of the enhanced speech.
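The 6.02 dB figure in this abstract corresponds to an estimated magnitude roughly twice the clean magnitude, since 20 log10(2) ≈ 6.02 dB. The sketch below shows that arithmetic and flags bins whose amplification exceeds the threshold. It is an oracle illustration only: it assumes the clean magnitude is known, whereas the abstract states that the a priori SNR and gain matrix are used to determine the distortion region. The function name, the `eps` guard, and the toy values are hypothetical.

```python
import numpy as np

# 6.02 dB expressed as a magnitude ratio (approximately 2).
ratio_limit = 10.0 ** (6.02 / 20.0)

def amplification_db(est_mag, clean_mag, eps=1e-12):
    """Amplification of the estimate relative to the clean magnitude, in dB.

    Positive values mean the estimate is larger than the clean magnitude.
    eps is a hypothetical guard against division by zero.
    """
    return 20.0 * np.log10((est_mag + eps) / (clean_mag + eps))

# Toy magnitudes for three time-frequency bins; illustrative only.
est_mag = np.array([1.0, 2.5, 4.0])
clean_mag = np.array([1.0, 1.0, 2.5])

distortion_db = amplification_db(est_mag, clean_mag)
exceeds_limit = distortion_db > 6.02  # bins with excessive amplification
print(ratio_limit, distortion_db, exceeds_limit)
```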

DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-based Speech Enhancement

10.36227/techrxiv.14384909 ◽

2021 ◽

Author(s):

Sujan Kumar Roy ◽

Aaron Nicolson ◽

Kuldip K. Paliwal

Keyword(s):

Deep Learning ◽

Kalman Filter ◽

Speech Enhancement ◽

Linear Prediction ◽

Power Spectra ◽

Previous Method ◽

Learning Approaches ◽

Convolutional Network ◽

Listening Tests ◽

Prediction Coefficient

Current augmented Kalman filter (AKF)-based speech enhancement algorithms utilise a temporal convolutional network (TCN) to estimate the clean speech and noise linear prediction coefficients (LPCs). However, the multi-head attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than TCNs. Motivated by this, we investigate the MHANet for LPC estimation. We aim to produce clean speech and noise LPC parameters with the least bias to date. With this, we also aim to produce higher quality and more intelligible enhanced speech than any current KF or AKF-based speech enhancement algorithm (SEA). Here, we investigate MHANet within the DeepLPC framework. DeepLPC is a deep learning framework for jointly estimating the clean speech and noise LPC power spectra. DeepLPC is selected as it exhibits significantly less bias than other frameworks, by avoiding the use of whitening filters and post-processing. DeepLPC-MHANet is evaluated on the NOIZEUS corpus using subjective AB listening tests, as well as seven different objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR). DeepLPC-MHANet is compared to five existing deep learning-based methods. Compared to other deep learning approaches, DeepLPC-MHANet produced clean speech LPC estimates with the least amount of bias. DeepLPC-MHANet-AKF also produced higher objective scores than any of the competing methods (with an improvement of 0.17 for CSIG, 0.15 for CBAK, 0.19 for COVL, 0.24 for PESQ, 3.70% for STOI, 1.03 dB for SegSNR, and 1.04 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC-MHANet-AKF was also the most preferred amongst ten listeners. By producing LPC estimates with the least amount of bias to date, DeepLPC-MHANet enables the AKF to produce enhanced speech at a higher quality and intelligibility than any previous method.
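Of the objective measures listed in this abstract, SI-SDR has a compact closed form, so a short sketch may help. The version below follows the usual scale-invariant definition: project the estimate onto the reference, then compare the energy of that projection with the energy of the residual. It is not the evaluation code used in the paper; the function name, the mean removal step, the `eps` guard, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    # Mean removal is an assumption here; some implementations skip it.
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Optimal scaling of the reference towards the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )

# Synthetic stand-ins for clean and enhanced speech; illustrative only.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
estimate = reference + 0.1 * rng.standard_normal(16000)

score = si_sdr(estimate, reference)
score_rescaled = si_sdr(2.0 * estimate, reference)  # rescaling the estimate
print(score, score_rescaled)
```

Rescaling the estimate leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from a plain SNR.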

A Natural Images Pre-Trained Deep Learning Method for Seismic Random Noise Attenuation

Remote Sensing ◽

10.3390/rs14020263 ◽

2022 ◽

Vol 14 (2) ◽

pp. 263

Author(s):

Haixia Zhao ◽

Tingting Bai ◽

Zhiqiang Wang

Keyword(s):

Deep Learning ◽

Seismic Data ◽

Field Data ◽

Noise Suppression ◽

Signal To Noise Ratio ◽

Random Noise ◽

Natural Images ◽

Training Data ◽

Learning Approaches ◽

Learning Method

Seismic field data are usually contaminated by random or complex noise, which seriously degrades the quality of the data and, in turn, seismic imaging and interpretation. Improving the signal-to-noise ratio (SNR) of seismic data has always been a key step in seismic data processing. Deep learning approaches have been successfully applied to suppress seismic random noise. Training examples are essential in deep learning methods, especially for geophysical problems, where complete training data are not easy to acquire due to the high cost of acquisition. In this work, we propose a natural images pre-trained deep learning method that draws on transfer learning to suppress seismic random noise. Our network contains pre-trained and post-trained networks: the former is trained on natural images to obtain preliminary denoising results, while the latter is trained on a small number of seismic images to fine-tune the denoising by semi-supervised learning and enhance the continuity of geological structures. The results on four types of synthetic seismic data and six field data sets demonstrate that our network performs well in seismic random noise suppression in terms of both quantitative metrics and visual quality.

Improved Performance in the Detection of ACO-OFDM Modulated Signals Using Deep Learning Modules

Applied Sciences ◽

10.3390/app10238380 ◽

2020 ◽

Vol 10 (23) ◽

pp. 8380

Author(s):

Laialy Darwesh ◽

Natan Kopeika

Keyword(s):

Deep Learning ◽

Orthogonal Frequency Division Multiplexing ◽

Signal To Noise Ratio ◽

Minimum Mean Square Error ◽

Optical Power ◽

Detection Methods ◽

Free Space Optical ◽

Strong Turbulence ◽

Energy Efficiency Improvement ◽

MMSE Estimator

Free space optical (FSO) communication is widely deployed to transmit at high data rates and meet rapidly increasing communication traffic. Asymmetrically clipped optical orthogonal frequency division multiplexing (ACO-OFDM) modulation is a very efficient FSO communication technique in terms of transmitted optical power. However, its performance is limited by atmospheric turbulence. When the channel includes strong turbulence or is non-deterministic, the bit error rate (BER) increases. To reach optimal performance, the ACO-OFDM decoder needs accurate channel state information (CSI). We propose a novel detection approach using different deep learning (DL) algorithms. Our DL models are compared with minimum mean square error (MMSE) detection methods in different turbulent channels and improve performance, especially for non-stationary and non-deterministic channels. Our models yield performance very close to that of the MMSE estimator when the channel is characterized by weak or strong turbulence and is stationary. However, when the channel is non-stationary and variable, our DL model succeeds in improving the performance of the system and decreasing the signal-to-noise ratio (SNR) by more than 8 dB compared to that of the MMSE estimator, and it succeeds in recovering the received data without needing accurate CSI. Our DL decoders also show notable improvements in speed and energy efficiency.

Baby Cry Detection: Deep Learning and Classical Approaches

10.31234/osf.io/p8xgm ◽

2019 ◽

Author(s):

Rami Cohen ◽

Dima Ruinskiy ◽

Janis Zickfeld ◽

Hans IJzerman ◽

Yizhar Lavner

Keyword(s):

Neural Network ◽

Deep Learning ◽

Social Research ◽

Signal To Noise Ratio ◽

Support Vector ◽

Temporal Behavior ◽

Learning Approaches ◽

Signal To Noise ◽

Vector Machines ◽

Domestic Environments

In this chapter, we compare deep learning and classical approaches for detection of baby cry sounds in various domestic environments under challenging signal-to-noise ratio conditions. Automatic cry detection has applications in commercial products (such as baby remote monitors) as well as in medical and psycho-social research. We design and evaluate several convolutional neural network (CNN) architectures for baby cry detection, and compare their performance to that of classical machine-learning approaches, such as logistic regression and support vector machines. In addition to feed-forward CNNs, we analyze the performance of recurrent neural network (RNN) architectures, which are able to capture temporal behavior of acoustic events. We show that by carefully designing CNN architectures with specialized non-symmetric kernels, better results are obtained compared to common CNN architectures.

Deep learning for minimum mean-square error approaches to speech enhancement

Speech Communication ◽

10.1016/j.specom.2019.06.002 ◽

2019 ◽

Vol 111 ◽

pp. 44-55 ◽

Cited By ~ 20

Author(s):

Aaron Nicolson ◽

Kuldip K. Paliwal

Keyword(s):

Deep Learning ◽

Mean Square Error ◽

Speech Enhancement ◽

Minimum Mean Square Error ◽

Mean Square
