Unsupervised single-channel speech enhancement based on phase aware time-frequency mask estimation

2019 ◽  
Vol 78 (22) ◽  
pp. 31867-31891
Author(s):  
Nasir Saleem ◽  
Muhammad Irfan Khattak

Author(s):  
Nasir Saleem ◽  
Muhammad Irfan Khattak ◽  
Gunawan Witjaksono ◽  
Gulzar Ahmad

2010 ◽  
Vol 8 ◽  
pp. 95-99
Author(s):  
F. X. Nsabimana ◽  
V. Subbaraman ◽  
U. Zölzer

Abstract. To enhance extremely corrupted speech signals, an Improved Psychoacoustically Motivated Spectral Weighting Rule (IPMSWR) is proposed, which controls the predefined residual noise level by a time-frequency-dependent parameter. Unlike conventional Psychoacoustically Motivated Spectral Weighting Rules (PMSWR), the level of the residual noise is varied throughout the enhanced speech based on the discrimination between regions of speech presence and speech absence, determined by means of the segmental SNR within critical bands. Controlling the level of the residual noise in noise-only regions in this way avoids the unpleasant residual noise perceived at very low SNRs. Deriving the gain coefficients requires the computation of the masking curve and an estimate of the corrupting noise power. Since clean speech is generally not available to a single-channel speech enhancement technique, the rough clean-speech components needed to compute the masking curve are obtained here using advanced spectral subtraction techniques. To estimate the corrupting noise, a new technique is employed that relies on noise power estimation using rapid adaptation and recursive smoothing principles. The performance of the proposed approach is compared objectively and subjectively to conventional approaches to highlight the aforementioned improvements.
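The residual-noise control described above can be read as a spectral gain whose noise floor varies per band with the segmental SNR. The following is a hypothetical numpy simplification of that idea, not the authors' exact psychoacoustic weighting rule; the function names and floor values are illustrative assumptions.

```python
import numpy as np

def snr_dependent_floor(seg_snr_db, floor_speech=0.1, floor_noise=0.02):
    """Hypothetical per-bin residual-noise floor: a lower floor in
    noise-only regions (low segmental SNR) suppresses the unpleasant
    residual noise; a higher floor in speech regions masks distortion."""
    return np.where(seg_snr_db > 0.0, floor_speech, floor_noise)

def spectral_weighting(noisy_power, noise_power):
    """Spectral-subtraction-style gain with an SNR-dependent floor
    (an illustrative stand-in for the paper's psychoacoustic rule)."""
    eps = 1e-12
    seg_snr_db = 10.0 * np.log10(
        np.maximum(noisy_power / np.maximum(noise_power, eps) - 1.0, eps))
    raw_gain = np.sqrt(
        np.maximum(1.0 - noise_power / np.maximum(noisy_power, eps), 0.0))
    return np.maximum(raw_gain, snr_dependent_floor(seg_snr_db))

noisy = np.array([4.0, 1.1, 0.5])   # |Y|^2 per frequency bin
noise = np.array([1.0, 1.0, 0.5])   # estimated noise power per bin
gain = spectral_weighting(noisy, noise)  # strong bin kept, noise-only bin floored
```

The noise-only bin (where noisy power equals the noise estimate) receives only the small floor gain rather than zero, which is what keeps the residual noise natural-sounding instead of introducing musical-noise artifacts.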


2020 ◽  
Vol 39 (5) ◽  
pp. 6881-6889
Author(s):  
Jie Wang ◽  
Linhuang Yan ◽  
Jiayi Tian ◽  
Minmin Yuan

In this paper, a bilateral spectrogram filtering (BSF)-based optimally modified log-spectral amplitude (OMLSA) estimator for single-channel speech enhancement is proposed. It significantly improves the performance of OMLSA, especially in highly non-stationary noise environments, by using bilateral filtering (BF), a technology widely used in image and visual processing, to preprocess the spectrogram of the noisy speech. When a speech spectrogram is treated as an image, BSF is capable not only of sharpening details and removing unwanted textures or background noise from the noisy spectrogram, but also of preserving edges. The a posteriori signal-to-noise ratio (SNR) of the OMLSA algorithm is estimated after applying BSF to the noisy speech. In addition, to reduce computing cost, a fast and accurate BF is adopted that reduces the algorithm complexity to O(1) per time-frequency bin. Finally, the proposed algorithm is compared with the original OMLSA and other classic denoising methods on various noise types at different signal-to-noise ratios, using objective evaluation metrics such as segmental signal-to-noise ratio improvement and perceptual evaluation of speech quality. The results demonstrate the validity of the improved BSF-based OMLSA algorithm.
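Treating the spectrogram as an image, a bilateral filter weights each neighbor by both spatial closeness and intensity similarity, so it smooths noise while keeping spectral edges sharp. Below is a direct, naive numpy sketch of plain bilateral filtering on a spectrogram; the paper adopts a fast O(1)-per-bin variant instead, and the parameter values here are illustrative assumptions.

```python
import numpy as np

def bilateral_filter_spectrogram(S, radius=1, sigma_s=1.0, sigma_r=0.5):
    """Naive bilateral filter over a (time x frequency) log-magnitude
    spectrogram S: each bin becomes a weighted mean of its neighborhood,
    weighted by spatial distance AND intensity difference, so sharp
    spectral edges survive the smoothing."""
    T, F = S.shape
    out = np.zeros_like(S)
    for t in range(T):
        for f in range(F):
            wsum, acc = 0.0, 0.0
            for dt in range(-radius, radius + 1):
                for df in range(-radius, radius + 1):
                    tt, ff = t + dt, f + df
                    if 0 <= tt < T and 0 <= ff < F:
                        # spatial closeness times range (intensity) similarity
                        w = np.exp(
                            -(dt * dt + df * df) / (2.0 * sigma_s ** 2)
                            - (S[tt, ff] - S[t, f]) ** 2 / (2.0 * sigma_r ** 2))
                        wsum += w
                        acc += w * S[tt, ff]
            out[t, f] = acc / wsum
    return out

S = np.zeros((3, 4))
S[:, 2:] = 10.0                       # a hard spectral edge
Sf = bilateral_filter_spectrogram(S)  # neighbors across the edge get ~0 weight
```

With a small range sigma, bins on opposite sides of the 0-to-10 step contribute essentially zero weight to each other, so the edge passes through unchanged; a plain Gaussian blur would smear it.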


2020 ◽  
Vol 34 (05) ◽  
pp. 9458-9465
Author(s):  
Dacheng Yin ◽  
Chong Luo ◽  
Zhiwei Xiong ◽  
Wenjun Zeng

Time-frequency (T-F) domain masking is a mainstream approach to single-channel speech enhancement. Recently, attention has turned to phase prediction in addition to amplitude prediction. In this paper, we propose a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, for this task. Unlike previous methods, which directly use a complex ideal ratio mask to supervise DNN learning, we design a two-stream network in which an amplitude stream and a phase stream are dedicated to amplitude and phase prediction, respectively. We discover that the two streams should communicate with each other, and that this communication is crucial to phase prediction. In addition, we propose frequency transformation blocks to capture long-range correlations along the frequency axis. Visualization shows that the learned transformation matrix implicitly captures harmonic correlation, which has been proven helpful for T-F spectrogram reconstruction. With these two innovations, PHASEN acquires the ability to handle detailed phase patterns and to exploit harmonic patterns, achieving a 1.76 dB SDR improvement on the AVSpeech + AudioSet dataset. It also achieves significant gains over Google's network on this dataset. On the Voice Bank + DEMAND dataset, PHASEN outperforms previous methods by a large margin on four metrics.
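A frequency transformation block amounts to a learned full-band linear map applied along the frequency axis of each feature map, which is what lets it relate a bin to its distant harmonics. A minimal numpy sketch of this operation follows; the matrix `W` is a placeholder for the trained parameters, not weights from the paper.

```python
import numpy as np

def frequency_transformation_block(X, W):
    """Apply a (freq x freq) transform along the frequency axis of a
    (channels, time, freq) feature map. In PHASEN this matrix is learned;
    harmonic correlation would appear as off-diagonal energy linking a
    bin with its integer multiples. Here W is simply passed in."""
    return np.einsum('ctf,fg->ctg', X, W)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5, 8))
Y = frequency_transformation_block(X, np.eye(8))  # identity transform: a no-op
```

Because the transform mixes all frequency bins in one matrix multiply, its receptive field along the frequency axis is global, unlike a convolution whose reach is limited by its kernel size.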


Author(s):  
Chuanxin Tang ◽  
Chong Luo ◽  
Zhiyuan Zhao ◽  
Wenxuan Xie ◽  
Wenjun Zeng

For single-channel speech enhancement, time-domain and time-frequency-domain methods each have their respective pros and cons. In this paper, we present a cross-domain framework named TFT-Net, which takes a time-frequency spectrogram as input and produces a time-domain waveform as output. Such a framework exploits the knowledge we have about spectrograms while avoiding some of the drawbacks from which T-F-domain methods have suffered. In TFT-Net, we design an innovative dual-path attention block (DAB) to fully exploit correlations along the time and frequency axes. We further discover that a sample-independent DAB (SDAB) achieves a good tradeoff between enhanced speech quality and complexity. Ablation studies show that both the cross-domain design and the SDAB block bring large performance gains. When logarithmic MSE is used as the training criterion, TFT-Net achieves the highest SDR and SSNR among state-of-the-art methods on two major speech enhancement benchmarks.
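Sample-independent attention can be understood as attention maps that are learned parameters shared across all inputs, so no query/key products need to be computed at run time. The following is a hypothetical numpy sketch of that idea, not the paper's exact SDAB architecture; the names and the softmax normalization are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sample_independent_dual_path(X, A_t, A_f):
    """Dual-path mixing with sample-independent 'attention': A_t
    (time x time) and A_f (freq x freq) are fixed learned parameters,
    so each path reduces to a row-stochastic matrix multiply."""
    X = softmax(A_t, axis=-1) @ X       # mix information along the time axis
    X = X @ softmax(A_f, axis=-1).T     # mix information along the frequency axis
    return X

X = np.arange(24, dtype=float).reshape(4, 6)
A_t = 1000.0 * np.eye(4)    # sharply peaked on the diagonal
A_f = 1000.0 * np.eye(6)
Y = sample_independent_dual_path(X, A_t, A_f)  # near-identity mixing
```

Since the mixing matrices do not depend on the input, the per-sample cost drops from the quadratic query/key computation of standard attention to two fixed matrix multiplies, which is one plausible reading of the quality/complexity tradeoff mentioned above.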

