On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation

2020 ◽  
Author(s):  
Aaron Nicolson ◽  
Kuldip K. Paliwal

The estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three main categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) training targets. In this study, we aim to determine which training target produces enhanced/separated speech of the highest quality and intelligibility, and which is most suitable as a front-end for robust ASR. The training targets were evaluated using a temporal convolutional network (TCN) on the DEMAND Voice Bank and Deep Xi datasets, which include real-world non-stationary and coloured noise sources at multiple SNR levels. Seven objective measures were used, including the word error rate (WER) of the Deep Speech ASR system. We find that MMSE training targets produce the highest objective quality scores. We also find that CASA training targets, in particular the ideal ratio mask (IRM), produce the highest intelligibility scores and perform best as a front-end for robust ASR.
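
Among the CASA training targets named above, the ideal ratio mask can be computed directly from the clean and noise magnitude spectra. A minimal numpy sketch of one common IRM definition (the squared-magnitude form with exponent 0.5; the exact exponent and the toy spectra are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    """One common IRM definition: (|S|^2 / (|S|^2 + |N|^2))**beta."""
    s2, n2 = clean_mag**2, noise_mag**2
    return (s2 / (s2 + n2 + 1e-12))**beta

# Toy magnitude spectra (frames x frequency bins); shapes are illustrative.
S = np.array([[1.0, 2.0], [0.5, 0.0]])
N = np.array([[1.0, 0.0], [0.5, 1.0]])
M = ideal_ratio_mask(S, N)
# Applying the mask to a noisy MS (additive-in-power toy construction)
# approximates the clean MS.
enhanced = M * np.sqrt(S**2 + N**2)
```

A network trained with this target regresses M from the noisy spectrum; the mask is bounded in [0, 1], which is often cited as a reason CASA targets are easy to learn.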


2021 ◽  
Author(s):  
Aaron Nicolson ◽  
Kuldip K. Paliwal

Estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) estimator training targets. The choice of training target can have a significant impact on speech enhancement/separation and robust ASR performance. Motivated by this, we determine which training target produces enhanced/separated speech of the highest quality and intelligibility, and which is best suited to an ASR front-end. Three different deep neural network (DNN) types and two datasets that include real-world non-stationary and coloured noise sources at multiple SNR levels were used for evaluation. Ten objective measures were employed, including the word error rate (WER) of the Deep Speech ASR system. We find that training targets that estimate the <i>a priori</i> signal-to-noise ratio (SNR) for MMSE estimators produce the highest objective quality scores. Moreover, we find that the gain of MMSE estimators and the ideal amplitude mask (IAM) produce the highest objective intelligibility scores and are most suitable for an ASR front-end.
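
The a priori SNR and IAM training targets mentioned above are both simple functions of the clean, noise, and noisy magnitude spectra. A minimal numpy sketch of the instantaneous versions (the additive-in-power construction of the toy noisy spectrum is an illustrative assumption; the mapped training targets used in practice are more involved):

```python
import numpy as np

def ideal_amplitude_mask(clean_mag, noisy_mag):
    """IAM: ratio of the clean to the noisy magnitude spectrum."""
    return clean_mag / (noisy_mag + 1e-12)

def a_priori_snr_db(clean_mag, noise_mag):
    """Instantaneous a priori SNR in dB: 10*log10(|S|^2 / |N|^2)."""
    return 10.0 * np.log10((clean_mag**2 + 1e-12) / (noise_mag**2 + 1e-12))

# Toy per-bin magnitudes; the noisy spectrum is built additively in power.
S = np.array([2.0, 1.0])
N = np.array([1.0, 1.0])
Y = np.sqrt(S**2 + N**2)
iam = ideal_amplitude_mask(S, Y)
xi_db = a_priori_snr_db(S, N)
```

An MMSE-estimator gain function is then computed from the estimated a priori SNR at inference time, rather than regressed directly.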


Author(s):  
Tianyi Zhao ◽  
Yang Hu ◽  
Liang Cheng

Abstract Motivation: Functional changes in genes, RNAs and proteins are eventually reflected at the metabolic level. A growing number of researchers have investigated mechanisms, biomarkers and targeted drugs through metabolites. However, compared with our knowledge of genes, RNAs, and proteins, we still know little about disease-related metabolites. The few existing methods for identifying disease-related metabolites ignore the chemical structure of metabolites, fail to recognise the association patterns between metabolites and diseases, and cannot be applied to isolated diseases and metabolites. Results: In this study, we present a graph deep learning based method, named Deep-DRM, for identifying disease-related metabolites. First, the chemical structures of metabolites were used to calculate metabolite similarities. Disease similarities were obtained from their functional gene networks and semantic associations. From these, both a metabolite network and a disease network could be built. Next, a Graph Convolutional Network (GCN) was applied to encode the features of metabolites and diseases, respectively. The dimensionality of these features was then reduced by principal component analysis (PCA), retaining 99% of the information. Finally, a deep neural network was built to identify true metabolite-disease pairs (MDPs) from these features. Ten-fold cross-validation on three testing setups showed an outstanding AUC (0.952) and AUPR (0.939) for Deep-DRM compared with previous methods and similar approaches. Ten of the top 15 predicted disease-metabolite associations are supported by other studies, which suggests that Deep-DRM is an efficient method for identifying MDPs. Contact: [email protected]. Availability and implementation: https://github.com/zty2009/GPDNN-for-Identify-ing-Disease-related-Metabolites.
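
The PCA step described above, reducing feature dimensionality while retaining 99% of the variance, can be sketched with a plain SVD (a generic illustration, not the authors' code; the toy data and component-selection rule are assumptions):

```python
import numpy as np

def pca_reduce(X, retained=0.99):
    """Project X (samples x features) onto the fewest principal
    components whose cumulative explained-variance ratio reaches
    `retained`."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = (s**2) / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(var_ratio), retained) + 1)
    return Xc @ Vt[:k].T

# Toy data: the last five features are scaled copies of the first five,
# so ~99% of the variance lives in a 5-dimensional subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 5:] = X[:, :5] * 0.01
Z = pca_reduce(X, retained=0.99)
```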


Sensors ◽  
2021 ◽  
Vol 22 (1) ◽  
pp. 157
Author(s):  
Saidrasul Usmankhujaev ◽  
Bunyodbek Ibrokhimov ◽  
Shokhrukh Baydadaev ◽  
Jangwoo Kwon

Deep neural networks (DNNs) have proven efficient in computer vision and data classification, with a growing number of successful applications. Time series classification (TSC) has been one of the challenging problems in data mining over the last decade, and many solutions have been proposed, including algorithm-based approaches as well as machine and deep learning approaches. This paper focuses on combining two well-known deep learning techniques, namely the Inception module and the fully convolutional network. The proposed method proved more efficient than the previous state-of-the-art InceptionTime method. We tested our model on the univariate TSC benchmark (the UCR/UEA archive), which includes 85 time-series datasets, and showed that our network outperforms InceptionTime in terms of training time and overall accuracy on the UCR archive.
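
As a rough illustration of the Inception idea applied to time series, parallel convolutions with different kernel sizes plus a pooling branch, concatenated along a channel axis, can be sketched in numpy (random stand-in weights, single-channel input; the kernel sizes and details are illustrative, not the paper's architecture):

```python
import numpy as np

def inception_block_1d(x, kernel_sizes=(9, 19, 39), seed=0):
    """Toy Inception-style block for a univariate series: parallel
    'same'-padded convolutions at several receptive-field sizes plus
    a max-pool branch, stacked as channels."""
    rng = np.random.default_rng(seed)
    branches = [np.convolve(x, rng.normal(size=k), mode="same")
                for k in kernel_sizes]
    # Max-pool branch: window 3, stride 1, edge padding keeps length.
    padded = np.pad(x, 1, mode="edge")
    pooled = np.max(np.stack([padded[i:i + len(x)]
                              for i in range(3)]), axis=0)
    branches.append(pooled)
    return np.stack(branches)  # shape: (num_branches, len(x))

x = np.sin(np.linspace(0.0, 6.28, 128))  # toy univariate time series
out = inception_block_1d(x)
```

The differing kernel sizes are what let one block see short- and long-range patterns in the series at once; a full model stacks such blocks and ends in global pooling plus a softmax classifier.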


2021 ◽  
Author(s):  
Sujan Kumar Roy ◽  
Aaron Nicolson ◽  
Kuldip K. Paliwal

Current augmented Kalman filter (AKF)-based speech enhancement algorithms utilise a temporal convolutional network (TCN) to estimate the clean speech and noise linear prediction coefficients (LPCs). However, the multi-head attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than TCNs. Motivated by this, we investigate the MHANet for LPC estimation. We aim to produce clean speech and noise LPC parameters with the least bias to date, and with this, to produce higher quality and more intelligible enhanced speech than any current KF- or AKF-based speech enhancement algorithm. Here, we investigate the MHANet within the DeepLPC framework, a deep learning framework for jointly estimating the clean speech and noise LPC power spectra. DeepLPC is selected as it exhibits significantly less bias than other frameworks, by avoiding the use of whitening filters and post-processing. DeepLPC-MHANet is evaluated on the NOIZEUS corpus using subjective AB listening tests, as well as seven different objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR), and is compared to five existing deep learning-based methods. Compared to other deep learning approaches, DeepLPC-MHANet produced clean speech LPC estimates with the least amount of bias. DeepLPC-MHANet-AKF also produced higher objective scores than any of the competing methods (with an improvement of 0.17 for CSIG, 0.15 for CBAK, 0.19 for COVL, 0.24 for PESQ, 3.70% for STOI, 1.03 dB for SegSNR, and 1.04 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC-MHANet-AKF was also the most preferred amongst ten listeners. By producing LPC estimates with the least amount of bias to date, DeepLPC-MHANet enables the AKF to produce enhanced speech of higher quality and intelligibility than any previous method.
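
For context, the LPC parameters that such frameworks estimate are classically obtained by the autocorrelation method, i.e. solving the Yule-Walker normal equations. A numpy sketch of that classical baseline (not the paper's DeepLPC-MHANet; the AR(1) toy signal is an illustrative assumption):

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients by the autocorrelation method: build the
    Toeplitz autocorrelation matrix and solve the Yule-Walker
    equations directly."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return a  # predictor: x[n] ~ sum_k a[k] * x[n-1-k]

# An AR(1) process x[n] = 0.9*x[n-1] + e[n] should yield a[0] near 0.9.
rng = np.random.default_rng(0)
e = rng.normal(size=20000)
x = np.zeros_like(e)
for n in range(1, len(e)):
    x[n] = 0.9 * x[n - 1] + e[n]
a = lpc(x, order=1)
```

Estimating these coefficients from noisy rather than clean speech is what biases a conventional KF; the deep estimators above exist to remove that bias.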


2020 ◽  
Vol 34 (04) ◽  
pp. 5387-5394
Author(s):  
Hao Peng ◽  
Jianxin Li ◽  
Qiran Gong ◽  
Yuanxin Ning ◽  
Senzhang Wang ◽  
...  

Graph classification is critically important to many real-world applications associated with graph data, such as chemical drug analysis and social network mining. Traditional methods usually require feature engineering to extract graph features that can help discriminate between graphs of different classes. Although deep learning based graph embedding approaches have recently been proposed to learn graph features automatically, they mostly use a few vertex arrangements extracted from the graph for feature learning, which may lose structural information. In this work, we present a novel motif-based attentional graph convolutional neural network for graph classification, which can learn more discriminative and richer graph features. Specifically, a motif-matching guided subgraph normalization method is developed to better preserve spatial information. A novel subgraph-level self-attention network is also proposed to capture the different impacts or weights of different subgraphs. Experimental results on both bioinformatics and social network datasets show that the proposed models significantly improve graph classification performance over both traditional graph kernel methods and recent deep learning approaches.
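
A single graph-convolution layer of the kind such classifiers build on applies a symmetrically normalised adjacency (with self-loops) to the node features. A minimal numpy sketch (the toy triangle graph and identity weight matrix are illustrative assumptions, not the paper's motif-based model):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: D^{-1/2} (A + I) D^{-1/2} H W
    followed by a ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy triangle graph with 2-d node features and identity weights.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
H = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.eye(2)
H1 = gcn_layer(A, H, W)
```

Graph-level classification then pools these node embeddings (here, per motif-normalised subgraph in the paper's design) before a final classifier.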


2021 ◽  
Author(s):  
Wenxiang Deng ◽  
Adam Hedberg-Buenz ◽  
Dana A Soukup ◽  
Sima Taghizadeh ◽  
Michael G Anderson ◽  
...  

Purpose: Optic nerve damage is the principal feature of glaucoma and contributes to vision loss in many diseases. In animal models, nerve health has traditionally been assessed by human experts who grade damage qualitatively or manually quantify axons by sampling limited areas of histologic cross sections of nerve. Both approaches are time-consuming and prone to variability. Automated approaches have begun to emerge, but shortcomings have limited their widespread application. Here, we seek improvements through the use of deep-learning approaches for segmenting and quantifying axons from cross sections of mouse optic nerve. Methods: Two deep-learning approaches were developed and evaluated: (1) a traditional supervised approach using a fully convolutional network trained with only labeled data and (2) a semi-supervised approach trained with both labeled and unlabeled data using a generative-adversarial-network framework. Results: In comparisons with an independent test set of images with manually marked axon centers and boundaries, both deep-learning approaches outperformed an existing baseline automated approach and performed similarly to two independent experts. The semi-supervised approach performed best and was implemented in AxonDeep. Conclusions: AxonDeep performs automated quantification and segmentation of axons similarly to experts, without the time and labor constraints associated with manual performance. The quantitative and objective nature of AxonDeep reduces the variability arising from differences in model, methodology, and user that often compromise manual performance of these tasks. Translational Relevance: The use of deep learning for axon quantification provides rapid, objective, and higher-throughput analysis of optic nerve that would otherwise not be possible.
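
When automated segmentations are compared against expert annotations as above, overlap is commonly summarised with the Dice coefficient. A minimal sketch (the choice of metric and the toy masks are illustrative assumptions, not necessarily the paper's exact evaluation protocol):

```python
import numpy as np

def dice(pred, truth):
    """Dice overlap between two binary masks: 2|A & B| / (|A| + |B|).
    Returns 1.0 when both masks are empty."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

# Toy predicted and ground-truth axon masks on a 2x3 grid.
pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
score = dice(pred, truth)
```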

