Computational Auditory Scene Analysis: Latest Research Papers

On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation

DOI: 10.36227/techrxiv.13012760.v1
2020
Author(s): Aaron Nicolson, Kuldip K. Paliwal
Keyword(s): Deep Learning, Minimum Mean Square Error, Auditory Scene Analysis, Spectrum Estimation, Learning Approaches, Computational Auditory Scene Analysis, Convolutional Network, Magnitude Spectrum, Front End, ASR System

The estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three main categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) training targets. In this study, we aim to determine which training target produces enhanced/separated speech at the highest quality and intelligibility, and which is most suitable as a front-end for robust ASR. The training targets were evaluated using a temporal convolutional network (TCN) on the DEMAND Voice Bank and Deep Xi datasets, which include real-world non-stationary and coloured noise sources at multiple SNR levels. Seven objective measures were used, including the word error rate (WER) of the Deep Speech ASR system. We find that MMSE training targets produce the highest objective quality scores. We also find that CASA training targets, in particular the ideal ratio mask (IRM), produce the highest intelligibility scores and perform best as a front-end for robust ASR.
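
As a rough illustration of the ideal ratio mask (IRM) mentioned in the abstract above, the sketch below uses the commonly cited definition, (S^2 / (S^2 + N^2)) ** beta with beta = 0.5, computed per frequency bin from known clean-speech and noise magnitudes. The function names, the `eps` guard, and the toy magnitudes are assumptions for illustration only; this is not the authors' code, and the paper may define or compress the target differently.

```python
import math


def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5, eps=1e-12):
    """Per-bin IRM for one frame: (S^2 / (S^2 + N^2)) ** beta.

    speech_mag and noise_mag are lists of magnitudes, one per frequency bin.
    eps (an assumption) avoids division by zero in silent bins.
    """
    mask = []
    for s, n in zip(speech_mag, noise_mag):
        s_pow, n_pow = s * s, n * n
        mask.append((s_pow / (s_pow + n_pow + eps)) ** beta)
    return mask


def apply_mask(noisy_mag, mask):
    """Scale each noisy magnitude bin by its mask value."""
    return [m * y for m, y in zip(mask, noisy_mag)]


# Toy frame with three bins: speech-dominated, equal power, noise-dominated.
speech = [1.0, 0.5, 0.0]
noise = [0.0, 0.5, 1.0]
irm = ideal_ratio_mask(speech, noise)

# Hypothetical noisy magnitudes for the same frame.
noisy = [1.0, 1.0, 1.0]
enhanced = apply_mask(noisy, irm)
```

The mask is close to 1 where speech dominates, close to 0 where noise dominates, and about 0.707 where the two have equal power. In a learned system, a network would be trained to predict such a mask from noisy features, since the clean and noise spectra are unknown at test time.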

Robot Audition and Computational Auditory Scene Analysis

Advanced Intelligent Systems
DOI: 10.1002/aisy.202000050
2020, Vol 2 (9), pp. 2000050
Author(s): Kazuhiro Nakadai, Hiroshi G. Okuno
Keyword(s): Auditory Scene Analysis, Scene Analysis, Computational Auditory Scene Analysis, Robot Audition, Auditory Scene

Building Health Monitoring Using Computational Auditory Scene Analysis

2020 16th International Conference on Distributed Computing in Sensor Systems (DCOSS)
DOI: 10.1109/dcoss49796.2020.00033
2020
Author(s): Mitsuru Kawamoto, Takuji Hamamoto
Keyword(s): Health Monitoring, Auditory Scene Analysis, Scene Analysis, Computational Auditory Scene Analysis, Auditory Scene, Building Health Monitoring

Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053396
2020
Author(s): Qiuqiang Kong, Yuxuan Wang, Xuchen Song, Yin Cao, Wenwu Wang, ...
Keyword(s): Source Separation, Auditory Scene Analysis, Scene Analysis, Computational Auditory Scene Analysis, Auditory Scene

Separation of Reverberant Speech Based on Computational Auditory Scene Analysis

Automatic Control and Computer Sciences
DOI: 10.3103/s0146411618060068
2018, Vol 52 (6), pp. 561-571
Author(s): Li Hongyan, Cao Meng, Wang Yue
Keyword(s): Auditory Scene Analysis, Scene Analysis, Computational Auditory Scene Analysis, Auditory Scene, Reverberant Speech

An Unsupervised Two-Talker Speech Separation System Based on CASA

International Journal of Pattern Recognition and Artificial Intelligence
DOI: 10.1142/s0218001418580028
2018, Vol 32 (07), pp. 1858002, Cited By ~ 2
Author(s): Hongyan Li, Yue Wang, Rongrong Zhao, Xueying Zhang
Keyword(s): Speaker Recognition, Auditory Scene Analysis, Beam Search, Speech Separation, Separation System, Blind Separation, Computational Auditory Scene Analysis, Auditory Scene, Voiced Speech, And Performance

Building on the theory of blind separation of monaural speech based on computational auditory scene analysis (CASA), this paper proposes a two-talker speech separation system that combines CASA with speaker recognition to separate target speech from interfering speech. First, a tandem algorithm is used to organize voiced speech; then, based on the clustering of gammatone frequency cepstral coefficients (GFCCs), an objective function is established to recognize the speaker, and the best grouping is found through exhaustive search or beam search, so that voiced speech is organized sequentially. Second, unvoiced segments are generated by estimating onsets and offsets, and the unvoiced–voiced (U–V) and unvoiced–unvoiced (U–U) segments are then separated respectively. The U–V segments are handled via the binary mask of the separated voiced speech, while the U–U segments are divided evenly. This completes the separation of the unvoiced segments. Simulation and performance evaluation verify the feasibility and effectiveness of the proposed algorithm.
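
The binary mask referred to in the abstract above can be pictured with a generic two-talker example. The sketch below is a minimal illustration, not the paper's algorithm: it assumes the per-unit energies of both talkers are known and add up in the mixture, and it assigns each time-frequency unit to whichever talker is stronger. The names `binary_mask` and `apply_mask` and the toy energies are hypothetical.

```python
def binary_mask(target_energy, interferer_energy):
    """Return 1 for each time-frequency unit where the target is stronger, else 0.

    Both inputs are lists of rows (frames), each row a list of per-channel energies.
    """
    return [
        [1 if t > i else 0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_energy, interferer_energy)
    ]


def apply_mask(mixture, mask):
    """Keep mixture units where the mask is 1 and zero out the rest."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture)
    ]


# Toy energies: two frames by two frequency channels for each talker.
talker_a = [[4.0, 1.0], [0.5, 3.0]]
talker_b = [[1.0, 2.0], [2.0, 0.5]]

# Assumption for illustration: energies add in the mixture.
mixture = [
    [a + b for a, b in zip(a_row, b_row)]
    for a_row, b_row in zip(talker_a, talker_b)
]

mask_a = binary_mask(talker_a, talker_b)
estimate_a = apply_mask(mixture, mask_a)
```

Each mixture unit goes entirely to one talker, which is what distinguishes a binary mask from a ratio mask. A CASA system has to estimate the mask from the mixture alone, for example from pitch and onset/offset cues, because the individual talker energies are not available.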

Post-Processing for the Mask of Computational Auditory Scene Analysis in Monaural Speech Segregation

Computación y Sistemas
DOI: 10.13053/cys-21-4-2846
2018, Vol 21 (4)
Author(s): Wen-Hsing Lai, Cheng-Jia Yang, Siou-Lin Wang
Keyword(s): Auditory Scene Analysis, Scene Analysis, Post Processing, Computational Auditory Scene Analysis, Speech Segregation, Auditory Scene

Sound-environment monitoring technique based on computational auditory scene analysis

2017 25th European Signal Processing Conference (EUSIPCO)
DOI: 10.23919/eusipco.2017.8081664
2017, Cited By ~ 4
Author(s): Mitsuru Kawamoto
Keyword(s): Auditory Scene Analysis, Scene Analysis, Environment Monitoring, Computational Auditory Scene Analysis, Monitoring Technique, Auditory Scene, Sound Environment
