Nexus DNN for Speech and Speaker Recognition

Over the years, considerable effort has gone into improving the accuracy of automatic speech recognition (ASR) and speaker recognition (SRE), and many different technologies have been developed. Given the close relationship between these two tasks, researchers have proposed various ways to transfer techniques developed for one task to the other. In this paper, an open-source experimental framework is proposed for speech and speaker recognition. A unified model, Nexus-DNN, is then developed and trained jointly for both tasks. Experimental results show that the combined model can effectively perform both ASR and SRE.


Generating Robust Audio Adversarial Examples with Temporal Dependency

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/438 ◽

2020 ◽

Author(s):

Hongting Zhang ◽

Pan Zhou ◽

Qiben Yan ◽

Xiao-Yang Liu

Keyword(s):

Speech Recognition ◽

Automatic Speech Recognition ◽

Defense Mechanisms ◽

User Study ◽

State Of The Art ◽

Temporal Structure ◽

Human Perception ◽

Experimental Results ◽

Low Intensity ◽

Adversarial Examples

Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually contain noticeable noise, especially during silences and pauses. Moreover, the added noise often breaks the temporal dependency of the original audio, and state-of-the-art defense mechanisms can easily detect this. In this paper, we propose a new Iterative Proportional Clipping (IPC) algorithm that preserves temporal dependency in audio to generate more robust adversarial examples. We are motivated by the observation that temporal dependency in audio has a significant effect on human perception. Following this observation, we use a proportional clipping strategy to reduce noise during low-intensity periods. Experimental results and a user study both suggest that the generated adversarial examples significantly reduce human-perceptible noise and resist defenses based on temporal structure.
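The proportional clipping idea described in the abstract can be illustrated with a minimal sketch. It assumes that each perturbation sample is bounded by a fixed fraction of the magnitude of the corresponding original sample, so silent or quiet stretches receive little or no noise. The function name `proportional_clip` and the parameter `ratio` are hypothetical, and the sketch covers only a single clipping step, not the iterative optimization or the authors' actual implementation.

```python
def proportional_clip(original, perturbation, ratio):
    """Clip each perturbation sample to +/- ratio * |original sample|.

    Assumption for illustration: the allowed noise scales with the local
    amplitude of the clean audio, so silence (amplitude 0) stays untouched.
    """
    clipped = []
    for x, d in zip(original, perturbation):
        bound = ratio * abs(x)  # per-sample noise budget
        clipped.append(max(-bound, min(bound, d)))
    return clipped


# Toy waveform: one silent sample, two loud samples, one quiet sample.
clean = [0.0, 0.5, -1.0, 0.02]
noise = [0.3, 0.3, -0.3, -0.3]
adversarial = [x + d for x, d in zip(clean, proportional_clip(clean, noise, 0.1))]
```

In an iterative attack, a step like this would typically be applied after every gradient update so the perturbation never exceeds the amplitude-dependent bound.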


Multitask Learning with Local Attention for Tibetan Speech Recognition

Complexity ◽

10.1155/2020/8894566 ◽

2020 ◽

Vol 2020 ◽

pp. 1-10

Author(s):

Hui Wang ◽

Fei Gao ◽

Yue Zhao ◽

Li Yang ◽

Jianjian Yue ◽

...

Keyword(s):

Speech Recognition ◽

Speaker Recognition ◽

Multitask Learning ◽

Experimental Results ◽

Context Information ◽

Accuracy Rate ◽

Baseline Model ◽

Content Recognition ◽

Tibetan Dialects ◽

Speech Content

In this paper, we propose incorporating local attention into WaveNet-CTC to improve the performance of Tibetan speech recognition in multitask learning. As the number of tasks increases, for example when Tibetan speech content recognition, dialect identification, and speaker recognition are performed simultaneously, the speech recognition accuracy of a single WaveNet-CTC model decreases. Inspired by the attention mechanism, we introduce local attention to automatically tune the weights of feature frames within a window and to attend differently to context information for multitask learning. The experimental results show that our method improves speech recognition accuracy for all Tibetan dialects in three-task learning compared with the baseline model. Furthermore, our method significantly improves accuracy for the low-resource dialect, by 5.11% over the dialect-specific model.
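The idea of weighting feature frames within a window can be sketched as follows. This is a generic local attention over a sliding window, assuming dot-product scores against the centre frame and a softmax over the window. The function name `local_attention` and the parameter `half_window` are hypothetical, and nothing here reproduces the WaveNet-CTC architecture or the scoring function used in the paper.

```python
import math


def local_attention(frames, half_window):
    """Replace each frame with a softmax-weighted average of its neighbours.

    Assumption for illustration: the score of a neighbour is its dot product
    with the centre frame; the window spans half_window frames on each side.
    """
    n = len(frames)
    output = []
    for t in range(n):
        lo, hi = max(0, t - half_window), min(n, t + half_window + 1)
        window = frames[lo:hi]
        scores = [sum(a * b for a, b in zip(frames[t], f)) for f in window]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(frames[t])
        output.append(
            [sum(w * f[k] for w, f in zip(weights, window)) for k in range(dim)]
        )
    return output


# Three 2-dimensional feature frames, attended with one neighbour per side.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = local_attention(features, half_window=1)
```

Restricting attention to a window keeps the cost linear in the sequence length, which is one common reason to prefer it over attention across the whole utterance.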


Building an Open Source Automatic Speech Recognition System for Catalan

10.21437/iberspeech.2018-6 ◽

2018 ◽

Author(s):

Baybars Külebi ◽

Alp Öktem

Keyword(s):

Speech Recognition ◽

Open Source ◽

Automatic Speech Recognition ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System


Evaluating Open-source Toolkits for Automatic Speech Recognition of South African Languages

2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA) ◽

10.1109/robomech.2019.8704774 ◽

2019 ◽

Author(s):

Ashentha Naidoo ◽

Mohohlo Tsoeu

Keyword(s):

Speech Recognition ◽

Open Source ◽

South African ◽

Automatic Speech Recognition ◽

African Languages


The Psychoacoustics of Automatic Speech Recognition

10.1101/2021.04.19.440438 ◽

2021 ◽

Author(s):

Lotte Weerts ◽

Claudia Clopath ◽

Dan F. M. Goodman

Keyword(s):

Fine Structure ◽

Speech Recognition ◽

Open Source ◽

Auditory System ◽

Automatic Speech Recognition ◽

Qualitative Agreement ◽

State Of The Art ◽

Temporal Fine Structure ◽

Candidate Model ◽

Spectral Invariance

Automatic speech recognition (ASR) software has been suggested as a candidate model of the human auditory system thanks to dramatic improvements in performance in recent years. To test this hypothesis, we compared several state-of-the-art ASR systems to results from humans on a barrage of standard psychoacoustic experiments. While some systems showed qualitative agreement with humans in some tests, in others all tested systems diverged markedly from humans. In particular, none of the models used spectral invariance, temporal fine structure or speech periodicity in a similar way to humans. We conclude that none of the tested ASR systems are yet ready to act as a strong proxy for human speech recognition. However, we note that the more recent systems with better performance also tend to better match human results, suggesting that continued cross-fertilisation of ideas between human and automatic speech recognition may be fruitful. Our software is released as an open-source toolbox to allow researchers to assess future ASR systems or add additional psychoacoustic measures.
