Evaluation of the effectiveness and efficiency of state-of-the-art features and models for automatic speech recognition error detection

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Asmaa El Hannani ◽  
Rahhal Errattahi ◽  
Fatima Zahra Salmam ◽  
Thomas Hain ◽  
Hassan Ouahmane

Abstract: Speech-based human-machine interaction and natural language understanding applications have seen rapid development and wide adoption over the last few decades. This has led to a proliferation of studies investigating error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation protocols are used, making direct comparisons of the proposed approaches (e.g. features and models) difficult. In this paper we perform an extensive evaluation of the effectiveness and efficiency of state-of-the-art approaches in a unified framework for both error detection and error type classification. We make three primary contributions: (1) we compare our Variant Recurrent Neural Network (V-RNN) model with three other state-of-the-art neural models, and show that the V-RNN model is the most effective classifier for ASR error detection in terms of accuracy and speed; (2) we compare four feature settings, corresponding to different categories of predictor features, and show that the generic features are particularly suitable for real-time ASR error detection applications; and (3) we examine the generalization ability of our error detection framework and perform a detailed post-detection analysis to identify the recognition errors that are difficult to detect.
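The abstract frames ASR error detection as deciding, for each hypothesis word, whether it was recognized correctly. A minimal sketch of that framing is below; the feature names (`conf`, `dur`, `lm`), the hand-set weights, and the logistic scoring are all illustrative assumptions — the paper's V-RNN additionally models left and right context with a recurrent layer.

```python
# Illustrative sketch: ASR error detection as per-word binary classification
# over "generic" predictor features (confidence, duration, LM score).
# Weights and feature names are hypothetical, chosen only for the demo.
import math

def detect_errors(words, weights=(-6.0, 0.8, 0.2), bias=4.0, threshold=0.5):
    """Label each hypothesis word as an ASR error (True) or correct (False)."""
    labels = []
    for w in words:
        # Linear score over (confidence, duration_s, lm_score), then sigmoid.
        z = bias + sum(wt * f for wt, f in zip(weights, (w["conf"], w["dur"], w["lm"])))
        p_error = 1.0 / (1.0 + math.exp(-z))
        labels.append(p_error > threshold)
    return labels

hyp = [
    {"word": "the", "conf": 0.97, "dur": 0.12, "lm": -1.2},
    {"word": "cat", "conf": 0.41, "dur": 0.30, "lm": -5.7},  # low confidence
]
print(detect_errors(hyp))  # → [False, True]
```

The "generic features" the paper favors for real-time use are exactly the kind of per-word quantities above, which are available from the decoder without extra model passes.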

Author(s):  
Alexandru-Lucian Georgescu ◽  
Alessandro Pappalardo ◽  
Horia Cucu ◽  
Michaela Blott

Abstract: The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using a single deep neural network (DNN). Transcription accuracy has greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints on computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to help decision makers choose the system that best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.


Author(s):  
Hongting Zhang ◽  
Pan Zhou ◽  
Qiben Yan ◽  
Xiao-Yang Liu

Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually incorporate noticeable noise, especially during silences and pauses. Moreover, the added noise often breaks the temporal dependency property of the original audio, which can be easily detected by state-of-the-art defense mechanisms. In this paper, we propose a new Iterative Proportional Clipping (IPC) algorithm that preserves temporal dependency in audio to generate more robust adversarial examples. We are motivated by the observation that temporal dependency in audio has a significant effect on human perception. Following this observation, we leverage a proportional clipping strategy to reduce noise during low-intensity periods. Experimental results and a user study both suggest that the generated adversarial examples significantly reduce human-perceptible noise and resist defenses based on temporal structure.
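The core clipping idea described above can be sketched in a few lines: bound the perturbation at each sample to a fraction of the original signal's magnitude, so near-silent regions receive near-zero noise. The function name and the clipping ratio are assumptions for illustration; the paper's full IPC algorithm iterates this step inside the adversarial-example optimization loop.

```python
# Sketch of proportional clipping (the single step behind IPC; names and the
# ratio value are assumptions, not the paper's exact formulation).
import numpy as np

def proportional_clip(original, adversarial, ratio=0.1):
    """Bound the per-sample perturbation to ratio * |original|, so silent or
    low-intensity samples stay almost unchanged."""
    delta = adversarial - original
    bound = ratio * np.abs(original)
    return original + np.clip(delta, -bound, bound)

x = np.array([0.0, 0.01, 0.5, -0.8])            # original audio samples
x_adv = x + np.array([0.05, 0.05, 0.05, 0.05])  # raw uniform perturbation
print(proportional_clip(x, x_adv))
```

Because the bound scales with signal intensity, the uniform 0.05 perturbation is zeroed out on the silent first sample and nearly eliminated on the quiet second one, which is exactly the "low-intensity periods" behavior the abstract describes.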


2019 ◽  
Vol 8 (4) ◽  
pp. 1357-1360

This paper focuses on automatic sarcasm detection, which is crucial for sentiment analysis. The rapid development of automatic speech recognition and text mining, together with the large amount of available voice and text data, opens a broader path for researchers to devise new methods and improve the accuracy of automatic sarcasm detection. We survey the approaches that have been used to detect sarcasm and the kinds of data and features involved, including the use of context to improve accuracy. We found that some contextual cues are not reliable without the presence of other context, and that some approaches are highly dependent on the dataset. Twitter is used by researchers as the main source for sentiment analysis; we note that in some respects it is still flawed, because many methods depend on Twitter-specific features, such as hashtags and author history, that are not found in other text data. Furthermore, the small amount of research on automatic sarcasm detection from acoustic data, and on its correlation with textual data, presents a new opportunity for sarcasm detection in speech: from acoustic data we can extract both acoustic and textual features, so sarcasm detection from voice has the potential to reach higher accuracy. By describing each beneficial method, this paper offers a brief guide to sarcasm detection through acoustic and textual data.


Author(s):  
Askars Salimbajevs

Automatic Speech Recognition (ASR) requires huge amounts of real user speech data to reach state-of-the-art performance. However, speech data conveys sensitive speaker attributes, such as identity, that can be inferred and exploited for malicious purposes. Therefore, there is interest in collecting anonymized speech data that has been processed by a voice conversion method. In this paper, we evaluate one voice conversion method on Latvian speech data and investigate whether privacy-transformed data can be used to improve ASR acoustic models. Results show the effectiveness of voice conversion against state-of-the-art speaker verification models on Latvian speech, and the effectiveness of using privacy-transformed data in ASR training.


Tradterm ◽  
2018 ◽  
Vol 32 ◽  
pp. 9-31
Author(s):  
Luis Eduardo Schild Ortiz ◽  
Patrizia Cavallo

In recent years, several studies have indicated that interpreters resist adopting new technologies. Yet such technologies have enabled the development of several tools to help these professionals. In this paper, using bibliographic and documentary research, we briefly analyse the tools cited by several authors to identify which remain up to date and available on the market. We then present concepts related to automation and examine the use of automatic speech recognition (ASR), analysing its potential benefits and the current maturity of the approach, especially regarding Computer-Assisted Interpreting (CAI) tools. The goal of this paper is to offer the community of interpreters and researchers a view of the state of the art in technology for interpreting, as well as some future perspectives for the area.

