Audio-visual automatic speech recognition and related bimodal speech technologies: A review of the state-of-the-art and open problems

Author(s):  
Gerasimos Potamianos

Tradterm ◽  
2018 ◽  
Vol 32 ◽  
pp. 9-31
Author(s):  
Luis Eduardo Schild Ortiz ◽  
Patrizia Cavallo

In recent years, several studies have indicated that interpreters resist adopting new technologies. Yet such technologies have enabled the development of several tools to help those professionals. In this paper, using bibliographical and documentary research, we briefly analyse the tools cited by several authors to identify which ones remain up to date and available on the market. We then present key concepts of automation and examine the use of automatic speech recognition (ASR), analysing its potential benefits and its current level of maturity, especially with regard to Computer-Assisted Interpreting (CAI) tools. The goal of this paper is to offer the community of interpreters and researchers a view of the state of the art in technology for interpreting, as well as some future perspectives for this area.


Author(s):  
Danny Henry Galatang ◽  
Suyanto Suyanto

Syllable-based automatic speech recognition (ASR) systems commonly perform better than phoneme-based ones. This paper focuses on developing an Indonesian monosyllable-based ASR (MSASR) system using an ASR engine called SPRAAK and comparing it to a phoneme-based one. The Mozilla DeepSpeech-based end-to-end ASR (MDS-E2EASR), one of the state-of-the-art models based on characters (similar to the phoneme-based model), is also investigated to confirm the result. In addition, a novel Kaituoxu SpeechTransformer (KST) E2EASR is examined. Testing on an Indonesian speech corpus of 5,439 words shows that the proposed MSASR produces much higher word accuracy (76.57%) than the monophone-based one (63.36%). Its performance is comparable to the character-based MDS-E2EASR, which produces 76.90%, and the character-based KST-E2EASR (78.00%). In future work, this monosyllable-based ASR could be extended to a bisyllable-based one for higher word accuracy, although the much larger set of bisyllable acoustic models would have to be handled with a more advanced method.
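The word-accuracy figures compared above are derived from the Levenshtein alignment between reference and hypothesis transcripts. A minimal sketch of the computation, assuming the common definition "word accuracy = 1 − WER" (the function names are illustrative, not from the paper):

```python
def word_error_count(ref, hyp):
    """Levenshtein distance between two word sequences
    (substitutions, deletions, and insertions each cost 1)."""
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal (previous row, previous column)
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,            # deletion of a reference word
                d[j - 1] + 1,        # insertion of a hypothesis word
                prev + (r != h),     # match or substitution
            )
    return d[len(hyp)]


def word_accuracy(reference, hypothesis):
    """Word accuracy as 1 - WER, with WER normalized by reference length."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return 1.0 - word_error_count(ref_words, hyp_words) / len(ref_words)
```

For example, `word_accuracy("the cat sat", "the cat sit")` yields 2/3, since one of three reference words is misrecognized.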


Author(s):  
Andrew Rosenberg ◽  
Mark Hasegawa-Johnson

Automatic prosody labelling is a useful front-end for automatic speech recognition, for automatic speech understanding, and for the development of corpora used to create speech synthesizers. Automatic labelling of prosody has also proven to be quite useful in the linguistic analysis of new speaking styles in a known language. This chapter provides a survey of the state-of-the-art best practices and open questions in the automatic labelling of prosodic information and its assessment. It describes the major prosodic inventories that are used in prosody labelling. It then discusses the relevance of acoustics and syntax in automatic labelling. A brief description of AuToBI, a tool that performs automatic ToBI labelling of US English, is provided. The chapter concludes by discussing methods of evaluating automatic prosody labelling.


Author(s):  
R. W. A. Scarr ◽  
W. Bezdel

The ‘state of the art’ in speech recognition is reviewed with particular reference to the kinds of problems that are likely to arise in a parcel sorting environment. Speech recognition equipment developed by the authors is described. To justify speech recognition equipment for parcel sorting, it must be shown to increase productivity. Simulations relevant to voice control of parcel sorting have been carried out to assess what this improvement might be, and the results are discussed.


Author(s):  
Alexandru-Lucian Georgescu ◽  
Alessandro Pappalardo ◽  
Horia Cucu ◽  
Michaela Blott

The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using a single deep neural network (DNN). Transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to hard constraints on computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, helping decision makers choose the system that best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Asmaa El Hannani ◽  
Rahhal Errattahi ◽  
Fatima Zahra Salmam ◽  
Thomas Hain ◽  
Hassan Ouahmane

Speech-based human-machine interaction and natural language understanding applications have seen rapid development and wide adoption over the last few decades. This has led to a proliferation of studies that investigate error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation protocols are used, making direct comparisons of the proposed approaches (e.g. features and models) difficult. In this paper we perform an extensive evaluation of the effectiveness and efficiency of state-of-the-art approaches in a unified framework for both error detection and error type classification. We make three primary contributions: (1) we compare our Variant Recurrent Neural Network (V-RNN) model with three other state-of-the-art neural models, and show that the V-RNN model is the most effective classifier for ASR error detection in terms of accuracy and speed; (2) we compare four feature settings, corresponding to different categories of predictor features, and show that the generic features are particularly suitable for real-time ASR error detection applications; and (3) we examine the generalization ability of our error detection framework and perform a detailed post-detection analysis to identify the recognition errors that are hardest to detect.
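Ground-truth labels for ASR error detection of this kind are typically obtained by aligning the hypothesis transcript against the reference and marking each hypothesis word as correct, a substitution, or an insertion (deletions consume reference words only, so they leave no hypothesis token to label). A minimal sketch of such labelling using a standard sequence alignment, not the paper's own tooling; the function name is illustrative:

```python
from difflib import SequenceMatcher


def label_hypothesis(ref_words, hyp_words):
    """Tag each hypothesis word as 'C' (correct), 'S' (substitution),
    or 'I' (insertion) based on alignment with the reference."""
    labels = []
    sm = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            labels += ["C"] * (j2 - j1)
        elif op == "replace":
            labels += ["S"] * (j2 - j1)
        elif op == "insert":
            labels += ["I"] * (j2 - j1)
        # op == "delete": a reference word was dropped; nothing to tag
    return labels
```

For example, aligning the reference "the cat sat" with the hypothesis "the cat sit down" labels the hypothesis words C, C, S, S. These per-word labels are the targets an error-detection classifier such as an RNN is trained to predict.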


2013 ◽  
Vol 15 (03) ◽  
pp. 1340015 ◽  
Author(s):  
VITO FRAGNELLI ◽  
STEFANO GAGLIARDO

Location problems describe situations in which one or more facilities have to be placed in a region so as to optimize a suitable objective function. Game theory has been used as a tool to solve location problems, and this paper describes the state of the art of research on location problems through the tools of game theory. Particular attention is given to the problems that remain open in the field of cooperative location game theory.


2013 ◽  
Vol 15 (02) ◽  
pp. 1340006 ◽  
Author(s):  
MICHELA CHESSA ◽  
VITO FRAGNELLI

The issue of veto can play an important role in approval situations, mainly in political science, where several scholars have dealt with this topic. In this survey we update the state of the art, paying particular attention to the open problems that various authors have pointed out in their research fields.


Author(s):  
Hongting Zhang ◽  
Pan Zhou ◽  
Qiben Yan ◽  
Xiao-Yang Liu

Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually incorporate noticeable noise, especially during periods of silence and pauses. Moreover, the added noise often breaks the temporal dependency of the original audio, which can be easily detected by state-of-the-art defense mechanisms. In this paper, we propose a new Iterative Proportional Clipping (IPC) algorithm that preserves temporal dependency in audio for generating more robust adversarial examples. We are motivated by the observation that the temporal dependency in audio has a significant effect on human perception. Following this observation, we leverage a proportional clipping strategy to reduce noise during the low-intensity periods. Experimental results and a user study both suggest that the generated adversarial examples can significantly reduce human-perceptible noise and resist defenses based on the temporal structure.
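The core of the proportional clipping idea can be sketched as bounding the adversarial perturbation, sample by sample, to a fraction of the original signal's magnitude, so that near-silent regions admit almost no added noise. A minimal illustrative sketch, not the authors' implementation; `epsilon` is an assumed hyperparameter controlling the per-sample noise budget:

```python
import numpy as np


def proportional_clip(original, perturbation, epsilon=0.05):
    """Clip each perturbation sample to +/- epsilon * |original sample|,
    so low-intensity (near-silent) regions receive almost no noise."""
    bound = epsilon * np.abs(original)
    return np.clip(perturbation, -bound, bound)
```

In an iterative attack, each gradient-based update of the perturbation would be passed through such a clip before being added to the waveform, keeping the adversarial audio's envelope proportional to the original and thus preserving its temporal structure.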

