Audio-visual automatic speech recognition and related bimodal speech technologies: A review of the state-of-the-art and open problems

Author(s):  
Gerasimos Potamianos

Tradterm ◽  
2018 ◽  
Vol 32 ◽  
pp. 9-31
Author(s):  
Luis Eduardo Schild Ortiz ◽  
Patrizia Cavallo

In recent years, several studies have indicated that interpreters resist adopting new technologies. Yet such technologies have enabled the development of several tools to help those professionals. In this paper, using bibliographical and documentary research, we briefly analyse the tools cited by several authors to identify which ones remain up to date and available on the market. We then present key concepts of automation and examine the use of automatic speech recognition (ASR), analysing its potential benefits and its current level of maturity, especially with regard to Computer-Assisted Interpreting (CAI) tools. The goal of this paper is to offer the community of interpreters and researchers a view of the state of the art in technology for interpreting, as well as some future perspectives for this area.


Author(s):  
Danny Henry Galatang ◽  
Suyanto Suyanto

Syllable-based automatic speech recognition (ASR) systems commonly perform better than phoneme-based ones. This paper focuses on developing an Indonesian monosyllable-based ASR (MSASR) system using an ASR engine called SPRAAK and comparing it to a phoneme-based one. The Mozilla DeepSpeech-based end-to-end ASR (MDS-E2EASR), one of the state-of-the-art models based on characters (similar to the phoneme-based model), is also investigated to confirm the result. In addition, a novel Kaituoxu SpeechTransformer (KST) E2EASR is examined. Testing on an Indonesian speech corpus of 5,439 words shows that the proposed MSASR produces much higher word accuracy (76.57%) than the monophone-based one (63.36%). Its performance is comparable to the character-based MDS-E2EASR, which produces 76.90%, and the character-based KST-E2EASR (78.00%). In future work, this monosyllable-based ASR could be extended to a bisyllable-based one for higher word accuracy, although the much larger set of bisyllable acoustic models would have to be handled with a more advanced method.
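The word-accuracy figures compared above are derived from the Levenshtein alignment between reference and hypothesis transcripts. A minimal sketch of the computation, assuming the common definition "word accuracy = 1 − WER" (the function names are illustrative, not from the paper):

```python
def word_error_count(ref, hyp):
    """Levenshtein distance between two word sequences
    (substitutions, deletions, and insertions each cost 1)."""
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal (previous row, previous column)
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,            # deletion of a reference word
                d[j - 1] + 1,        # insertion of a hypothesis word
                prev + (r != h),     # match or substitution
            )
    return d[len(hyp)]


def word_accuracy(reference, hypothesis):
    """Word accuracy as 1 - WER, with WER normalized by reference length."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return 1.0 - word_error_count(ref_words, hyp_words) / len(ref_words)
```

For example, `word_accuracy("the cat sat", "the cat sit")` yields 2/3, since one of three reference words is misrecognized.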


Author(s):  
Andrew Rosenberg ◽  
Mark Hasegawa-Johnson

Automatic prosody labelling is a useful front-end for automatic speech recognition, for automatic speech understanding, and for the development of corpora used to create speech synthesizers. Automatic labelling of prosody has also proven to be quite useful in the linguistic analysis of new speaking styles in a known language. This chapter provides a survey of the state-of-the-art best practices and open questions in the automatic labelling of prosodic information and its assessment. It describes the major prosodic inventories that are used in prosody labelling. It then discusses the relevance of acoustics and syntax in automatic labelling. A brief description of AuToBI, a tool that performs automatic ToBI labelling of US English, is provided. The chapter concludes by discussing methods of evaluating automatic prosody labelling.


Author(s):  
R. W. A. Scarr ◽  
W. Bezdel

The ‘state of the art’ in speech recognition is reviewed with particular reference to the kinds of problems that are likely to arise in a parcel sorting environment. Speech recognition equipment developed by the authors is described. To justify speech recognition equipment for parcel sorting, it must be shown to increase productivity. Simulations relevant to voice control of parcel sorting have been carried out to assess what this improvement might be, and the results are discussed.


Author(s):  
Alexandru-Lucian Georgescu ◽  
Alessandro Pappalardo ◽  
Horia Cucu ◽  
Michaela Blott

The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using a single deep neural network (DNN). Transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to hard constraints on computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, helping decision makers choose the system that best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Asmaa El Hannani ◽  
Rahhal Errattahi ◽  
Fatima Zahra Salmam ◽  
Thomas Hain ◽  
Hassan Ouahmane

Speech-based human-machine interaction and natural language understanding applications have seen rapid development and wide adoption over the last few decades. This has led to a proliferation of studies that investigate error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation protocols are used, making direct comparisons of the proposed approaches (e.g. features and models) difficult. In this paper we perform an extensive evaluation of the effectiveness and efficiency of state-of-the-art approaches in a unified framework for both error detection and error type classification. We make three primary contributions: (1) we compare our Variant Recurrent Neural Network (V-RNN) model with three other state-of-the-art neural models, and show that the V-RNN model is the most effective classifier for ASR error detection in terms of accuracy and speed; (2) we compare four feature settings, corresponding to different categories of predictor features, and show that the generic features are particularly suitable for real-time ASR error detection applications; and (3) we examine the generalization ability of our error detection framework and perform a detailed post-detection analysis to identify the recognition errors that are hardest to detect.
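Ground-truth labels for ASR error detection of this kind are typically obtained by aligning the hypothesis transcript against the reference and marking each hypothesis word as correct, a substitution, or an insertion (deletions consume reference words only, so they leave no hypothesis token to label). A minimal sketch of such labelling using a standard sequence alignment, not the paper's own tooling; the function name is illustrative:

```python
from difflib import SequenceMatcher


def label_hypothesis(ref_words, hyp_words):
    """Tag each hypothesis word as 'C' (correct), 'S' (substitution),
    or 'I' (insertion) based on alignment with the reference."""
    labels = []
    sm = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            labels += ["C"] * (j2 - j1)
        elif op == "replace":
            labels += ["S"] * (j2 - j1)
        elif op == "insert":
            labels += ["I"] * (j2 - j1)
        # op == "delete": a reference word was dropped; nothing to tag
    return labels
```

For example, aligning the reference "the cat sat" with the hypothesis "the cat sit down" labels the hypothesis words C, C, S, S. These per-word labels are the targets an error-detection classifier such as an RNN is trained to predict.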


2013 ◽  
Vol 15 (03) ◽  
pp. 1340015 ◽  
Author(s):  
VITO FRAGNELLI ◽  
STEFANO GAGLIARDO

Location problems describe situations in which one or more facilities have to be placed in a region so as to optimize a suitable objective function. Game theory has been used as a tool to solve location problems, and this paper describes the state of the art of research on location problems through the tools of game theory. Particular attention is given to the problems that remain open in the field of cooperative location game theory.


2013 ◽  
Vol 15 (02) ◽  
pp. 1340006 ◽  
Author(s):  
MICHELA CHESSA ◽  
VITO FRAGNELLI

The issue of veto can play an important role in approval situations, mainly in political science, where several scholars have dealt with this topic. In this survey we update the state of the art, paying particular attention to the open problems that various authors have pointed out in their research fields.


Author(s):  
Hongting Zhang ◽  
Pan Zhou ◽  
Qiben Yan ◽  
Xiao-Yang Liu

Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually incorporate noticeable noise, especially during periods of silence and pauses. Moreover, the added noise often breaks the temporal dependency of the original audio, which can be easily detected by state-of-the-art defense mechanisms. In this paper, we propose a new Iterative Proportional Clipping (IPC) algorithm that preserves temporal dependency in audio for generating more robust adversarial examples. We are motivated by the observation that the temporal dependency in audio has a significant effect on human perception. Following this observation, we leverage a proportional clipping strategy to reduce noise during the low-intensity periods. Experimental results and a user study both suggest that the generated adversarial examples can significantly reduce human-perceptible noise and resist defenses based on the temporal structure.
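The core of the proportional clipping idea can be sketched as bounding the adversarial perturbation, sample by sample, to a fraction of the original signal's magnitude, so that near-silent regions admit almost no added noise. A minimal illustrative sketch, not the authors' implementation; `epsilon` is an assumed hyperparameter controlling the per-sample noise budget:

```python
import numpy as np


def proportional_clip(original, perturbation, epsilon=0.05):
    """Clip each perturbation sample to +/- epsilon * |original sample|,
    so low-intensity (near-silent) regions receive almost no noise."""
    bound = epsilon * np.abs(original)
    return np.clip(perturbation, -bound, bound)
```

In an iterative attack, each gradient-based update of the perturbation would be passed through such a clip before being added to the waveform, keeping the adversarial audio's envelope proportional to the original and thus preserving its temporal structure.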

