Session 11: Some Research Projects: Application of Automatic Speech Recognition to Parcel Sorting

Author(s):  
R. W. A. Scarr ◽  
W. Bezdel

The ‘state of the art’ in speech recognition is reviewed with particular reference to the kinds of problems likely to arise in a parcel sorting environment. Speech recognition equipment developed by the authors is described. To justify speech recognition equipment for parcel sorting, it must be shown to increase productivity. Simulations relevant to voice control of parcel sorting have been carried out to assess what this improvement might be, and the results are discussed.

Tradterm ◽  
2018 ◽  
Vol 32 ◽  
pp. 9-31
Author(s):  
Luis Eduardo Schild Ortiz ◽  
Patrizia Cavallo

In recent years, several studies have indicated that interpreters resist adopting new technologies. Yet such technologies have enabled the development of several tools to help these professionals. In this paper, using bibliographic and documentary research, we briefly analyse the tools cited by several authors to identify which remain up to date and available on the market. We then present concepts related to automation and examine the use of automatic speech recognition (ASR), analysing its potential benefits and the current maturity of the approach, especially with regard to Computer-Assisted Interpreting (CAI) tools. The goal of this paper is to offer the community of interpreters and researchers a view of the state of the art in technology for interpreting, as well as some future perspectives for the area.


Author(s):  
Danny Henry Galatang ◽  
Suyanto Suyanto
Syllable-based automatic speech recognition (ASR) systems commonly perform better than phoneme-based ones. This paper focuses on developing an Indonesian monosyllable-based ASR (MSASR) system using an ASR engine called SPRAAK and comparing it to a phoneme-based one. The Mozilla DeepSpeech-based end-to-end ASR (MDS-E2EASR), one of the state-of-the-art character-based models (similar to the phoneme-based model), is also investigated to confirm the result, and a novel Kaituoxu SpeechTransformer (KST) E2EASR is examined as well. Testing on an Indonesian speech corpus of 5,439 words shows that the proposed MSASR produces much higher word accuracy (76.57%) than the monophone-based one (63.36%). Its performance is comparable to the character-based MDS-E2EASR (76.90%) and the character-based KST-E2EASR (78.00%). In the future, this monosyllable-based ASR could be extended to a bisyllable-based one to achieve higher word accuracy; however, the resulting large set of bisyllable acoustic models would have to be handled with an advanced method.
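The word-accuracy figures quoted above are conventionally derived from the Levenshtein alignment between reference and hypothesis transcripts (word accuracy = 1 − word error rate). A minimal sketch of that computation, assuming whitespace-tokenized transcripts; this is illustrative only, not the authors' evaluation code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def word_accuracy(ref_sentence, hyp_sentence):
    """1 - WER, with WER = edit distance / reference length."""
    ref = ref_sentence.split()
    hyp = hyp_sentence.split()
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

For example, `word_accuracy("a b c d", "a b x d")` gives 0.75: one substitution against a four-word reference.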


Author(s):  
Andrew Rosenberg ◽  
Mark Hasegawa-Johnson

Automatic prosody labelling is a useful front-end for automatic speech recognition, for automatic speech understanding, and for the development of corpora used to create speech synthesizers. Automatic labelling of prosody has also proven to be quite useful in the linguistic analysis of new speaking styles in a known language. This chapter provides a survey of the state-of-the-art best practices and open questions in the automatic labelling of prosodic information and its assessment. It describes the major prosodic inventories that are used in prosody labelling. It then discusses the relevance of acoustics and syntax in automatic labelling. A brief description of AuToBI, a tool that performs automatic ToBI labelling of US English, is provided. The chapter concludes by discussing methods of evaluating automatic prosody labelling.


Author(s):  
Alexandru-Lucian Georgescu ◽  
Alessandro Pappalardo ◽  
Horia Cucu ◽  
Michaela Blott

Abstract: The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modelled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using a single deep neural network (DNN). Transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints on computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to help decision makers choose the system that best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Asmaa El Hannani ◽  
Rahhal Errattahi ◽  
Fatima Zahra Salmam ◽  
Thomas Hain ◽  
Hassan Ouahmane

Abstract: Speech-based human-machine interaction and natural language understanding applications have seen rapid development and wide adoption over the last few decades. This has led to a proliferation of studies investigating error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation protocols are used, making direct comparisons of the proposed approaches (e.g. features and models) difficult. In this paper we perform an extensive evaluation of the effectiveness and efficiency of state-of-the-art approaches in a unified framework for both error detection and error type classification. We make three primary contributions: (1) we compare our Variant Recurrent Neural Network (V-RNN) model with three other state-of-the-art neural models and show that the V-RNN model is the most effective classifier for ASR error detection in terms of accuracy and speed; (2) we compare four feature settings, corresponding to different categories of predictor features, and show that the generic features are particularly suitable for real-time ASR error detection applications; and (3) we examine the generalization ability of our error detection framework and perform a detailed post-detection analysis to identify the recognition errors that are difficult to detect.
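Ground-truth labels for ASR error detection are commonly obtained by aligning the hypothesis against the reference transcript and marking unmatched hypothesis words as errors. A sketch of that labelling step using the standard library's `difflib`; this is a generic illustration, not the paper's pipeline:

```python
import difflib

def label_errors(ref_words, hyp_words):
    """Return (word, is_error) for each hypothesis word, where a word is
    an error unless the alignment matches it to the reference exactly."""
    labels = [True] * len(hyp_words)  # assume erroneous until matched
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words)
    for block in matcher.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            labels[j] = False  # aligned to an identical reference word
    return list(zip(hyp_words, labels))
```

For instance, aligning the hypothesis "the bat sat" to the reference "the cat sat" flags only "bat" as an error; a classifier such as the V-RNN described above would then be trained to predict these labels from predictor features.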


Author(s):  
Jarne R. Verpoorten ◽  
Michèle Auglaire ◽  
Frank Bertels

During a hypothetical Severe Accident (SA), core damage is to be expected due to insufficient core cooling. If the lack of core cooling persists, degradation of the core can continue and could lead to the presence of corium in the lower plenum. There, the thermo-mechanical attack of the lower head by the corium could eventually lead to vessel failure and release of corium to the reactor cavity pit. This paper describes how international state-of-the-art knowledge has been applied, in combination with plant-specific data, to obtain a custom Severe Accident Management (SAM) approach and hardware adaptations for existing NPPs. The interest of Tractebel Engineering in future SA research projects related to this topic is also addressed, from the viewpoint of keeping the analysis up to date with state-of-the-art knowledge.


Author(s):  
Hongting Zhang ◽  
Pan Zhou ◽  
Qiben Yan ◽  
Xiao-Yang Liu

Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually incorporate noticeable noise, especially during periods of silence and pauses. Moreover, the added noise often breaks the temporal dependency property of the original audio, which can easily be detected by state-of-the-art defense mechanisms. In this paper, we propose a new Iterative Proportional Clipping (IPC) algorithm that preserves temporal dependency in audio to generate more robust adversarial examples. We are motivated by the observation that temporal dependency in audio has a significant effect on human perception. Following this observation, we leverage a proportional clipping strategy to reduce noise during low-intensity periods. Experimental results and a user study both suggest that the generated adversarial examples significantly reduce human-perceptible noise and resist defenses based on temporal structure.
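The proportional-clipping idea described above can be sketched as bounding the adversarial perturbation at each sample in proportion to the original signal's magnitude, so low-intensity periods (silences, pauses) receive little or no added noise. This is an illustrative sketch only, not the authors' IPC implementation; `epsilon` is an assumed scaling hyperparameter:

```python
import numpy as np

def proportional_clip(perturbation, signal, epsilon=0.1):
    """Clip each perturbation sample to +/- epsilon * |corresponding
    signal sample|, so silent samples admit zero perturbation."""
    bound = epsilon * np.abs(signal)
    return np.clip(perturbation, -bound, bound)
```

Applied iteratively between gradient-based attack steps, such a clip keeps the perturbation's envelope proportional to the audio's own envelope, which is the property the temporal-dependency defenses test for.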


Author(s):  
Christian PADILLA-NAVARRO ◽  
Carlos ZARATE-TREJO ◽  
Georges KHALAF ◽  
Pascal FALLAVOLLITA

Alexithymia is a condition that partially or completely deprives an individual of the ability to identify and describe emotions and to show affective connotations in their actions. The problem has motivated various research projects that study its characteristics, forms of prevention, and implications, and that try to quantify an individual's experience of this construct as well as their responses to certain stimuli. Other studies reviewed here aimed to find a connection between the responses of subjects diagnosed with alexithymia when asked to recognize dynamic emotional facial expressions and their score on the Toronto Alexithymia Scale (TAS), a metric frequently used to evaluate the presence or absence of alexithymia. In this work, we present a review of the articles that study this connection, as well as articles describing the state of the art in artificial intelligence algorithms applied to the treatment or prevention of secondary alexithymia.

