word error rate
Recently Published Documents


TOTAL DOCUMENTS

98
(FIVE YEARS 48)

H-INDEX

9
(FIVE YEARS 2)

Author(s):  
Olov Engwall ◽  
José Lopes ◽  
Ronald Cumbal

Abstract The large majority of previous work on human-robot conversations in a second language has been performed with a human Wizard of Oz. The reasons are that automatic speech recognition of non-native conversational speech is considered unreliable and that the dialogue management task of selecting robot utterances that are adequate at a given turn is complex in social conversations. This study therefore investigates whether robot-led conversation practice in a second language with pairs of adult learners could be managed by an autonomous robot. We first investigate how correct and understandable transcriptions of second language learner utterances are when made by a state-of-the-art speech recogniser. We find both a relatively high word error rate (41%) and that a substantial share (42%) of the utterances are judged incomprehensible or only partially understandable by a human reader. We then evaluate how adequate the robot utterance selection is when performed manually, based on the speech recognition transcriptions, or autonomously, using (a) predefined sequences of robot utterances, (b) a general state-of-the-art language model that selects utterances based on the learner input or the preceding robot utterance, or (c) a custom-made statistical method trained on observations of the wizard's choices in previous conversations. It is shown that adequate or at least acceptable robot utterances are selected by the human wizard in most cases (96%), even though the ASR transcriptions have a high word error rate. Further, the custom-made statistical method performs as well as manual selection of robot utterances based on the ASR transcriptions. We also found that the robot's interaction strategy, which differed in how much the robot maintained the initiative in the conversation and whether the focus was on the robot or the learners, had marginal effects on the word error rate and understandability of the transcriptions, but larger effects on the adequacy of the utterance selection. Autonomous robot-led conversations may hence work better with some robot interaction strategies.
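
For reference, the word error rate cited throughout this page is the word-level edit distance between a reference transcript and the recogniser's hypothesis, divided by the number of reference words: WER = (substitutions + deletions + insertions) / N. A minimal Python sketch (function name and example strings are illustrative):

```python
# Minimal sketch: word error rate via word-level Levenshtein distance,
# WER = (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i would like to practice", "i wood like practice"))  # 0.4
```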


2022 ◽  
Vol 31 (1) ◽  
pp. 159-167
Author(s):  
Yijun Wu ◽  
Yonghong Qin

Abstract In order to improve the efficiency of English translation, machine translation is increasingly widely used. This study briefly introduces neural network algorithms for speech recognition. Long short-term memory (LSTM), instead of a traditional recurrent neural network (RNN), was used as the encoding algorithm in the encoder, with an RNN as the decoding algorithm in the decoder. Simulation experiments were then carried out on the machine translation algorithm, and it was compared with two other machine translation algorithms. The results showed that the back-propagation (BP) neural network had a lower word error rate and needed less recognition time than manual recognition; the LSTM–RNN algorithm had a lower word error rate than the BP–RNN and RNN–RNN algorithms on the test samples. In the actual speech translation test, as the length of the speech increased, the LSTM–RNN algorithm showed the smallest changes in translation score and word error rate, and it had the highest translation score and the lowest word error rate at every speech length.
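
The encoder-decoder split described above can be sketched in a few lines of PyTorch: an LSTM encodes the source sequence and a plain RNN decodes the target. This is a minimal illustration under assumed sizes; vocabulary, embedding, and hidden dimensions are not the paper's:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=8000, tgt_vocab=8000, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)  # LSTM encoder
        self.decoder = nn.RNN(emb, hidden, batch_first=True)   # RNN decoder
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, (h, _) = self.encoder(self.src_emb(src_ids))
        # Initialise the RNN decoder with the LSTM's final hidden state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq()
logits = model(torch.randint(0, 8000, (2, 12)), torch.randint(0, 8000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 8000])
```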


2021 ◽  
Vol 02 (02) ◽  
Author(s):  
Hasan Ali Gamal Al-Kaf ◽  
Muhammad Suhaimi Sulong ◽  
Ariffuddin Joret ◽  
Nuramin Fitri Aminuddin ◽  
...  

The recitation of Quran verses according to the actual tajweed is obligatory, and it must be accurate and precise in pronunciation. Hence, it should always be reviewed by an expert in Quran recitation. With current technology, this review can be implemented as an application system, which is particularly appropriate in the current Covid-19 pandemic, where online applications are in demand. In this empirical study, a recognition system called the Quranic Verse Recitation Recognition (QVR) system was developed, using PocketSphinx to convert recited Quranic verses from Arabic sound to Roman text and determine the accuracy of the reciters. The Graphical User Interface (GUI) of the system, with a user-friendly environment, was designed using Microsoft Visual Basic 6 on an Ubuntu platform. A verse of surah al-Ikhlas was chosen for this study, and the data were collected by recording 855 audio samples of professional reciters as training data. Another 105 audio samples were collected as testing data to test the accuracy of the system. The results indicate that the system obtained 100% accuracy with a 0.00% word error rate (WER) for both the training and testing data via Quran Roman text. The system, with its automatic speech recognition (ASR) engine, has been successfully designed and developed, and it merits further extension. In addition, it will be improved with the addition of other Quran surahs.
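
A decoding step of the kind the QVR system would need can be sketched with the classic pocketsphinx-python bindings. The model paths below are placeholders for an acoustic model, language model, and dictionary covering romanized Quranic transcriptions, not the paper's actual models:

```python
# Minimal sketch of offline decoding with PocketSphinx; paths and file
# names are illustrative placeholders, not the paper's trained models.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/quran_ar')     # acoustic model directory
config.set_string('-lm', 'model/quran.lm')      # n-gram language model
config.set_string('-dict', 'model/quran.dic')   # pronunciation dictionary
decoder = Decoder(config)

with open('recitation.raw', 'rb') as f:         # 16 kHz, 16-bit mono PCM
    decoder.start_utt()
    decoder.process_raw(f.read(), False, True)  # decode the full utterance
    decoder.end_utt()

print(decoder.hyp().hypstr)  # romanized hypothesis text
```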


2021 ◽  
Vol 13 (22) ◽  
pp. 12392
Author(s):  
Santosh Gondi ◽  
Vineel Pratap

Automatic speech recognition, the process of converting speech signals to text, has improved a great deal in the past decade thanks to deep learning based systems. With the latest transformer based models, the recognition accuracy, measured as word error rate (WER), is even below the human annotator error rate (4%). However, most of these advanced models run on big servers with large amounts of memory and CPU/GPU resources, and they have a huge carbon footprint. This server based ASR architecture is not viable in the long run, given the inherent lack of privacy for user data and the reliability and latency issues of the network connection. On-device ASR (that is, speech to text conversion on the edge device itself), on the other hand, fixes deep-rooted privacy issues while being more reliable and performant, since it avoids network connectivity to a back-end server. On-device ASR can also lead to a more sustainable solution by considering the energy vs. accuracy trade-off and choosing the right model for the specific use cases/applications of the product. Hence, in this paper we evaluate the energy-accuracy trade-off of ASR with a typical transformer based speech recognition model on an edge device. We ran evaluations on a Raspberry Pi with an off-the-shelf USB meter for measuring energy consumption. We conclude that, for CPU based ASR inference, energy consumption grows exponentially as the word error rate improves linearly. Additionally, based on our experiments we deduce that, with PyTorch mobile optimization and quantization, a typical transformer based ASR model on the edge performs reasonably well in terms of accuracy and latency, and comes close to the accuracy of server based inference.
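
The two deployment steps named above, dynamic quantization and PyTorch mobile optimization, can be sketched as follows. The model here is a tiny stand-in for a transformer ASR network (all sizes illustrative), not the model evaluated in the paper:

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Placeholder stand-in for a transformer acoustic model: maps 80-dim
# log-mel frames to token logits. Sizes are illustrative only.
class TinyASR(nn.Module):
    def __init__(self, feat=80, d_model=256, vocab=32):
        super().__init__()
        self.proj = nn.Linear(feat, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):  # x: (batch, frames, feat)
        return self.head(self.encoder(self.proj(x)))

model = TinyASR().eval()

# Dynamic quantization: int8 weights for linear layers, activations
# quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

# Trace to TorchScript, then apply mobile-specific graph optimizations.
example = torch.randn(1, 200, 80)
mobile = optimize_for_mobile(torch.jit.trace(quantized, example))
mobile._save_for_lite_interpreter("asr_mobile.ptl")
```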


2021 ◽  
pp. 1-13
Author(s):  
Hamzah A. Alsayadi ◽  
Abdelaziz A. Abdelhamid ◽  
Islam Hegazy ◽  
Zaki T. Fayed

The Arabic language has a set of sound marks called diacritics; these diacritics play an essential role in the meaning of words and their articulation. A change in some diacritics leads to a change in the meaning of the sentence. However, the presence of these marks in the corpus transcription affects the accuracy of speech recognition. In this paper, we investigate the effect of diacritics on Arabic speech recognition based on end-to-end deep learning. The applied end-to-end approach includes CNN-LSTM and an attention-based technique presented in the state-of-the-art framework Espresso, using PyTorch. In addition, to the best of our knowledge, the CNN-LSTM approach with attention has not been used for Arabic automatic speech recognition (ASR). To fill this gap, this paper proposes a new approach based on CNN-LSTM with an attention-based method for Arabic ASR. The language model in this approach is trained using RNN-LM and LSTM-LM on a nondiacritized transcription of the speech corpus. The Standard Arabic Single Speaker Corpus (SASSC), after omitting the diacritics, is used to train and test the deep learning model. Experimental results show that the removal of diacritics decreased the out-of-vocabulary rate and the perplexity of the language model. In addition, the word error rate (WER) is significantly improved when compared to diacritized data. The achieved average reduction in WER is 13.52%.
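
The nondiacritization step is a simple Unicode filter: the Arabic harakat occupy the contiguous range U+064B–U+0652. A minimal sketch (the example word is illustrative, not taken from SASSC):

```python
# Minimal sketch: strip Arabic diacritic marks (harakat) from a
# transcription before language-model training.
import re

DIACRITICS = re.compile(r'[\u064B-\u0652]')  # fathatan .. sukun

def remove_diacritics(text: str) -> str:
    return DIACRITICS.sub('', text)

print(remove_diacritics('كَتَبَ'))  # -> 'كتب'
```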


2021 ◽  
Author(s):  
Zhong Meng ◽  
Yu Wu ◽  
Naoyuki Kanda ◽  
Liang Lu ◽  
Xie Chen ◽  
...  

2021 ◽  
Author(s):  
Liang Lu ◽  
Zhong Meng ◽  
Naoyuki Kanda ◽  
Jinyu Li ◽  
Yifan Gong

2021 ◽  
Vol 9 (1) ◽  
pp. 1-15
Author(s):  
Vered Silber Varod ◽  
Ingo Siegert ◽  
Oliver Jokisch ◽  
Yamini Sinha ◽  
Nitza Geri

Despite the growing importance of Automatic Speech Recognition (ASR), its application is still challenging, limited, and language-dependent, and it requires considerable resources. The resources required for ASR are not only technical; they also need to reflect technological trends and cultural diversity. The purpose of this research is to explore ASR performance gaps through a comparative study of American English, German, and Hebrew. Apart from different languages, we also investigate different speaking styles: utterances from spontaneous dialogues and utterances from frontal lectures (a TED-like genre). The analysis includes a comparison of the performance of four ASR engines (Google Cloud, Google Search, IBM Watson, and WIT.ai) using four commonly used metrics: Word Error Rate (WER), Character Error Rate (CER), Word Information Lost (WIL), and Match Error Rate (MER). As expected, the findings suggest that English ASR systems provide the best results. Contrary to our hypothesis regarding ASR's low performance for under-resourced languages, we found that the Hebrew and German ASR systems have similar performance. Overall, our findings suggest that ASR performance is language-dependent and system-dependent. Furthermore, ASR may be genre-sensitive, as our results showed for German. This research contributes valuable insight for improving the ubiquitous global consumption and management of knowledge, and calls for corporate social responsibility of commercial companies to develop ASR under Fair, Reasonable, and Non-Discriminatory (FRAND) terms.
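
All four metrics used in this comparison are available in the open-source jiwer package; a minimal sketch with an illustrative sentence pair:

```python
# Minimal sketch computing WER, CER, MER, and WIL with jiwer; the
# reference/hypothesis pair is illustrative, not from the study's data.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))   # word error rate
print("CER:", jiwer.cer(reference, hypothesis))   # character error rate
print("MER:", jiwer.mer(reference, hypothesis))   # match error rate
print("WIL:", jiwer.wil(reference, hypothesis))   # word information lost
```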


2021 ◽  
Vol 35 (3) ◽  
pp. 235-242
Author(s):  
Vivek Bhardwaj ◽  
Vinay Kukreja ◽  
Amitoj Singh

Most automatic speech recognition (ASR) systems are trained on adult speech because of the limited availability of children's speech datasets. The speech recognition rate of such systems is much lower when they are tested on children's speech, due to the inter-speaker acoustic variabilities between adult and children's speech. These inter-speaker acoustic variabilities arise mainly from the higher pitch and lower speaking rate of children. Thus, the main objective of this work is to increase the speech recognition rate of the Punjabi-ASR system by reducing these inter-speaker acoustic variabilities with the help of prosody modification and speaker adaptive training. The pitch period and duration (speaking rate) of a speech signal can be altered with prosody modification without affecting the naturalness or message of the signal, which helps overcome the acoustic variations between adult and children's speech. The developed Punjabi-ASR system is trained with adult speech and prosody-modified adult speech. This prosody-modified speech reduces the need for large amounts of children's speech for training the ASR system and improves the recognition rate. Results show that prosody modification and speaker adaptive training minimize the word error rate (WER) of the Punjabi-ASR system to 8.79% when tested on children's speech.
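
The prosody-modification idea, raising the pitch and lowering the speaking rate of adult speech toward child-like values, can be sketched with librosa's phase-vocoder utilities. The shift and stretch factors are illustrative, and the paper's actual modification method may be a different (e.g. pitch-synchronous) algorithm:

```python
# Minimal sketch: modify the prosody of an adult utterance so that it
# better matches children's speech. Factors are illustrative only.
import librosa
import soundfile as sf

y, sr = librosa.load("adult_utterance.wav", sr=16000)

# Raise pitch by ~4 semitones (children speak at a higher pitch) ...
y_child = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
# ... and slow the signal down (rate < 1 lengthens the duration,
# i.e. lowers the speaking rate).
y_child = librosa.effects.time_stretch(y_child, rate=0.85)

sf.write("adult_prosody_modified.wav", y_child, sr)
```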


2021 ◽  
Author(s):  
Mahadeva Swamy ◽  
D J Ravi

Abstract An ASR system is built for continuous Kannada speech recognition. The acoustic and language models are created with the Kaldi toolkit. The speech database is created with native male and female Kannada speakers; 75% of the collected speech data is used for training the acoustic models and 25% for system testing. The performance of the system is presented in terms of word error rate (WER). Wavelet packet decomposition, along with a Mel filter bank, is used for feature extraction. The proposed feature extraction performs slightly better than conventional features such as MFCC and PLP in terms of word recognition accuracy (WRA) and WER under uncontrolled conditions. For the speech corpus collected in the Kannada language, the proposed features show an improvement in WRA of 1.79% over the baseline features.
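
Wavelet packet decomposition for frame-level features can be sketched with PyWavelets: decompose each frame to a fixed depth and take log subband energies. The wavelet choice and depth are illustrative; the paper's exact combination with the Mel filter bank is not reproduced here:

```python
# Minimal sketch: wavelet-packet log subband energies as frame features.
import numpy as np
import pywt

def wp_features(frame: np.ndarray, wavelet: str = "db4", level: int = 3):
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    # Leaf nodes at the deepest level, ordered by frequency band.
    leaves = wp.get_level(level, order="freq")
    energies = np.array([np.sum(node.data ** 2) for node in leaves])
    return np.log(energies + 1e-10)  # one log energy per subband

frame = np.random.randn(400)  # e.g. a 25 ms frame at 16 kHz
print(wp_features(frame))     # 2**3 = 8 log subband energies
```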

