speech corpus
Recently Published Documents


TOTAL DOCUMENTS: 458 (FIVE YEARS: 149)

H-INDEX: 17 (FIVE YEARS: 4)

Author(s):  
Deepang Raval ◽  
Vyom Pathak ◽  
Muktan Patel ◽  
Brijesh Bhatt

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes a Convolutional Neural Network, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as the loss function. To improve the performance of the system given the limited size of the dataset, we present a combined language model (word-level and character-level) prefix decoding technique and a Bidirectional Encoder Representations from Transformers-based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we analysed the system's inferences using several proposed analysis methods. These insights help in understanding and improving the ASR system and provide intuition about the language it is applied to. We trained the model on the Microsoft Speech Corpus and observe a 5.87% decrease in Word Error Rate (WER) relative to the base model.


Author(s):  
Héctor A. Sánchez-Hevia ◽  
Roberto Gil-Pita ◽  
Manuel Utrilla-Manso ◽  
Manuel Rosa-Zurera

This paper analyses the performance of different types of Deep Neural Networks for jointly estimating age and identifying gender from speech, to be applied in Interactive Voice Response systems used in call centres. Deep Neural Networks are used because they have recently demonstrated strong discriminative and representation capabilities in a wide range of applications, including speech processing problems based on feature extraction and selection. Networks of different sizes are analysed to determine how performance depends on the network architecture and the number of free parameters. The speech corpus used for the experiments is Mozilla's Common Voice dataset, an open and crowdsourced speech corpus. The results for gender classification are very good regardless of the type of neural network and improve with network size. For classification by age group, the combination of convolutional and temporal neural networks is the best option among those analysed, and again, the larger the network, the better the results. The results are promising for use in IVR systems, with the best systems achieving a gender identification error below 2% and an age-group classification error below 20%.


Electronics ◽  
2022 ◽  
Vol 11 (1) ◽  
pp. 168
Author(s):  
Mohsen Bakouri ◽  
Mohammed Alsehaimi ◽  
Husham Farouk Ismail ◽  
Khaled Alshareef ◽  
Ali Ganoun ◽  
...  

Many wheelchair users depend on others to control the movement of their wheelchairs, which significantly affects their independence and quality of life. Smart wheelchairs offer a degree of self-dependence and the freedom to drive their own vehicles. In this work, we designed and implemented a low-cost software and hardware method to steer a robotic wheelchair. Moreover, from this method, we developed our own Android mobile app based on Flutter software. A convolutional neural network (CNN)-based network-in-network (NIN) architecture integrated with a voice recognition model was also developed and configured to build the mobile app. The technique was implemented and configured using an offline Wi-Fi hotspot between the software and hardware components. Five voice commands (yes, no, left, right, and stop) guided and controlled the wheelchair through the Raspberry Pi and DC motor drives. The overall system was evaluated on an English isolated-word speech corpus trained and validated with recordings by native Arabic speakers, to assess the performance of the Android OS application. Maneuverability was also evaluated in terms of accuracy for indoor and outdoor navigation. The results indicated an accuracy of approximately 87.2% in recognizing the five voice commands. Additionally, in the real-time performance test, the root-mean-square deviation (RMSD) values between the planned and actual nodes for indoor and outdoor maneuvering were 1.721 × 10−5 and 1.743 × 10−5, respectively.


Author(s):  
Alaa Ehab Sakran ◽  
Mohsen Rashwan ◽  
Sherif Mahdy Abdou

In this paper, an automatic phoneme-level segmentation system was built using the Kaldi toolkit for a Quran verses dataset consisting of a speech corpus of 80 hours and its corresponding text corpus, comprising 1100 recorded Quran verses from 100 non-Arab reciters. Starting with the extraction of Mel Frequency Cepstral Coefficients (MFCCs), the language model (LM) and acoustic model (AM) training phases were carried through to the Deep Neural Network (DNN) level using 770 recordings (70 reciters) for training. The system was tested on 220 recordings (20 reciters), and a development set of 280 recordings (10 reciters) was also selected. Automatic and manual segmentation were compared, and with Time Delay Neural Network (TDNN)-based acoustic modelling the result obtained was 99% on both the test set and the development set.


Author(s):  
Татьяна Николаевна Балабанова ◽  
Алексей Владимирович Болдышев ◽  
Сергей Вячеславович Уманец

In this work, the speech signal is treated as a set of fragments containing speech components and noise fragments corresponding to the pauses between words. The task is to construct a decision function capable of accepting or rejecting the hypothesis that speech is absent in a given segment of the speech signal. Using a subband method, the energy distribution over frequencies is computed for each segment of the speech signal. This distribution is then approximated by a mixture of radial basis functions (Gaussian functions). The mixture is a weighted sum of radial basis functions and a uniformly distributed component. A decision rule is formed from the ratio of the maximum values of the mixture components. For the computational experiment, a "dead zone" nonlinearity is introduced, a choice motivated by the characteristics of the electrical activity of the pathways and centres of the auditory system. The paper presents the results of applying the algorithm for detecting pauses in a speech signal. The labelled speech database of the US Defense Advanced Research Projects Agency, the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, was used as the working material. A total of 100 recordings were processed, with an analysis segment length of 9 milliseconds and a sampling rate of 16000 Hz. To verify the performance of the proposed algorithm, type I errors ("missed target", when the algorithm did not mark a pause that was present in the manual annotation) and type II errors ("false alarm", when a pause was marked erroneously) were evaluated. The results obtained in the computational experiments indicate that the proposed approach is quite effective for detecting pauses in a speech signal.


Author(s):  
Shafkat Kibria ◽  
Ahnaf Mozib Samin ◽  
M. Humayon Kobir ◽  
M. Shahidur Rahman ◽  
M. Reza Selim ◽  
...  

Author(s):  
Linda Gaile

Research on the simultaneous interpreting process and the associated source and target languages requires both the oral source speeches and their simultaneous interpretation into the target language. Only relatively recently have translation and interpreting researchers been able to access digitized linguistic corpora, including parallel and speech corpora for different language pairs, from which they can build their own purpose-oriented corpora of original and target-language oral texts. Such a corpus can then be analysed qualitatively or quantitatively using different software and investigated for specific linguistic phenomena. This article focuses on the benefits of retrieving data from digitized language and speech corpora, which can be an important aid for analysing the orally delivered target text of simultaneous interpretation. At the heart of this question is the European Parliament speech corpus, from which authentic speeches in the source language (German) and their simultaneous interpretation into the target language (Latvian) can be obtained to create a sub-corpus for the German-Latvian language pair. Among other questions, the article explores which interpreting strategies can be used for simultaneous interpreting from German into Latvian, and it presents the application of the EXMARaLDA Partitur-Editor software, which makes it possible to create a time-aligned transcription of the source language and the simultaneously interpreted target language, as well as to develop a speech corpus.


Author(s):  
Shinnosuke Isobe ◽  
Ryuichi Hirose ◽  
Takumi Nishiwaki ◽  
Tomohiro Hattori ◽  
Satoshi Tamura ◽  
...  
