End-to-End Speech Recognition Using Recurrent Neural Network (RNN)

Proceedings of Intelligent Computing and Technologies Conference ◽

10.21467/proceedings.115.20 ◽

2021 ◽

Author(s):

Rene Avalloni de Morais ◽

Baidya Nath Saha

Keyword(s):

Deep Learning ◽

Speech Recognition ◽

Language Processing ◽

High Performance ◽

Short Term Memory ◽

Learning Algorithms ◽

Recognition System ◽

End To End ◽

Performance Computing

Deep learning algorithms have made dramatic progress in natural language processing and automatic speech recognition. However, the accuracy of these algorithms depends on the amount and quality of the data, and training deep models requires high-performance computing resources. Against this backdrop, this paper addresses an end-to-end speech recognition system in which we fine-tune the Mozilla DeepSpeech architecture using two different datasets: the LibriSpeech clean dataset and the Harvard speech dataset. We train Long Short-Term Memory (LSTM) based deep Recurrent Neural Network (RNN) models on the Google Colab platform using its GPU resources. Extensive experimental results demonstrate that the Mozilla DeepSpeech model can be fine-tuned on different audio datasets to recognize speech successfully.


Cross-Language End-to-End Speech Recognition Research Based on Transfer Learning for the Low-Resource Tujia Language

Symmetry ◽

10.3390/sym11020179 ◽

2019 ◽

Vol 11 (2) ◽

pp. 179 ◽

Cited By ~ 4

Author(s):

Chongchong Yu ◽

Yunbing Chen ◽

Yueqiao Li ◽

Meng Kang ◽

Shixuan Xu ◽

...

Keyword(s):

Speech Recognition ◽

Transfer Learning ◽

Short Term Memory ◽

Recognition System ◽

Language Recognition ◽

Low Resource ◽

End To End ◽

The Cross ◽

Hidden Layer ◽

Cross Language

To rescue and preserve an endangered language, this paper studied an end-to-end speech recognition model based on sample transfer learning for the low-resource Tujia language. From the perspective of the Tujia language international phonetic alphabet (IPA) label layer, using a Chinese corpus as an extension of the Tujia language can effectively solve the problem of an insufficient corpus in the Tujia language, constructing a cross-language corpus and an IPA dictionary that is unified between the Chinese and Tujia languages. The convolutional neural network (CNN) and bi-directional long short-term memory (BiLSTM) network were used to extract the cross-language acoustic features and train shared hidden layer weights for the Tujia language and Chinese phonetic corpus. In addition, the automatic speech recognition function of the Tujia language was realized using the end-to-end method that consists of symmetric encoding and decoding. Furthermore, transfer learning was used to establish the model of the cross-language end-to-end Tujia language recognition system. The experimental results showed that the recognition error rate of the proposed model is 46.19%, which is 2.11% lower than that of the model that only used the Tujia language data for training. Therefore, this approach is feasible and effective.


Parallel and Scalable Deep Learning Algorithms for High Performance Computing Architectures

International Journal of Engineering Trends and Technology ◽

10.14445/22315381/ijett-v69i4p232 ◽

2021 ◽

Vol 69 (4) ◽

pp. 236-246

Author(s):

Sunil Pandey ◽

Naresh Kumar Nagwani ◽

Shrish Verma

Keyword(s):

Deep Learning ◽

High Performance Computing ◽

High Performance ◽

Learning Algorithms ◽

Performance Computing


Speech Vision: An End-to-End Deep Learning-based Dysarthric Automatic Speech Recognition System

IEEE Transactions on Neural Systems and Rehabilitation Engineering ◽

10.1109/tnsre.2021.3076778 ◽

2021 ◽

pp. 1-1

Author(s):

Seyed Reza Shahamiri

Keyword(s):

Deep Learning ◽

Speech Recognition ◽

Automatic Speech Recognition ◽

Recognition System ◽

Speech Recognition System ◽

Automatic Speech Recognition System ◽

End To End


Analysis of Parkinson’s Disease using Deep Learning and Word Embedding Models

Academic Perspective Procedia ◽

10.33793/acperpro.02.03.86 ◽

2019 ◽

Vol 2 (3) ◽

pp. 786-797

Author(s):

Feyza Cevik ◽

Zeynep Hilal Kilimci

Keyword(s):

Neural Networks ◽

Social Media ◽

Deep Learning ◽

Short Term Memory ◽

Learning Algorithms ◽

Economic Effects ◽

Word Embedding ◽

Social Media Platforms ◽

Accuracy Performance

Parkinson's disease is a common neurodegenerative neurological disorder, which affects the patient's quality of life, has significant social and economic effects, and is difficult to diagnose early due to the gradual appearance of symptoms. Social media platforms such as Twitter provide a place where patients communicate with each other during both the diagnosis and treatment stages of Parkinson's disease. The purpose of this work is to evaluate and compare the sentiment analysis of people about Parkinson's disease by using deep learning and word embedding models. To the best of our knowledge, this is the very first study to analyze Parkinson's disease from social media by using word embedding models and deep learning algorithms. In this study, Word2Vec, GloVe, and FastText are employed as word embedding models for the purpose of enriching tweets in terms of semantics, context, and syntax. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory Networks (LSTMs) are implemented for the classification task. This study demonstrates the efficiency of using word embedding models and deep learning algorithms to understand the needs of patients and provides a valuable contribution to the treatment process by analyzing their sentiments with 93.63% accuracy.


Extracting Family History of Patients From Clinical Narratives: Exploring an End-to-End Solution With Deep Learning Models (Preprint)

10.2196/preprints.22982 ◽

2020 ◽

Author(s):

Xi Yang ◽

Hansi Zhang ◽

Xing He ◽

Jiang Bian ◽

Yonghui Wu

Keyword(s):

Deep Learning ◽

Family History ◽

Information Extraction ◽

Language Processing ◽

Conditional Random Fields ◽

Short Term Memory ◽

Majority Voting ◽

Learning Models ◽

Concept Extraction ◽

End To End

BACKGROUND Patients’ family history (FH) is a critical risk factor associated with numerous diseases. However, FH information is not well captured in the structured database but often documented in clinical narratives. Natural language processing (NLP) is the key technology to extract patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction. OBJECTIVE This study presents our end-to-end FH extraction system developed during the 2019 n2c2 open shared task as well as the new transformer-based models that we developed after the challenge. We seek to develop a machine learning–based solution for FH information extraction without task-specific rules created by hand. METHODS We developed deep learning–based systems for FH concept extraction and relation identification. We explored deep learning models including long short-term memory-conditional random fields and bidirectional encoder representations from transformers (BERT) as well as developed ensemble models using a majority voting strategy. To further optimize performance, we systematically compared 3 different strategies to use BERT output representations for relation identification. RESULTS Our system was among the top-ranked systems (3 out of 21) in the challenge. Our best system achieved micro-averaged F1 scores of 0.7944 and 0.6544 for concept extraction and relation identification, respectively. After the challenge, we further explored new transformer-based models and improved the performances of both subtasks to 0.8249 and 0.6775, respectively. For relation identification, our system achieved a performance comparable to the best system (0.6810) reported in the challenge. CONCLUSIONS This study demonstrated the feasibility of utilizing deep learning methods to extract FH information from clinical narratives.
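The micro-averaged F1 scores reported above pool true positives, false positives, and false negatives across all documents before computing precision and recall. The following is a minimal sketch of that calculation, assuming each document's annotations are represented as a set of hashable items; the function name `micro_f1` and the tuple layout are hypothetical, and the official n2c2 scorer may apply matching rules that this sketch does not reproduce.

```python
from typing import Hashable, Sequence, Set


def micro_f1(
    gold_per_doc: Sequence[Set[Hashable]],
    pred_per_doc: Sequence[Set[Hashable]],
) -> float:
    """Micro-averaged F1: pool counts over all documents, then score once."""
    true_pos = false_pos = false_neg = 0
    for gold, pred in zip(gold_per_doc, pred_per_doc):
        true_pos += len(gold & pred)   # items found in both sets
        false_pos += len(pred - gold)  # predicted but not annotated
        false_neg += len(gold - pred)  # annotated but not predicted
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (family member, observation) pairs per document.
gold = [{("father", "diabetes")}, {("mother", "asthma"), ("sister", "asthma")}]
pred = [{("father", "diabetes"), ("uncle", "cancer")}, {("mother", "asthma")}]
score = micro_f1(gold, pred)
```

Because counts are pooled before scoring, documents with many annotations influence the result more than sparse ones, which is the main difference from macro averaging.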


Improving Deep Learning based Automatic Speech Recognition for Gujarati

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3483446 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-18

Author(s):

Deepang Raval ◽

Vyom Pathak ◽

Muktan Patel ◽

Brijesh Bhatt

Keyword(s):

Deep Learning ◽

Speech Recognition ◽

Automatic Speech Recognition ◽

Short Term Memory ◽

Language Model ◽

Recognition System ◽

Processing Technique ◽

Speech Corpus ◽

Novel Approach ◽

Asr System

We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as a loss function. To improve the performance of the system with the limited size of the dataset, we present a combined language model (Word-level language Model and Character-level language model)-based prefix decoding technique and Bidirectional Encoder Representations from Transformers-based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us in understanding and improving the ASR system as well as provide intuition into the language used for the ASR system. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.87% decrease in Word Error Rate (WER) with respect to base-model WER.
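Word Error Rate, the metric quoted above, is conventionally the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below illustrates that standard definition; the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and the abstract does not say whether the reported 5.87% decrease is absolute or relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, since the denominator is the reference length only.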


Extracting Family History of Patients From Clinical Narratives: Exploring an End-to-End Solution With Deep Learning Models

JMIR Medical Informatics ◽

10.2196/22982 ◽

2020 ◽

Vol 8 (12) ◽

pp. e22982

Author(s):

Xi Yang ◽

Hansi Zhang ◽

Xing He ◽

Jiang Bian ◽

Yonghui Wu

Keyword(s):

Deep Learning ◽

Family History ◽

Information Extraction ◽

Language Processing ◽

Conditional Random Fields ◽

Short Term Memory ◽

Majority Voting ◽

Learning Models ◽

Concept Extraction ◽

End To End

Background Patients’ family history (FH) is a critical risk factor associated with numerous diseases. However, FH information is not well captured in the structured database but often documented in clinical narratives. Natural language processing (NLP) is the key technology to extract patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction. Objective This study presents our end-to-end FH extraction system developed during the 2019 n2c2 open shared task as well as the new transformer-based models that we developed after the challenge. We seek to develop a machine learning–based solution for FH information extraction without task-specific rules created by hand. Methods We developed deep learning–based systems for FH concept extraction and relation identification. We explored deep learning models including long short-term memory-conditional random fields and bidirectional encoder representations from transformers (BERT) as well as developed ensemble models using a majority voting strategy. To further optimize performance, we systematically compared 3 different strategies to use BERT output representations for relation identification. Results Our system was among the top-ranked systems (3 out of 21) in the challenge. Our best system achieved micro-averaged F1 scores of 0.7944 and 0.6544 for concept extraction and relation identification, respectively. After the challenge, we further explored new transformer-based models and improved the performances of both subtasks to 0.8249 and 0.6775, respectively. For relation identification, our system achieved a performance comparable to the best system (0.6810) reported in the challenge. Conclusions This study demonstrated the feasibility of utilizing deep learning methods to extract FH information from clinical narratives.
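The majority voting strategy mentioned in the Methods can be illustrated with a small token-level example. This is a sketch under stated assumptions, not the authors' ensemble: it assumes every model emits one label per token over the same tokenization, the function name `majority_vote` and the tag names are hypothetical, and ties are broken in favor of the earliest-listed model, a choice the abstract does not specify.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token label sequences from several models by plurality."""
    if not predictions:
        raise ValueError("need at least one model's predictions")
    length = len(predictions[0])
    if any(len(labels) != length for labels in predictions):
        raise ValueError("all models must label the same number of tokens")
    voted = []
    for labels_at_token in zip(*predictions):
        counts = Counter(labels_at_token)
        top = max(counts.values())
        # Tie-break (assumption): keep the label from the earliest model
        # among those sharing the highest count.
        winner = next(lab for lab in labels_at_token if counts[lab] == top)
        voted.append(winner)
    return voted


# Hypothetical BIO-style tags from three models for a three-token sentence.
model_a = ["B-FM", "O", "O"]
model_b = ["B-FM", "B-OBS", "O"]
model_c = ["O", "B-OBS", "B-FM"]
ensemble = majority_vote([model_a, model_b, model_c])
```

Voting per token can produce label sequences that no single model emitted, so a real system may need a consistency pass over the combined tags.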


Design Of A Voice Controlled Home Automation System Using Deep Learning Convolutional Neural Network (DL-CNN)

Telekontran : Jurnal Ilmiah Telekomunikasi, Kendali dan Elektronika Terapan ◽

10.34010/telekontran.v8i1.3078 ◽

2020 ◽

Vol 8 (1) ◽

pp. 57-73

Author(s):

Lery Sakti Ramba

Keyword(s):

Deep Learning ◽

Speech Recognition ◽

Background Noise ◽

Electronic Devices ◽

Recognition System ◽

Background Intensity ◽

Automation System ◽

Home Automation ◽

Speech Recognition System ◽

Home Automation System

The purpose of this research is to design a home automation system that can be controlled using voice commands. This research was conducted by studying other research related to the topic, consulting with competent parties, designing the system, testing it, and analyzing the test results. The voice recognition system was designed using Deep Learning Convolutional Neural Networks (DL-CNN). The CNN model was then trained to recognize several kinds of voice commands. The result of this research is a speech recognition system that can be used to control several electronic devices connected to the system. The speech recognition system has a 100% success rate in room conditions with a background noise intensity of 24 dB (silent), 67.67% with a background noise intensity of 42 dB, and only 51.67% with a background noise intensity of 52 dB (noisy). The success rate of the speech recognition system is strongly influenced by the intensity of background noise in a room. Therefore, to obtain optimal results, the system is better suited for use in rooms with low-intensity background noise.


A Deep Learning based Arabic Script Recognition System: Benchmark on KHAT

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/3/3 ◽

2020 ◽

Vol 17 (3) ◽

pp. 299-305 ◽

Cited By ~ 1

Author(s):

Riaz Ahmad ◽

Saeeda Naz ◽

Muhammad Afzal ◽

Sheikh Rashid ◽

Marcus Liwicki ◽

...

Keyword(s):

Deep Learning ◽

Character Recognition ◽

Data Augmentation ◽

Short Term Memory ◽

Recognition System ◽

Learning Approach ◽

Arabic Text ◽

Data Set ◽

Processing Step ◽

Handwritten Arabic

This paper presents a deep learning benchmark on a complex dataset known as KFUPM Handwritten Arabic TexT (KHATT). The KHATT dataset consists of complex patterns of handwritten Arabic text-lines. This paper contributes mainly in three aspects: (1) pre-processing, (2) a deep learning based approach, and (3) data augmentation. The pre-processing step includes pruning extra white space and de-skewing the skewed text-lines. We deploy a deep learning approach based on Multi-Dimensional Long Short-Term Memory (MDLSTM) networks and Connectionist Temporal Classification (CTC). The MDLSTM has the advantage of scanning the Arabic text-lines in all directions (horizontal and vertical) to cover dots, diacritics, strokes and fine inflammation. Combining data augmentation with the deep learning approach yields a promising improvement, reaching 80.02% Character Recognition (CR) against a 75.08% baseline.
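Connectionist Temporal Classification lets a network emit one label distribution per frame without a frame-level alignment; at inference the simplest decoder takes the best label per frame, merges consecutive repeats, and drops the blank symbol. The sketch below shows only that greedy decoding rule. It is not the paper's decoder: the function name `ctc_greedy_decode`, the blank index of 0, and the toy scores are all assumptions.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [
        max(range(len(frame)), key=lambda k: frame[k]) for frame in frame_scores
    ]
    decoded: List[int] = []
    previous = None
    for label in best_path:
        # A label is emitted only when it differs from the previous frame,
        # so a blank between two identical labels keeps both of them.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def one_hot(index: int, size: int = 3) -> List[float]:
    """Toy per-frame scores that put all mass on one label."""
    return [1.0 if k == index else 0.0 for k in range(size)]


# Frame-wise best path 1 1 _ 1 2 2 _ collapses to the sequence 1 1 2.
frames = [one_hot(i) for i in (1, 1, 0, 1, 2, 2, 0)]
labels = ctc_greedy_decode(frames)
```

Greedy decoding ignores alternative alignments that map to the same output, which is why beam search or language-model rescoring is often layered on top.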


Hierarchical Phoneme Classification for Improved Speech Recognition

Applied Sciences ◽

10.3390/app11010428 ◽

2021 ◽

Vol 11 (1) ◽

pp. 428

Author(s):

Donghoon Oh ◽

Jeong-Sik Park ◽

Ji-Hwan Kim ◽

Gil-Jin Jang

Keyword(s):

Speech Recognition ◽

Language Processing ◽

Confusion Matrix ◽

Critical Factor ◽

Recognition System ◽

Classification Performance ◽

Language Models ◽

Successful Implementation ◽

Phoneme Classification ◽

Improved Performance

Speech recognition consists of converting input sound into a sequence of phonemes, then finding text for the input using language models. Therefore, phoneme classification performance is a critical factor for the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics is still a challenging problem even for state-of-the-art classification methods, and classification errors are hard to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies more suitable recognition models to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Using automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. According to the results of a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% and 71.7% for the baseline and proposed hierarchical models, showing a 2.2% overall improvement.
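The idea of deriving phoneme groups from a baseline confusion matrix can be sketched with a simple stand-in: treat two phonemes as confusable when their averaged mutual confusion rate passes a threshold, then merge confusable pairs into groups. The abstract does not describe the paper's actual clustering algorithm, so the threshold rule, the union-find merging, the function name `group_confusable`, and the toy counts below are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def group_confusable(
    labels: Sequence[str],
    confusion: Sequence[Sequence[int]],
    threshold: float,
) -> List[List[str]]:
    """Group phonemes whose averaged mutual confusion rate is >= threshold.

    confusion[i][j] counts samples of true phoneme i classified as phoneme j.
    """
    n = len(labels)
    row_totals = [sum(row) for row in confusion]
    parent = list(range(n))  # union-find forest over phoneme indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            rate_ij = confusion[i][j] / row_totals[i] if row_totals[i] else 0.0
            rate_ji = confusion[j][i] / row_totals[j] if row_totals[j] else 0.0
            if (rate_ij + rate_ji) / 2 >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[str]] = {}
    for index, label in enumerate(labels):
        groups.setdefault(find(index), []).append(label)
    return list(groups.values())


# Toy counts (not TIMIT results): fricatives s/z and nasals m/n are confused.
phonemes = ["s", "z", "m", "n"]
counts = [
    [80, 18, 1, 1],
    [20, 78, 1, 1],
    [1, 1, 70, 28],
    [0, 2, 30, 68],
]
clusters = group_confusable(phonemes, counts, threshold=0.1)
```

Each resulting group could then be handed to a dedicated classifier, which is the hierarchical step the abstract describes; merging by a single threshold is transitive, so a low threshold can chain unrelated phonemes into one large group.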
