Bidirectional deep architecture for Arabic speech recognition

2019 ◽  
Vol 9 (1) ◽  
pp. 92-102
Author(s):  
Naima Zerari ◽  
Samir Abdelhamid ◽  
Hassen Bouzgou ◽  
Christian Raymond

Nowadays, real-life constraints make it necessary to control modern machines through human intervention by means of the sensory organs. The voice is one of the human faculties that can control/monitor modern interfaces. In this context, Automatic Speech Recognition is principally used to convert natural speech into computer text, as well as to perform an action based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a Multi-Layer Perceptron (MLP) classifier to cope with the non-uniform sequence lengths of the speech utterances issued from both feature extraction techniques: (1) Mel-Frequency Cepstral Coefficients (MFCC, static and dynamic features) and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech via classification. The proposed system involves, first, extracting pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded in order to deal with the non-uniform sequence lengths. Then, a deep recurrent architecture, either an LSTM or a GRU (Gated Recurrent Unit), is used to encode the sequence of MFCC/FB features as a fixed-size vector that is fed to a Multi-Layer Perceptron (MLP) network to perform the classification (recognition). The proposed system is assessed on two different databases: the first concerns spoken digit recognition, where a comparison with other related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
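As a rough illustration (not the authors' exact implementation), the following sketch shows a pipeline of this kind: MFCC extraction with librosa, zero-padding to a fixed length, an LSTM (or GRU) encoder, and an MLP head in Keras; all layer sizes and the digit-class count are illustrative assumptions.

```python
# A rough sketch, not the authors' exact implementation: MFCC extraction,
# zero-padding, an LSTM (or GRU) encoder, and an MLP classification head.
# All sizes and the number of classes are illustrative assumptions.
import numpy as np
import librosa
import tensorflow as tf

NUM_CLASSES = 10   # e.g. isolated spoken digits
MAX_FRAMES = 100   # every feature sequence is padded/truncated to this length
N_FEATS = 39       # 13 static MFCCs + delta + delta-delta

def extract_mfcc(path):
    """Return a (frames, N_FEATS) matrix of static + dynamic MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T

def pad(seq):
    """Zero-pad (or truncate) a feature sequence to MAX_FRAMES frames."""
    out = np.zeros((MAX_FRAMES, N_FEATS), dtype=np.float32)
    out[:min(len(seq), MAX_FRAMES)] = seq[:MAX_FRAMES]
    return out

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(MAX_FRAMES, N_FEATS)),
    tf.keras.layers.LSTM(128),                     # encoder; swap for GRU(128) to test the GRU variant
    tf.keras.layers.Dense(64, activation="relu"),  # MLP classification head
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```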

2020 ◽  
Vol 12 (5) ◽  
pp. 1-8
Author(s):  
Nahyan Al Mahmud ◽  
Shahfida Amjad Munni

The performance of various acoustic feature extraction methods has been compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic features are a series of vectors that represent the speech signal; they can be classified into either words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic vector extraction technique, chosen for its widespread popularity. Other vector extraction techniques, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), have also been used; these two methods closely resemble the human auditory system. LSTM models are then trained on these feature vectors, and the resulting models of different phonemes are compared using statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
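A minimal sketch of such a comparison step, under the simplifying assumption that each phoneme's feature vectors (LPC, MFCC, or PLP) are modelled as a single Gaussian; the Bhattacharyya and Mahalanobis distances are then computed from the Gaussian parameters. The function names and Gaussian modelling are illustrative, not the authors' exact procedure.

```python
# Sketch: model each phoneme as a Gaussian over its acoustic vectors and
# compare two phoneme models with Bhattacharyya and Mahalanobis distances.
import numpy as np

def gaussian_model(features):
    """features: (n_frames, dim) array of acoustic vectors for one phoneme."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def mahalanobis(mu1, mu2, cov):
    diff = mu1 - mu2
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.inv(cov) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)

# Example with random placeholder features standing in for two phonemes.
a = np.random.randn(200, 13)
b = np.random.randn(200, 13) + 0.5
mu_a, cov_a = gaussian_model(a)
mu_b, cov_b = gaussian_model(b)
print(mahalanobis(mu_a, mu_b, 0.5 * (cov_a + cov_b)))
print(bhattacharyya(mu_a, cov_a, mu_b, cov_b))
```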


2020 ◽  
Vol 10 (2) ◽  
pp. 5547-5553
Author(s):  
A. A. Alasadi ◽  
T. H. Aldhayni ◽  
R. R. Deshmukh ◽  
A. H. Alahmadi ◽  
A. S. Alshebami

This paper studies three feature extraction methods, Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF), for the development of an Automatic Speech Recognition (ASR) system for Arabic. The obtained features were processed by the Support Vector Machine (SVM) algorithm. These algorithms extract speech characteristics, with ModGDF operating on the group delay function calculated directly from the voice signal. They were deployed to extract features from recordings of Arabic speakers. PNCC provided the best recognition results for Arabic speech in comparison with the other methods; simulation results showed that both PNCC and ModGDF were more accurate than MFCC in Arabic speech recognition.
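For the MFCC + SVM branch of such a comparison, a sketch of the kind below could be used, assuming librosa for feature extraction and scikit-learn for the classifier; PNCC and ModGDF are not provided by these libraries and would need separate implementations, and the data fed to the classifier here are random placeholders rather than real Arabic utterances.

```python
# Sketch of the MFCC + SVM branch only; placeholder data, illustrative settings.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def utterance_features(path, n_mfcc=13):
    """Mean MFCC vector as a fixed-length representation of one utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Placeholders standing in for features extracted from Arabic utterances.
X = np.random.randn(200, 13)            # one mean-MFCC vector per utterance
y = np.random.randint(0, 10, size=200)  # 10 hypothetical word classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```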


Biomimetics ◽  
2019 ◽  
Vol 5 (1) ◽  
pp. 1 ◽  
Author(s):  
Michelle Gutiérrez-Muñoz ◽  
Astryd González-Salazar ◽  
Marvin Coto-Jiménez

Speech signals are degraded in real-life environments as a product of background noise or other factors. The processing of such signals for voice recognition and voice analysis systems presents important challenges. One of the conditions that make adverse quality difficult to handle in those systems is reverberation, produced by sound wave reflections that travel from the source to the microphone in multiple directions. To enhance signals in such adverse conditions, several deep learning-based methods have been proposed and proven to be effective. Recently, recurrent neural networks, especially those with long short-term memory (LSTM), have presented surprising results in tasks related to time-dependent processing of signals, such as speech. One of the most challenging aspects of LSTM networks is the high computational cost of the training procedure, which has limited extended experimentation in several cases. In this work, we present a proposal to evaluate hybrid neural network models that learn different reverberation conditions without any prior information. The results show that some combinations of LSTM and perceptron layers produce good results compared with those from pure LSTM networks, given a fixed number of layers. The evaluation was based on quality measurements of the signal's spectrum, the training time of the networks, and statistical validation of the results. In total, 120 artificial neural networks of eight different types were trained and compared. The results support the claim that hybrid networks represent an important solution for speech signal enhancement, given that the reduction in training time is on the order of 30%, in processes that can normally take several days or weeks depending on the amount of data. The results also show advantages in efficiency without a significant drop in quality.
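One plausible form of such a hybrid network, sketched in Keras below, mixes an LSTM layer with perceptron (Dense) layers that map reverberant spectral frames to enhanced ones; the layer sizes and the 257-bin spectrum are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a hybrid LSTM + perceptron enhancement network; all sizes assumed.
import tensorflow as tf

N_BINS = 257     # e.g. magnitude-spectrum bins for a 512-point FFT
SEQ_LEN = 50     # frames per training sequence

hybrid = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, N_BINS)),
    tf.keras.layers.LSTM(256, return_sequences=True),        # recurrent layer
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(256, activation="relu")),       # perceptron layer
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(N_BINS, activation="linear")),  # enhanced spectrum
])
hybrid.compile(optimizer="adam", loss="mse")
hybrid.summary()
```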


Author(s):  
B Birch ◽  
CA Griffiths ◽  
A Morgan

Collaborative robots are becoming increasingly important for advanced manufacturing processes. The purpose of this paper is to determine the capability of a novel human-robot interface to be used for machine hole drilling. Using a developed voice activation system, environmental factors affecting speech recognition accuracy are considered. The research investigates the accuracy of a Mel-Frequency Cepstral Coefficients-based feature extraction algorithm that uses Dynamic Time Warping to compare an utterance against a limited, user-dependent dictionary. The developed speech recognition method allows for human-robot interaction through a novel integration between the voice recognition system and the robot. The system can be utilised in many manufacturing environments where robot motions can be coupled to voice inputs rather than time-consuming physical interfaces. However, there are limitations to uptake in industries where the volume of background machine noise is high.
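A minimal sketch of this kind of recognition scheme, assuming librosa for MFCC extraction and a plain textbook DTW implementation over a small, user-dependent template dictionary; names and parameters are illustrative.

```python
# Sketch: compare an utterance's MFCC sequence against command templates
# with dynamic time warping and pick the closest match.
import numpy as np
import librosa

def mfcc_sequence(y, sr=16000):
    """Return the (frames, 13) MFCC sequence of a waveform."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

def dtw_distance(a, b):
    """Dynamic time warping distance between two (frames, dim) sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognise(utterance_mfcc, dictionary):
    """dictionary: {command_word: template MFCC sequence}; returns best match."""
    return min(dictionary, key=lambda w: dtw_distance(utterance_mfcc, dictionary[w]))
```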


2004 ◽  
Author(s):  
Dimitra Vergyri ◽  
Katrin Kirchhoff ◽  
Kevin Duh ◽  
Andreas Stolcke

2021 ◽  
Author(s):  
Jing Yuan ◽  
Zijie Wang ◽  
Dehe Yang ◽  
Qiao Wang ◽  
Zeren Zima ◽  
...  

Lightning whistlers, found frequently in electromagnetic satellite observations, are an important tool for studying the electromagnetic environment of near-Earth space. With the increasing data volume from electromagnetic satellites, considerable time and human effort are needed to detect lightning whistlers in these data. In recent years, algorithms for automatic lightning whistler detection have been developed. However, these methods only work on the time-frequency profile (image) of the electromagnetic satellite data and have two major limitations: vast storage requirements for the time-frequency profiles and expensive computation when applying the methods to detect whistlers from those profiles automatically. These limitations hinder the methods from working efficiently on the ZH-1 satellite. To overcome them and achieve real-time, automatic whistler detection on board the satellite, we propose a novel algorithm that detects lightning whistlers from the originally observed data without transforming it into a time-frequency profile (image).

The motivation is that the frequency of lightning whistlers lies in the audio frequency range, which encourages us to apply speech recognition techniques to recognize whistlers in the original data of the SCM VLF instrument on board ZH-1. Firstly, we slide a 0.16-second window over the original data to obtain patches that are treated as audio clips. Secondly, we extract the Mel-frequency cepstral coefficients (MFCCs) of each patch as a cepstral representation of the clip. Thirdly, the MFCCs are input to a Long Short-Term Memory (LSTM) recurrent neural network for classification. To evaluate the proposed method, we construct a dataset composed of 10,000 segments of SCM wave data observed by the ZH-1 satellite (5,000 segments containing whistlers and 5,000 segments without any whistler). The proposed method achieves 84% accuracy, 87% recall, and an F1 score of 85.6%. Furthermore, it saves more than 126.7 MB of storage and 0.82 seconds of computation compared with a method employing the YOLOv3 neural network to detect whistlers on each time-frequency profile.

Keywords: ZH-1 satellite, SCM, lightning whistler, MFCC, LSTM
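A minimal sketch of the described detection chain: a 0.16 s window slides over the raw SCM waveform, MFCCs are computed for each window, and an LSTM classifies the window as whistler or non-whistler. The sampling rate, hop size, and layer sizes below are illustrative assumptions, not values given by the authors.

```python
# Sketch: windowed MFCC extraction followed by binary LSTM classification.
import numpy as np
import librosa
import tensorflow as tf

SR = 51200                  # assumed SCM VLF sampling rate (placeholder)
WIN = int(0.16 * SR)        # 0.16 s analysis window
HOP = WIN                   # non-overlapping windows, per the abstract

def window_mfcc(signal):
    """Yield a (frames, 13) MFCC sequence for each 0.16 s window."""
    for start in range(0, len(signal) - WIN + 1, HOP):
        patch = signal[start:start + WIN].astype(np.float32)
        yield librosa.feature.mfcc(y=patch, sr=SR, n_mfcc=13).T

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(None, 13)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # whistler vs. no whistler
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```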

