Bidirectional deep architecture for Arabic speech recognition

2019 ◽  
Vol 9 (1) ◽  
pp. 92-102
Author(s):  
Naima Zerari ◽  
Samir Abdelhamid ◽  
Hassen Bouzgou ◽  
Christian Raymond

Nowadays, real-life constraints make it necessary to control modern machines through human intervention by means of the sensory organs. The voice is one of the human faculties that can control/monitor modern interfaces. In this context, Automatic Speech Recognition is principally used to convert natural speech into computer text, as well as to perform an action based on the instructions given by the human. In this paper, we propose a general framework for Arabic speech recognition that uses a Long Short-Term Memory (LSTM) network and a Multi-Layer Perceptron (MLP) classifier to cope with the non-uniform sequence lengths of the speech utterances issued from both feature extraction techniques: (1) Mel-Frequency Cepstral Coefficients (MFCC, static and dynamic features) and (2) Filter Bank (FB) coefficients. The neural architecture recognizes isolated Arabic speech via classification. The proposed system involves, first, extracting pertinent features from the natural speech signal using MFCC (static and dynamic features) and FB. Next, the extracted features are padded in order to deal with the non-uniform sequence lengths. Then, a deep recurrent architecture, either an LSTM or a GRU (Gated Recurrent Unit), is used to encode the sequence of MFCC/FB features as a fixed-size vector that is fed to a Multi-Layer Perceptron (MLP) network to perform the classification (recognition). The proposed system is assessed on two different databases: the first concerns spoken digit recognition, where a comparison with other related works in the literature is performed, whereas the second contains spoken TV commands. The obtained results show the superiority of the proposed approach.
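As a rough illustration (not the authors' exact implementation), the following sketch shows a pipeline of this kind: MFCC extraction with librosa, zero-padding to a fixed length, an LSTM (or GRU) encoder, and an MLP head in Keras; all layer sizes and the digit-class count are illustrative assumptions.

```python
# A rough sketch, not the authors' exact implementation: MFCC extraction,
# zero-padding, an LSTM (or GRU) encoder, and an MLP classification head.
# All sizes and the number of classes are illustrative assumptions.
import numpy as np
import librosa
import tensorflow as tf

NUM_CLASSES = 10   # e.g. isolated spoken digits
MAX_FRAMES = 100   # every feature sequence is padded/truncated to this length
N_FEATS = 39       # 13 static MFCCs + delta + delta-delta

def extract_mfcc(path):
    """Return a (frames, N_FEATS) matrix of static + dynamic MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T

def pad(seq):
    """Zero-pad (or truncate) a feature sequence to MAX_FRAMES frames."""
    out = np.zeros((MAX_FRAMES, N_FEATS), dtype=np.float32)
    out[:min(len(seq), MAX_FRAMES)] = seq[:MAX_FRAMES]
    return out

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(MAX_FRAMES, N_FEATS)),
    tf.keras.layers.LSTM(128),                     # encoder; swap for GRU(128) to test the GRU variant
    tf.keras.layers.Dense(64, activation="relu"),  # MLP classification head
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```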

2020 ◽  
Vol 12 (5) ◽  
pp. 1-8
Author(s):  
Nahyan Al Mahmud ◽  
Shahfida Amjad Munni

The performance of various acoustic feature extraction methods has been compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic features are a series of vectors that represent the speech signal; they can be classified into either words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic vector extraction technique, chosen for its widespread popularity. Other vector extraction techniques, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), have also been used; these two methods closely resemble the human auditory system. LSTM models are then trained on these feature vectors, and the resulting models of different phonemes are compared using statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
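A minimal sketch of such a comparison step, under the simplifying assumption that each phoneme's feature vectors (LPC, MFCC, or PLP) are modelled as a single Gaussian; the Bhattacharyya and Mahalanobis distances are then computed from the Gaussian parameters. The function names and Gaussian modelling are illustrative, not the authors' exact procedure.

```python
# Sketch: model each phoneme as a Gaussian over its acoustic vectors and
# compare two phoneme models with Bhattacharyya and Mahalanobis distances.
import numpy as np

def gaussian_model(features):
    """features: (n_frames, dim) array of acoustic vectors for one phoneme."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def mahalanobis(mu1, mu2, cov):
    diff = mu1 - mu2
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.inv(cov) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)

# Example with random placeholder features standing in for two phonemes.
a = np.random.randn(200, 13)
b = np.random.randn(200, 13) + 0.5
mu_a, cov_a = gaussian_model(a)
mu_b, cov_b = gaussian_model(b)
print(mahalanobis(mu_a, mu_b, 0.5 * (cov_a + cov_b)))
print(bhattacharyya(mu_a, cov_a, mu_b, cov_b))
```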


2020 ◽  
Vol 10 (2) ◽  
pp. 5547-5553
Author(s):  
A. A. Alasadi ◽  
T. H. Aldhayni ◽  
R. R. Deshmukh ◽  
A. H. Alahmadi ◽  
A. S. Alshebami

This paper studies three feature extraction methods, Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF), for the development of an Automatic Speech Recognition (ASR) system for Arabic. The obtained features were processed by the Support Vector Machine (SVM) algorithm. These algorithms extract speech characteristics, with ModGDF operating on the group delay function calculated directly from the voice signal. They were deployed to extract features from recordings of Arabic speakers. PNCC provided the best recognition results for Arabic speech in comparison with the other methods; simulation results showed that both PNCC and ModGDF were more accurate than MFCC in Arabic speech recognition.
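For the MFCC + SVM branch of such a comparison, a sketch of the kind below could be used, assuming librosa for feature extraction and scikit-learn for the classifier; PNCC and ModGDF are not provided by these libraries and would need separate implementations, and the data fed to the classifier here are random placeholders rather than real Arabic utterances.

```python
# Sketch of the MFCC + SVM branch only; placeholder data, illustrative settings.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def utterance_features(path, n_mfcc=13):
    """Mean MFCC vector as a fixed-length representation of one utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Placeholders standing in for features extracted from Arabic utterances.
X = np.random.randn(200, 13)            # one mean-MFCC vector per utterance
y = np.random.randint(0, 10, size=200)  # 10 hypothetical word classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```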


Biomimetics ◽  
2019 ◽  
Vol 5 (1) ◽  
pp. 1 ◽  
Author(s):  
Michelle Gutiérrez-Muñoz ◽  
Astryd González-Salazar ◽  
Marvin Coto-Jiménez

Speech signals are degraded in real-life environments as a product of background noise or other factors. The processing of such signals for voice recognition and voice analysis systems presents important challenges. One of the conditions that make adverse quality difficult to handle in those systems is reverberation, produced by sound wave reflections that travel from the source to the microphone in multiple directions. To enhance signals in such adverse conditions, several deep learning-based methods have been proposed and proven to be effective. Recently, recurrent neural networks, especially those with long short-term memory (LSTM), have presented surprising results in tasks related to time-dependent processing of signals, such as speech. One of the most challenging aspects of LSTM networks is the high computational cost of the training procedure, which has limited extended experimentation in several cases. In this work, we present a proposal to evaluate hybrid neural network models that learn different reverberation conditions without any prior information. The results show that some combinations of LSTM and perceptron layers produce good results compared with those from pure LSTM networks, given a fixed number of layers. The evaluation was based on quality measurements of the signal's spectrum, the training time of the networks, and statistical validation of the results. In total, 120 artificial neural networks of eight different types were trained and compared. The results support the claim that hybrid networks represent an important solution for speech signal enhancement, given that the reduction in training time is on the order of 30%, in processes that can normally take several days or weeks depending on the amount of data. The results also show advantages in efficiency without a significant drop in quality.
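One plausible form of such a hybrid network, sketched in Keras below, mixes an LSTM layer with perceptron (Dense) layers that map reverberant spectral frames to enhanced ones; the layer sizes and the 257-bin spectrum are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a hybrid LSTM + perceptron enhancement network; all sizes assumed.
import tensorflow as tf

N_BINS = 257     # e.g. magnitude-spectrum bins for a 512-point FFT
SEQ_LEN = 50     # frames per training sequence

hybrid = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, N_BINS)),
    tf.keras.layers.LSTM(256, return_sequences=True),        # recurrent layer
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(256, activation="relu")),       # perceptron layer
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(N_BINS, activation="linear")),  # enhanced spectrum
])
hybrid.compile(optimizer="adam", loss="mse")
hybrid.summary()
```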


Author(s):  
B Birch ◽  
CA Griffiths ◽  
A Morgan

Collaborative robots are becoming increasingly important for advanced manufacturing processes. The purpose of this paper is to determine the capability of a novel human-robot interface to be used for machine hole drilling. Using a developed voice activation system, environmental factors affecting speech recognition accuracy are considered. The research investigates the accuracy of a Mel-Frequency Cepstral Coefficients-based feature extraction algorithm that uses Dynamic Time Warping to compare an utterance against a limited, user-dependent dictionary. The developed speech recognition method allows for human-robot interaction through a novel integration between the voice recognition system and the robot. The system can be utilised in many manufacturing environments where robot motions can be coupled to voice inputs rather than time-consuming physical interfaces. However, there are limitations to uptake in industries where the volume of background machine noise is high.
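A minimal sketch of this kind of recognition scheme, assuming librosa for MFCC extraction and a plain textbook DTW implementation over a small, user-dependent template dictionary; names and parameters are illustrative.

```python
# Sketch: compare an utterance's MFCC sequence against command templates
# with dynamic time warping and pick the closest match.
import numpy as np
import librosa

def mfcc_sequence(y, sr=16000):
    """Return the (frames, 13) MFCC sequence of a waveform."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

def dtw_distance(a, b):
    """Dynamic time warping distance between two (frames, dim) sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognise(utterance_mfcc, dictionary):
    """dictionary: {command_word: template MFCC sequence}; returns best match."""
    return min(dictionary, key=lambda w: dtw_distance(utterance_mfcc, dictionary[w]))
```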


2004 ◽  
Author(s):  
Dimitra Vergyri ◽  
Katrin Kirchhoff ◽  
Kevin Duh ◽  
Andreas Stolcke

2021 ◽  
Author(s):  
Jing Yuan ◽  
Zijie Wang ◽  
Dehe Yang ◽  
Qiao Wang ◽  
Zeren Zima ◽  
...  

Lightning whistlers, found frequently in electromagnetic satellite observations, are an important tool for studying the electromagnetic environment of near-Earth space. With the increasing data volume from electromagnetic satellites, considerable time and human effort are needed to detect lightning whistlers in these data. In recent years, algorithms for automatic lightning whistler detection have been developed. However, these methods only work on the time-frequency profile (image) of the electromagnetic satellite data and have two major limitations: vast storage requirements for the time-frequency profiles and expensive computation when applying the methods to detect whistlers from those profiles automatically. These limitations hinder the methods from working efficiently on the ZH-1 satellite. To overcome them and achieve real-time, automatic whistler detection on board the satellite, we propose a novel algorithm that detects lightning whistlers from the originally observed data without transforming it into a time-frequency profile (image).

The motivation is that the frequency of lightning whistlers lies in the audio frequency range, which encourages us to apply speech recognition techniques to recognize whistlers in the original data of the SCM VLF instrument on board ZH-1. Firstly, we slide a 0.16-second window over the original data to obtain patches that are treated as audio clips. Secondly, we extract the Mel-frequency cepstral coefficients (MFCCs) of each patch as a cepstral representation of the clip. Thirdly, the MFCCs are input to a Long Short-Term Memory (LSTM) recurrent neural network for classification. To evaluate the proposed method, we construct a dataset composed of 10,000 segments of SCM wave data observed by the ZH-1 satellite (5,000 segments containing whistlers and 5,000 segments without any whistler). The proposed method achieves 84% accuracy, 87% recall, and an F1 score of 85.6%. Furthermore, it saves more than 126.7 MB of storage and 0.82 seconds of computation compared with a method employing the YOLOv3 neural network to detect whistlers on each time-frequency profile.

Keywords: ZH-1 satellite, SCM, lightning whistler, MFCC, LSTM
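A minimal sketch of the described detection chain: a 0.16 s window slides over the raw SCM waveform, MFCCs are computed for each window, and an LSTM classifies the window as whistler or non-whistler. The sampling rate, hop size, and layer sizes below are illustrative assumptions, not values given by the authors.

```python
# Sketch: windowed MFCC extraction followed by binary LSTM classification.
import numpy as np
import librosa
import tensorflow as tf

SR = 51200                  # assumed SCM VLF sampling rate (placeholder)
WIN = int(0.16 * SR)        # 0.16 s analysis window
HOP = WIN                   # non-overlapping windows, per the abstract

def window_mfcc(signal):
    """Yield a (frames, 13) MFCC sequence for each 0.16 s window."""
    for start in range(0, len(signal) - WIN + 1, HOP):
        patch = signal[start:start + WIN].astype(np.float32)
        yield librosa.feature.mfcc(y=patch, sr=SR, n_mfcc=13).T

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(None, 13)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # whistler vs. no whistler
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```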

