Qualitative Analysis of PLP in LSTM for Bangla Speech Recognition

2020 ◽  
Vol 12 (5) ◽  
pp. 1-8
Author(s):  
Nahyan Al Mahmud ◽  
Shahfida Amjad Munni

The performance of various acoustic feature extraction methods is compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic features are a series of vectors that represent the speech signal; they can be classified into either words or subword units such as phonemes. In this work, linear predictive coding (LPC) is first used as the acoustic vector extraction technique, chosen for its widespread popularity. Other vector extraction techniques, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), are then also used; these two methods closely resemble the human auditory system. The feature vectors are used to train the LSTM neural network, and the resulting models of different phonemes are compared using statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
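As a concrete illustration, the two model-comparison measures named in the abstract can be computed for Gaussian phoneme models. This is a minimal sketch using NumPy; the means and covariances below are hypothetical stand-ins, not values from the paper.

```python
import numpy as np

def mahalanobis(mu1, mu2, cov):
    """Mahalanobis distance between two means under a shared covariance."""
    d = mu1 - mu2
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = (cov1 + cov2) / 2.0
    d = mu1 - mu2
    term1 = 0.125 * d @ np.linalg.inv(cov) @ d
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)

# Hypothetical 2-D "phoneme models": mean and covariance of feature vectors.
mu_a, cov_a = np.array([0.0, 0.0]), np.eye(2)
mu_b, cov_b = np.array([1.0, 1.0]), np.eye(2)
d_mah = mahalanobis(mu_a, mu_b, cov_a)
d_bha = bhattacharyya(mu_a, cov_a, mu_b, cov_b)
```

With identical covariances the Bhattacharyya distance reduces to one eighth of the squared Mahalanobis distance, which the toy values above confirm.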

Author(s):  
Gurpreet Kaur ◽  
Mohit Srivastava ◽  
Amod Kumar

Huge growth is observed in the speech and speaker recognition field due to the many artificial intelligence algorithms being applied. Speech conveys messages via the language being spoken, emotions, gender, and speaker identity. Many real applications in healthcare are based upon speech and speaker recognition, e.g., a voice-controlled wheelchair. In this paper, we use a genetic algorithm (GA) for combined speaker and speech recognition, relying on optimized Mel Frequency Cepstral Coefficient (MFCC) speech features, and classification is performed using a Deep Neural Network (DNN). In the first phase, feature extraction using MFCC is executed, and feature optimization is then performed using the GA. In the second phase, training is conducted using the DNN. Evaluation and validation of the proposed model are done in a real environment, and efficiency is calculated on the basis of parameters such as accuracy, precision rate, recall rate, sensitivity, and specificity. This paper also presents an evaluation of feature extraction methods, namely linear predictive coding coefficients (LPCC), perceptual linear prediction (PLP), Mel frequency cepstral coefficients (MFCC), and relative spectra filtering (RASTA), all of them used for combined speaker and speech recognition systems. A comparison of the different methods against existing techniques, for both clean and noisy environments, is made as well.
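The GA-based feature optimization described above can be sketched as a search over binary masks that keep or drop individual coefficients. The fitness function below is a toy surrogate (in the paper, fitness would be recognition accuracy), and all parameter values are illustrative assumptions.

```python
import random
random.seed(0)

N_FEATS = 13  # e.g., 13 MFCCs per frame

def fitness(mask):
    # Toy surrogate: reward keeping the first six coefficients, penalize the rest.
    # A real system would score each mask by recognition accuracy instead.
    return sum(mask[:6]) - 0.5 * sum(mask[6:])

def ga_select(pop_size=20, generations=40, p_mut=0.1):
    # Population of random binary keep/drop masks.
    pop = [[random.randint(0, 1) for _ in range(N_FEATS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]              # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_FEATS)       # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [1 - g if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best_mask = ga_select()
```

Because the top half of each generation survives unchanged, the best fitness never decreases, and the search quickly converges toward masks that keep the rewarded coefficients.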


2020 ◽  
Vol 17 (1) ◽  
pp. 303-307
Author(s):  
S. Lalitha ◽  
Deepa Gupta

Mel Frequency Cepstral Coefficients (MFCCs) and perceptual linear prediction coefficients (PLPCs) are widely used nonlinear vocal parameters in the majority of speaker identification, speaker and speech recognition techniques, as well as in the field of emotion recognition. Since the 1980s, significant effort has been devoted to the development of these features. Considerations such as the use of appropriate frequency estimation approaches, the design of suitable filter banks, and the selection of preferred features play a vital part in the strength of models employing these features. This article presents an overview of MFCC and PLPC features for different speech applications. Insights such as accuracy metrics, background environment, type of data, and feature size are inspected and summarized with the corresponding key references. In addition, the advantages and shortcomings of these features are discussed. This background work will hopefully serve as a step toward the enhancement of MFCC and PLPC in terms of novelty, higher accuracy, and lower complexity.
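The filter-bank design mentioned above rests on the mel scale, which spaces filters linearly in mel units rather than in hertz. A minimal sketch of the standard conversion formulas and the resulting filter edge frequencies (the filter count and band limits below are common choices, not values from the article):

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to mels (O'Shaughnessy formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse conversion: mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_filters, f_min, f_max):
    """Edge/center frequencies (Hz) of a triangular mel filter bank:
    n_filters filters need n_filters + 2 equally mel-spaced points."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)

edges = mel_filter_edges(26, 0.0, 8000.0)  # 26 filters over 0-8 kHz
```

The edges are dense at low frequencies and sparse at high frequencies, mirroring the ear's finer resolution at low frequencies.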


2014 ◽  
Vol 571-572 ◽  
pp. 205-208
Author(s):  
Guan Yu Li ◽  
Hong Zhi Yu ◽  
Yong Hong Li ◽  
Ning Ma

Speech feature extraction is discussed, and the Mel frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) methods are analyzed. These two types of features are extracted in a Lhasa large-vocabulary continuous speech recognition system, and the recognition results are compared.
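Both MFCC and PLP extraction share the same short-time front end: pre-emphasis, framing into overlapping windows, and windowing. A minimal sketch of that shared step, with typical frame and hop sizes assumed rather than taken from the paper:

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, hop_ms=10, pre_emph=0.97):
    """Pre-emphasize, split into overlapping frames, and apply a Hamming
    window -- the front end shared by MFCC and PLP extraction."""
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    # Build an index matrix: one row of sample indices per frame.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return emphasized[idx] * np.hamming(frame_len)

sr = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
frames = frame_signal(sig, sr)
```

At 16 kHz with 25 ms frames and a 10 ms hop, one second of audio yields 98 frames of 400 samples each.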


PLoS ONE ◽  
2021 ◽  
Vol 16 (12) ◽  
pp. e0259140
Author(s):  
Cihun-Siyong Alex Gong ◽  
Chih-Hui Simon Su ◽  
Kuo-Wei Chao ◽  
Yi-Chu Chao ◽  
Chin-Kai Su ◽  
...  

The research describes the recognition and classification of the acoustic characteristics of amphibians using deep learning with a deep neural network (DNN) and long short-term memory (LSTM) for biological applications. First, original data are collected from 32 species of frogs and 3 species of toads commonly found in Taiwan. Secondly, two digital filtering algorithms, linear predictive coding (LPC) and Mel-frequency cepstral coefficients (MFCC), are respectively used to extract amphibian bioacoustic features and construct the datasets. In addition, the principal component analysis (PCA) algorithm is applied to reduce the dimensionality of the training datasets. Next, the classification of amphibian bioacoustic features is accomplished using the DNN and LSTM. The PyTorch platform with a GPU (NVIDIA GeForce GTX 1050 Ti) performs the computation and recognition of the acoustic feature classification. Based on the two algorithms above, the sound feature datasets are classified and summarized in several classification result tables and graphs. The classification results for the different bioacoustic features are verified and discussed in detail. The research seeks to identify the best-performing combination of recognition and classification algorithms across all experiments.
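The PCA dimensionality-reduction step can be sketched in a few lines via the singular value decomposition; the data here are random placeholders, and the feature and component counts are illustrative assumptions, not the paper's.

```python
import numpy as np
rng = np.random.default_rng(0)

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # scores in component space

X = rng.normal(size=(100, 20))   # hypothetical 20-D bioacoustic feature rows
Z = pca_reduce(X, 5)             # reduce to 5 dimensions
```

Because the singular values come back in descending order, the first projected column always carries at least as much variance as the second, which makes truncation to the top-k components the optimal linear reduction in the least-squares sense.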


2021 ◽  
Vol 13 (4) ◽  
pp. 628
Author(s):  
Liang Ye ◽  
Tong Liu ◽  
Tian Han ◽  
Hany Ferdinando ◽  
Tapio Seppänen ◽  
...  

Campus violence is a common social phenomenon all over the world and is the most harmful type of school bullying event. As artificial intelligence and remote sensing techniques develop, several methods become possible for detecting campus violence, e.g., movement sensor-based methods and video sequence-based methods using sensors and surveillance cameras. In this paper, the authors use image features and acoustic features for campus violence detection. Campus violence data are gathered by role-playing, and 4096-dimension feature vectors are extracted from every 16 frames of video images. The C3D (Convolutional 3D) neural network is used for feature extraction and classification, and an average recognition accuracy of 92.00% is achieved. Mel-frequency cepstral coefficients (MFCCs) are extracted as acoustic features, and three speech emotion databases are involved. The C3D neural network is used for classification, and the average recognition accuracies are 88.33%, 95.00%, and 91.67%, respectively. To solve the problem of evidence conflict, the authors propose an improved Dempster–Shafer (D–S) algorithm. Compared with the existing D–S theory, the improved algorithm increases the recognition accuracy by 10.79%, and the recognition accuracy can ultimately reach 97.00%.
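The evidence-fusion step builds on Dempster's rule of combination. A minimal sketch of the classical rule (not the paper's improved variant) for mass functions over singleton hypotheses, with the video and audio belief values below chosen purely for illustration:

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions whose focal
    elements are disjoint singletons: agreeing masses multiply, and the
    result is renormalized by the non-conflicting mass."""
    hypotheses = set(m1) | set(m2)
    raw = {h: m1.get(h, 0.0) * m2.get(h, 0.0) for h in hypotheses}
    conflict = 1.0 - sum(raw.values())   # mass assigned to contradictions
    if conflict >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    return {h: v / (1.0 - conflict) for h, v in raw.items()}

# Hypothetical branch outputs, not values from the paper.
video = {"violence": 0.8, "normal": 0.2}   # image-branch belief
audio = {"violence": 0.7, "normal": 0.3}   # audio-branch belief
fused = dempster_combine(video, audio)
```

When both modalities lean the same way, fusion sharpens the decision: here the fused belief in "violence" exceeds either branch's belief on its own. The paper's improved algorithm addresses the cases where the two sources conflict strongly, which is exactly where this classical rule behaves poorly.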


2018 ◽  
Vol 7 (2.16) ◽  
pp. 98 ◽  
Author(s):  
Mahesh K. Singh ◽  
A K. Singh ◽  
Narendra Singh

This paper presents an algorithm based on the acoustic analysis of electronically disguised voice. The proposed work gives a comparative analysis of all acoustic features and their statistical coefficients. Acoustic features are computed by the Mel-frequency cepstral coefficient (MFCC) method, and a normal voice is compared with voices disguised by different semitone shifts. All acoustic features are passed through feature-based classifiers to determine the identification rate for each type of electronically disguised voice. Two classifiers, a support vector machine (SVM) and a decision tree (DT), are used for speaker identification, and their classification efficiency on electronically disguised voices at different semitones is evaluated.
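The semitone-based disguise above corresponds to scaling every frequency by a fixed ratio under equal temperament. A short sketch of that relationship (the +4/-4 shift is an example value, not one taken from the paper):

```python
def semitone_ratio(n):
    """Frequency scaling factor for an n-semitone pitch shift under equal
    temperament: 12 semitones (one octave) exactly doubles frequency."""
    return 2.0 ** (n / 12.0)

# A voice raised by +4 semitones has every frequency scaled by ~1.26;
# shifting back down by -4 semitones applies the reciprocal factor.
up4 = semitone_ratio(4)
down4 = semitone_ratio(-4)
```

This is why disguises at larger semitone offsets are easier to detect: the spectral envelope, and hence the MFCCs, are displaced by a correspondingly larger warp.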


2018 ◽  
Vol 29 (1) ◽  
pp. 327-344 ◽  
Author(s):  
Mohit Dua ◽  
Rajesh Kumar Aggarwal ◽  
Mantosh Biswas

Abstract The classical approach to building an automatic speech recognition (ASR) system uses different feature extraction methods at the front end and various parameter classification techniques at the back end. The Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) techniques are the conventional approaches used for many years for feature extraction, and the hidden Markov model (HMM) has been the most obvious selection for feature classification. However, the performance of MFCC-HMM and PLP-HMM-based ASR systems degrades in real-time environments. The proposed work discusses the implementation of a discriminatively trained Hindi ASR system using noise-robust integrated features and a refined HMM model. It sequentially combines MFCC with PLP and MFCC with gammatone-frequency cepstral coefficients (GFCC) to obtain MF-PLP and MF-GFCC integrated feature vectors, respectively. The HMM parameters are refined using a genetic algorithm (GA) and particle swarm optimization (PSO). Discriminative training of the acoustic model using maximum mutual information (MMI) and minimum phone error (MPE) is performed to enhance the accuracy of the proposed system. The results show that discriminative training using MPE with the MF-GFCC integrated feature vector and PSO-HMM parameter refinement gives significantly better results than the other implemented techniques.
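The "sequential combination" of feature streams amounts to concatenating the per-frame coefficient vectors of the two front ends. A minimal sketch with random placeholder features (the frame count and 13-coefficient dimensionality are common defaults assumed here, not the paper's configuration):

```python
import numpy as np
rng = np.random.default_rng(1)

# Hypothetical per-frame features: 13 MFCCs and 13 PLP coefficients computed
# over the same 100 frames of one utterance.
mfcc = rng.normal(size=(100, 13))
plp = rng.normal(size=(100, 13))

# MF-PLP integrated vectors: concatenate the streams frame by frame, so each
# frame is now described by a single 26-dimensional vector.
mf_plp = np.concatenate([mfcc, plp], axis=1)
```

The MF-GFCC vectors would be formed the same way, substituting the GFCC stream for PLP; the HMM back end then treats the concatenated vector as a single observation.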


2019 ◽  
Vol 8 (4) ◽  
pp. 9924-9927

Audio event identification is an emerging research topic to augment the automation of audio tagging, context-based audio event retrieval, audio surveillance, and much more. In this research work, audio event classification for cricket commentary is done using a long short-term memory (LSTM) neural network. Mel-frequency cepstral coefficient (MFCC) features are extracted from the audio commentary and used to train the LSTM neural network. The trained LSTM network is validated, attaining an accuracy of 95%.
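To make the recurrent step concrete, here is one LSTM cell unrolled over a sequence of MFCC-like frames in plain NumPy. The weights are random and the dimensions (13 inputs, 8 hidden units, 20 frames) are illustrative assumptions; a real system would use a trained framework implementation.

```python
import numpy as np
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; the four gates are slices of a stacked projection."""
    z = W @ x + U @ h + b                  # shape (4*H,)
    H = h.shape[0]
    i = sigmoid(z[0:H])                    # input gate
    f = sigmoid(z[H:2 * H])                # forget gate
    g = np.tanh(z[2 * H:3 * H])            # candidate cell state
    o = sigmoid(z[3 * H:4 * H])            # output gate
    c_new = f * c + i * g                  # updated cell state
    h_new = o * np.tanh(c_new)             # updated hidden state
    return h_new, c_new

D, H = 13, 8                               # e.g., 13 MFCCs in, 8 hidden units
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x in rng.normal(size=(20, D)):         # run over 20 consecutive frames
    h, c = lstm_step(x, h, c, W, U, b)
```

The final hidden state summarizes the whole frame sequence and is what a classification head would consume to label the audio event.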

