Spectral-Warping Based Noise-Robust Enhanced Children ASR System

Author(s):  
Puneet Bawa ◽  
Virender Kadyan ◽  
Vaibhav Kumar ◽  
Ghanshyam Raghuwanshi

Abstract In real-life applications, noise originating from different sound sources modifies the characteristics of an input signal, which affects the development of an enhanced ASR system. This contamination degrades the quality and comprehension of speech variables while impacting the performance of human-machine communication systems. This paper aims to minimise noise challenges by using a robust feature extraction methodology through the introduction of an optimised filtering technique. Initially, the evaluations for enhancing input signals are constructed by using a state transformation matrix and minimising the mean square error based upon the linear time-variant techniques of Kalman and adaptive Wiener filtering. Subsequently, Mel-frequency cepstral coefficient (MFCC), Linear Predictive Cepstral Coefficient (LPCC), RelAtive SpecTrAl-Perceptual Linear Prediction (RASTA-PLP) and Gammatone Frequency Cepstral Coefficient (GFCC) based feature extraction methods have been synthesised and compared in order to derive adequate characteristics of a signal and to handle the large-scale complexities that lie between the training and testing datasets. The acoustic mismatch and linguistic complexity arising from large-scale variations within a small set of speakers have been handled by Vocal Tract Length Normalization (VTLN) based warping of the test utterances. Furthermore, a spectral warping approach has been applied by time-reversing the samples inside a frame and passing them through the filter network corresponding to each frame. Finally, an overall Relative Improvement (RI) of 16.13% has been achieved on the 5-way perturbed spectral-warped noise-augmented dataset through Wiener filtering in comparison to the other systems.
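
As a rough illustration of the filtering front end described above, the following Python sketch applies scipy's local Wiener filter frame by frame before any feature extraction would take place. The frame length, hop, and filter size are arbitrary choices, and the pipeline is not the authors' exact implementation.

```python
# Hypothetical sketch: adaptive Wiener filtering as a denoising front end
# before feature extraction. Not the paper's exact pipeline.
import numpy as np
from scipy.signal import wiener

def denoise_frames(signal, frame_len=400, hop=160, filt_size=29):
    """Apply a local Wiener filter frame by frame and overlap-add the result."""
    out = np.zeros(len(signal))
    counts = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # scipy's wiener() estimates the local mean/variance and attenuates
        # regions dominated by noise.
        clean = wiener(frame, mysize=filt_size)
        out[start:start + frame_len] += clean
        counts[start:start + frame_len] += 1.0
    counts[counts == 0] = 1.0
    return out / counts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 16000, endpoint=False)
    clean = np.sin(2 * np.pi * 220 * t)                  # toy "speech" tone
    noisy = clean + 0.3 * rng.standard_normal(t.shape)   # additive noise
    denoised = denoise_frames(noisy)
    print("noisy SNR :", 10 * np.log10(np.mean(clean**2) / np.mean((noisy - clean)**2)))
    print("wiener SNR:", 10 * np.log10(np.mean(clean**2) / np.mean((denoised - clean)**2)))
```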

2018 ◽  
Vol 29 (1) ◽  
pp. 327-344 ◽  
Author(s):  
Mohit Dua ◽  
Rajesh Kumar Aggarwal ◽  
Mantosh Biswas

Abstract The classical approach to build an automatic speech recognition (ASR) system uses different feature extraction methods at the front end and various parameter classification techniques at the back end. The Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) techniques are the conventional approaches used for many years for feature extraction, and the hidden Markov model (HMM) has been the most obvious selection for feature classification. However, the performance of MFCC-HMM and PLP-HMM-based ASR systems degrades in real-time environments. The proposed work discusses the implementation of a discriminatively trained Hindi ASR system using noise-robust integrated features and a refined HMM model. It sequentially combines MFCC with PLP and MFCC with gammatone-frequency cepstral coefficient (GFCC) to obtain MF-PLP and MF-GFCC integrated feature vectors, respectively. The HMM parameters are refined using a genetic algorithm (GA) and particle swarm optimization (PSO). Discriminative training of the acoustic model using maximum mutual information (MMI) and minimum phone error (MPE) is performed to enhance the accuracy of the proposed system. The results show that discriminative training using MPE with the MF-GFCC integrated feature vector and PSO-HMM parameter refinement gives significantly better results than the other implemented techniques.
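
An "integrated" feature vector of this kind amounts to frame-wise concatenation of two feature streams computed over the same frame grid. The sketch below illustrates this with dummy MFCC and GFCC matrices standing in for the real extractors, which are not reproduced here.

```python
# Hypothetical sketch of integrated features: frame-wise concatenation of two
# feature streams (e.g. MFCC and GFCC). The matrices below are random
# placeholders, not output of the paper's extractors.
import numpy as np

def integrate_features(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Concatenate two (num_frames x dim) feature matrices along the feature axis."""
    n = min(len(feat_a), len(feat_b))        # guard against off-by-one frame counts
    return np.hstack([feat_a[:n], feat_b[:n]])

mfcc = np.random.randn(100, 13)              # dummy 13-dim MFCC over 100 frames
gfcc = np.random.randn(100, 13)              # dummy 13-dim GFCC over 100 frames
mf_gfcc = integrate_features(mfcc, gfcc)
print(mf_gfcc.shape)                         # (100, 26) -> one 26-dim vector per frame
```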


Author(s):  
Gurpreet Kaur ◽  
Mohit Srivastava ◽  
Amod Kumar

Huge growth is observed in the speech and speaker recognition field due to the many artificial intelligence algorithms being applied. Speech is used to convey messages via the language being spoken, emotions, gender and speaker identity. Many real applications in healthcare are based upon speech and speaker recognition, e.g. a voice-controlled wheelchair that lets the user steer the chair by voice. In this paper, we use a genetic algorithm (GA) for combined speaker and speech recognition, relying on optimized Mel Frequency Cepstral Coefficient (MFCC) speech features, and classification is performed using a Deep Neural Network (DNN). In the first phase, feature extraction using MFCC is executed. Then, feature optimization is performed using the GA. In the second phase, training is conducted using the DNN. Evaluation and validation of the proposed model are done in a real environment, and efficiency is calculated on the basis of parameters such as accuracy, precision rate, recall rate, sensitivity, and specificity. This paper also presents an evaluation of feature extraction methods such as the linear predictive coding coefficient (LPCC), perceptual linear prediction (PLP), mel frequency cepstral coefficients (MFCC) and relative spectra filtering (RASTA), all of them used for combined speaker and speech recognition systems. A comparison of different methods based on existing techniques for both clean and noisy environments is made as well.
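
A hedged sketch of GA-based feature optimization is shown below: binary masks select MFCC coefficients, and a simple class-separation score stands in for the paper's DNN-based objective. All data, labels, and parameter values are placeholders.

```python
# Minimal GA-based feature selection over a dummy MFCC feature matrix.
# The fitness function is a stand-in (Fisher-style separation score),
# not the DNN-based objective used in the paper.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 39))            # dummy 39-dim MFCC + deltas
y = rng.integers(0, 2, 200)                   # dummy binary labels

def fitness(mask):
    if mask.sum() == 0:
        return -np.inf
    Xs = X[:, mask.astype(bool)]
    # Distance between class means over pooled variance.
    m0, m1 = Xs[y == 0].mean(0), Xs[y == 1].mean(0)
    return float(np.sum((m0 - m1) ** 2) / (Xs.var(0).sum() + 1e-8))

pop = rng.integers(0, 2, (30, X.shape[1]))    # population of binary selection masks
for gen in range(50):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]                 # selection: keep best 10
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(0, 10, 2)]
        cut = rng.integers(1, X.shape[1])
        child = np.concatenate([a[:cut], b[cut:]])          # one-point crossover
        flip = rng.random(X.shape[1]) < 0.02                # bit-flip mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    pop = np.array(children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected", int(best.sum()), "of", X.shape[1], "coefficients")
```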


Author(s):  
Samuel K. Shimp ◽  
Steve C. Southward ◽  
Mehdi Ahmadian

This paper proposes a solution for improving the safety of rail and other mass transportation systems through operator alertness monitoring. A non-invasive method of alertness monitoring through speech processing is presented. Speech analysis identifies measurable vocal tract changes due to fatigue and decreased speech rate due to decreased mental ability. Enabled by existing noise reduction technology, a system has been designed for measuring key speech features that are believed to correlate to alertness level. The features of interest are pitch, word intensity, pauses between words and phrases, and word rate. The purpose of this paper is to describe the overall alertness monitoring system design and then to show some experimental results for the core processing algorithm which extracts features from the speech. The feature extraction algorithm proposed here uses a new and simple technique to parse the continuous speech signal coming from the communication signal without using computationally demanding and error-prone word recognition techniques. Preliminary results on the core feature extraction algorithm indicate that words, phrases, and rates can be determined for relatively noise-free speech signals. Once the remainder of the overall alertness monitoring system is complete, it will be applied to real-life recordings of train operators and will be subjected to clinical testing to determine alert and non-alert levels of the speech features of interest.
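
One plausible way to parse continuous speech into words and pauses without word recognition is a short-time energy threshold, sketched below on a synthetic signal. The thresholds and dummy data are purely illustrative; this is not the authors' algorithm.

```python
# Illustrative sketch (not the paper's algorithm): segmenting a speech signal
# into high-energy regions ("words") and pauses with a short-time energy
# threshold, then deriving a simple word-rate feature.
import numpy as np

def short_time_energy(x, frame_len=400, hop=160):
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    return np.array([np.sum(f ** 2) for f in frames])

def word_segments(energy, thresh_ratio=0.1, min_frames=3):
    """Return (start_frame, end_frame) pairs where energy stays above threshold."""
    thresh = thresh_ratio * energy.max()
    active = energy > thresh
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_frames:
        segments.append((start, len(active)))
    return segments

if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(2)
    sig = np.concatenate([rng.standard_normal(8000),          # "word"
                          0.01 * rng.standard_normal(4000),   # pause
                          rng.standard_normal(8000)])         # "word"
    segs = word_segments(short_time_energy(sig))
    duration_s = len(sig) / fs
    print("words:", len(segs), " word rate:", len(segs) / duration_s, "per second")
```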


2015 ◽  
Vol 2015 ◽  
pp. 1-10
Author(s):  
Khaled Daqrouq ◽  
Rami Al-Hmouz ◽  
Abdullah Saeed Balamash ◽  
Naif Alotaibi ◽  
Elmar Noeth

In this paper, an average framing linear prediction coding (AFLPC) method for a text-independent speaker identification system is studied. AFLPC was proposed in our previous work. Generally, linear prediction coding (LPC) has been used in numerous speech recognition tasks. Here, the investigation focused on the behaviour of the AFLPC speaker recognition system in a noisy environment. In the feature extraction stage, the speaker-specific resonances of the vocal tract were extracted using the AFLPC technique. In the classification phase, a probabilistic neural network (PNN) and a Bayesian classifier (BC) were applied for comparison. The performance of different wavelet transforms combined with the AFLPC technique was also compared. In addition, the capability of the proposed system was analysed and compared with that of other systems suggested in the literature. The experimental results in a noisy environment show that the PNN classifier performs better with the fusion of wavelets and AFLPC as the feature extraction technique, termed WFALPCF.
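
LPC is the building block underneath AFLPC. The sketch below shows a generic autocorrelation-method LPC computation with the Levinson-Durbin recursion on one windowed frame; the averaged-framing step and the wavelet fusion of AFLPC/WFALPCF are not reproduced here.

```python
# Generic LPC via the autocorrelation method and Levinson-Durbin recursion.
# Shown only to illustrate the building block behind AFLPC, not the
# averaged-framing scheme itself.
import numpy as np

def lpc(frame: np.ndarray, order: int) -> np.ndarray:
    """Return predictor coefficients a[1..order] so that x[n] ~ sum_i a[i] * x[n-i]."""
    # Autocorrelation lags r[0..order]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for the (i+1)-th order model
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]     # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k ** 2)                 # prediction error shrinks each order
    return a

frame = np.hamming(400) * np.random.randn(400)   # one windowed dummy frame
print(lpc(frame, order=12))
```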


2020 ◽  
Vol 9 (1) ◽  
pp. 2431-2435

ASR is the use of software- and hardware-based techniques to identify and process the human voice. In this research, Tamil words are analyzed and segmented into syllables, followed by feature extraction and recognition. Syllables are segmented using short-term energy (STE), and segmentation is done in order to minimize the corpus size. The syllable segmentation algorithm works by computing the STE function of the continuous speech signal. The proposed approach for speech recognition uses a combination of Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC). MFCC features are used to extract a feature vector containing the information about the linguistic message. LPC affords a robust, dependable and accurate technique for estimating the parameters that characterize the vocal tract system. LPC features can also reduce the bit rate of speech (i.e. the size of the transmitted signal), so the combined feature extraction technique minimizes the size of the transmitted signal. The proposed feature extraction algorithm is then evaluated on the speech corpus using the Random Forest approach. Random Forest is an effective algorithm that can build a reliable training model with a short training time, because each classifier works on a subset of the features alone.
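
The sketch below combines two dummy per-syllable feature matrices, standing in for the MFCC and LPC extraction, and trains a scikit-learn Random Forest on placeholder labels; because each split considers only a random subset of features (max_features="sqrt"), training stays cheap. The data and class labels are invented for illustration only.

```python
# Hedged sketch: a combined MFCC+LPC feature vector fed to a Random Forest.
# Random matrices stand in for real features extracted from segmented
# Tamil syllables; labels are dummy classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
mfcc = rng.standard_normal((500, 13))     # one row per segmented syllable
lpc = rng.standard_normal((500, 12))
X = np.hstack([mfcc, lpc])                # combined MFCC + LPC feature vector
y = rng.integers(0, 10, 500)              # dummy syllable class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_tr, y_tr)
print("accuracy on held-out dummy data:", clf.score(X_te, y_te))
```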


Author(s):  
Tobias Frankiewicz ◽  
Alexander Burmeister

Cooperative intelligent transport systems (C-ITS) based on Vehicle2X (V2X) communication are currently under development in the automotive industry and are expected to reach mass production in the near future. In order to develop and test cooperative ITS services, the Institute of Transportation Systems of the German Aerospace Center (DLR) operates a large-scale test site in the city of Braunschweig, Germany. This research infrastructure facilitates testing, measurement and evaluation activities for C-ITS in a real-life environment.


2020 ◽  
Vol 39 (6) ◽  
pp. 8823-8830
Author(s):  
Jiafeng Li ◽  
Hui Hu ◽  
Xiang Li ◽  
Qian Jin ◽  
Tianhao Huang

Under the influence of COVID-19, the economic benefits of shale gas development have been greatly affected. With the large-scale development and utilization of shale gas in China, it is increasingly important to assess the economic impact of shale gas development. Therefore, this paper proposes a method for predicting the production of shale gas reservoirs, using a back propagation (BP) neural network to nonlinearly fit reservoir reconstruction data and obtain shale gas well production forecasting models. Experiments show that, compared with the traditional BP neural network, the proposed method can effectively improve the accuracy and stability of the prediction. There is a nonlinear correlation between reservoir reconstruction data and gas well production, which traditional linear prediction methods cannot capture.
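
As a rough illustration of fitting such a nonlinear production relationship with a back-propagation network, the sketch below trains a small scikit-learn MLP on synthetic data. The reservoir-reconstruction attributes are invented, and the paper's improvements to the BP network are not reproduced.

```python
# Illustrative only: a small back-propagation network (MLP) fitting a
# nonlinear mapping from dummy reservoir-reconstruction attributes to
# gas-well production. Not the paper's model or data.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (300, 5))            # placeholder attributes (e.g. stages, fluid volume)
y = 2.0 * X[:, 0] * X[:, 1] + np.sin(3 * X[:, 2]) + 0.1 * rng.standard_normal(300)

X_std = StandardScaler().fit_transform(X)  # normalize inputs before training
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
model.fit(X_std, y)
print("R^2 on training data:", model.score(X_std, y))
```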


2020 ◽  
Author(s):  
Anusha Ampavathi ◽  
Vijaya Saradhi T

Big data and its approaches are generally helpful for the healthcare and biomedical sectors in predicting disease. For minor symptoms, it is difficult to consult a doctor at the hospital at any time, so big data provides essential information about diseases on the basis of the patient's symptoms. For several medical organizations, disease prediction is important for making the best feasible health care decisions. Conversely, the conventional medical care model offers structured input, which calls for more accurate and consistent prediction. This paper develops multi-disease prediction using an improvised deep learning concept. Here, different datasets pertaining to "Diabetes, Hepatitis, lung cancer, liver tumor, heart disease, Parkinson's disease, and Alzheimer's disease" are gathered from the benchmark UCI repository for conducting the experiment. The proposed model involves three phases: (a) data normalization, (b) weighted normalized feature extraction, and (c) prediction. Initially, the dataset is normalized in order to bring the attributes into a common range. Further, weighted feature extraction is performed, in which a weight function is multiplied with each attribute value in order to emphasize large-scale deviations. Here, the weight function is optimized using a combination of two meta-heuristic algorithms termed the Jaya Algorithm-based Multi-Verse Optimization algorithm (JA-MVO). The optimally extracted features are subjected to hybrid deep learning algorithms, namely a "Deep Belief Network (DBN) and Recurrent Neural Network (RNN)". As a modification to the hybrid deep learning architecture, the weights of both the DBN and RNN are optimized using the same hybrid optimization algorithm. Further, the comparative evaluation of the proposed prediction over the existing models certifies its effectiveness through various performance measures.
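
The two preprocessing phases can be sketched as follows, with a fixed random weight vector standing in for the JA-MVO-optimized weights and dummy attributes in place of the UCI data.

```python
# Sketch of data normalization followed by weighted feature extraction.
# The weight vector here is random; in the paper it is optimized by JA-MVO,
# which is not reproduced.
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale every attribute into the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi - lo == 0, 1.0, hi - lo)

def weighted_features(X_norm: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Multiply each attribute by its weight to stretch informative deviations."""
    return X_norm * w

rng = np.random.default_rng(5)
X = rng.uniform(0, 200, (100, 8))          # dummy clinical attributes
w = rng.uniform(0.5, 2.0, 8)               # placeholder for optimized weights
features = weighted_features(min_max_normalize(X), w)
print(features.shape, float(features.min()), float(features.max()))
```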


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1888
Author(s):  
Juraj Kacur ◽  
Boris Puterka ◽  
Jarmila Pavlovicova ◽  
Milos Oravec

Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, and to what extent. This study extends the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification region lengths and overlaps), frequency ranges, frequency scales, processing of whole speech (spectrograms), vocal tract (filter banks, linear prediction coefficient (LPC) modeling) and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross validation, the paired t-test, and rank and Pearson correlations. The results revealed several settings in a 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Spectrograms carrying vocal tract and excitation information also score well. It was found that even basic processing such as pre-emphasis, segmentation, and magnitude modifications can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
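
A minimal sketch of the front-end choices discussed above (pre-emphasis, framing with a selectable window type, length, and overlap, and magnitude spectra) is given below. The parameter values are illustrative defaults, not the settings found best in the study.

```python
# Minimal front-end sketch: pre-emphasis, framing with a selectable window
# type/length/overlap, and magnitude spectra (spectrogram rows).
# Values are illustrative defaults only.
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=400, overlap=0.5, window="hamming"):
    hop = int(frame_len * (1 - overlap))
    win = np.hamming(frame_len) if window == "hamming" else np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([win * x[s:s + frame_len] for s in starts])

sig = np.random.randn(16000)                   # 1 s of dummy audio at 16 kHz
frames = frame_signal(pre_emphasis(sig), frame_len=400, overlap=0.5)
spectra = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectra per frame
print(frames.shape, spectra.shape)
```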


2021 ◽  
Vol 55 (1) ◽  
pp. 1-2
Author(s):  
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is developing the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To efficiently retrieve from large collections, we develop a framework to incorporate query term independence [Mitra et al., 2019] into any arbitrary deep model, which enables large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
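
The query-term-independence idea can be illustrated with a toy example: if each query term is scored against a document independently, per-term scores can be precomputed into an inverted index and ranking reduces to summing postings at query time. The scoring function below is a trivial term-count stand-in, not the neural models developed in the thesis.

```python
# Toy illustration of query term independence with an inverted index:
# per-term document scores are precomputed offline, and query-time ranking
# is a sum over postings lists. The scorer is a trivial stand-in.
from collections import defaultdict

docs = {
    "d1": "neural ranking models for web search",
    "d2": "inverted index data structures",
    "d3": "neural networks for speech recognition",
}

def term_doc_score(term: str, doc_text: str) -> float:
    # Stand-in for a precomputed neural term-document score.
    return float(doc_text.split().count(term))

# Offline: precompute per-term postings (term -> list of (doc_id, score)).
index = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):
        index[term].append((doc_id, term_doc_score(term, text)))

def retrieve(query: str, top_k: int = 3):
    totals = defaultdict(float)
    for term in query.split():
        for doc_id, score in index.get(term, []):   # only touch matching postings
            totals[doc_id] += score                 # independence -> simple sum
    return sorted(totals.items(), key=lambda kv: -kv[1])[:top_k]

print(retrieve("neural search"))
```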

