Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1888
Author(s):  
Juraj Kacur ◽  
Boris Puterka ◽  
Jarmila Pavlovicova ◽  
Milos Oravec

Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, and to what extent. This study extends the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification regions, i.e., their lengths and overlaps), frequency ranges, frequency scales, processing of whole speech (spectrograms), vocal tract (filter banks, linear prediction coefficient (LPC) modeling) and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross-validation, the paired t-test, and rank and Pearson correlations. The results revealed several settings in the 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0-8 kHz frequency range. Spectrograms, which carry both vocal tract and excitation information, also score well. It was found that even basic processing choices such as pre-emphasis, segmentation, and magnitude modifications can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
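As a hedged illustration of the best-scoring configuration above (vocal tract features from a psychoacoustic filter bank over 0-8 kHz, with pre-emphasis and short-window segmentation), a minimal Python sketch follows; the file name, librosa calls, and all parameter values are assumptions for demonstration, not the authors' exact pipeline.

```python
# Mel (psychoacoustic) filter-bank features over 0-8 kHz, sketched from the
# abstract. File name and parameters are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # 16 kHz -> 0-8 kHz band
y = np.append(y[0], y[1:] - 0.97 * y[:-1])        # pre-emphasis (shown to matter)

# 25 ms Hamming windows with a 10 ms hop, 40-band mel filter bank
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160,
    window="hamming", n_mels=40, fmin=0, fmax=8000)
feats = librosa.power_to_db(mel)                  # log-magnitude compression
print(feats.shape)                                # (40, n_frames)
```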


2005 ◽  
Vol 83 (7) ◽  
pp. 721-737
Author(s):  
H Teffahi ◽  
B Guerin ◽  
A Djeradi

Knowledge of vocal tract area functions is important for understanding the phenomena occurring during speech production. We present here a new measurement method based on external excitation of the vocal tract with a known pseudo-random sequence, where the area function is obtained by a linear prediction analysis applied to the cross-correlation between the sequence and the signal measured at the lips. The advantages of this method over methods based on swept-tone or white-noise excitation are (1) a much shorter measurement time (about 100 ms) and (2) the possibility of producing speech sounds during the measurement. The method has been checked against classical methods through systematic comparisons on a small corpus of vowels. Moreover, it has been verified that simultaneous speech sound production does not significantly perturb the measurements. This method should thus be a very helpful tool for investigating the acoustic properties of the vocal tract for vowels under a variety of conditions.
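The core of the measurement can be sketched as follows: cross-correlating the output of a system with a maximum-length-sequence (MLS) excitation approximates its impulse response, to which an LPC model is then fitted. The stand-in "vocal tract" filter and all orders below are assumptions for illustration.

```python
# MLS excitation -> cross-correlation -> LPC, on a simulated "vocal tract"
# (a low-order all-pole filter standing in for the real system).
import numpy as np
from scipy.signal import max_len_seq, lfilter
import librosa

mls, _ = max_len_seq(12)                  # 4095-sample pseudo-random sequence
x = 2.0 * mls - 1.0                       # map {0, 1} -> {-1, +1}

a_true = [1.0, -1.2, 0.8]                 # assumed stand-in tract filter
y = lfilter([1.0], a_true, x)             # signal "measured at the lips"

# The MLS autocorrelation is nearly a delta, so cross-correlating the output
# with the input recovers the impulse response of the tract.
n = len(x)
h = np.correlate(y, x, mode="full")[n - 1:n - 1 + 256] / n

a_est = librosa.lpc(h, order=2)           # all-pole fit; approximates a_true
print(a_est)
```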


2018 ◽  
Vol 29 (1) ◽  
pp. 565-582
Author(s):  
T.R. Jayanthi Kumari ◽  
H.S. Jayanna

Abstract In many biometric applications, limited data speaker verification plays a significant role in practice-oriented systems. The performance of a speaker verification system needs to be improved by applying suitable techniques to the limited data condition, where both the training and test data last only a few seconds. This article shows the importance of feature- and score-level fusion techniques for speaker verification under the limited data condition. The baseline system uses vocal tract features, namely mel-frequency cepstral coefficients and linear predictive cepstral coefficients, and excitation source features, namely the linear prediction residual and linear prediction residual phase, along with i-vector modeling on the NIST 2003 data set. In feature-level fusion, the vocal tract features are fused with the excitation source features. As a result, the average equal error rate (EER) is approximately 4%, an improvement over the individual feature performances. Further, two different types of score-level fusion are demonstrated. In the first case, the scores of vocal tract and excitation source features are fused while the modeling technique remains the same, which provides an average reduction of approximately 2% EER compared to the feature-level fusion performance. In the second case, the scores of different modeling techniques are combined, which results in an EER reduction of approximately 4.5% compared with the score-level fusion of different features.
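A minimal sketch of the two fusion levels discussed above, with placeholder features and scores (the weights, dimensions, and data are assumptions; a real system would score i-vector models trained on each stream):

```python
# Placeholder demonstration of feature-level vs. score-level fusion.
import numpy as np

# Feature-level fusion: stack per-frame vocal tract (MFCC) and excitation
# (LP residual phase) features before modeling.
mfcc = np.random.randn(13, 100)                     # 13 coeffs x 100 frames
residual_phase = np.random.randn(13, 100)
fused_features = np.vstack([mfcc, residual_phase])  # 26 x 100

# Score-level fusion: weighted sum of the separate systems' trial scores.
def fuse_scores(scores, weights):
    scores, weights = np.asarray(scores), np.asarray(weights)
    return float(np.dot(weights / weights.sum(), scores))

print(fused_features.shape)
print(fuse_scores([1.8, 2.3], weights=[0.6, 0.4]))  # fused trial score
```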


Author(s):  
Radhika Rani L ◽  
S. Chandra Lingam ◽  
Anjaneyulu T ◽  
Satyanarayana K

Congenital Heart Defects (CHD) are critical heart disorders that can be observed at the birth stage of infants. They are classified mainly into two types, Cyanotic and Acyanotic. The present paper concentrates on Acyanotic heart disorders. An Acyanotic heart disorder cannot be observed on an external checkup, whereas bluish skin indicates an infant affected with a Cyanotic disorder. Acyanotic heart disorders can only be diagnosed using a chest X-ray, ECG, echocardiogram, cardiac catheterization, or MRI of the heart. The present work aims at estimating the fundamental frequency (pitch) and the vocal tract resonant frequencies (formants) from the cry signal of infants. The pitch and formant frequencies are estimated using frequency-domain (cepstrum) and linear predictive coding (LPC) methods. The results show that the fundamental frequency of the cry signal was between 600 Hz and 800 Hz for infants with Acyanotic heart disorders. This fundamental frequency helps in identifying Acyanotic heart disorders at an early stage.
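A minimal sketch of cepstrum-based pitch estimation of the kind described, run on a synthetic 700 Hz stand-in for a cry frame (the waveform, sampling rate, and search band are assumptions for illustration):

```python
# Cepstral pitch estimation on a synthetic 700 Hz "cry" frame.
import numpy as np

sr = 8000
t = np.arange(1024) / sr
frame = np.sign(np.sin(2 * np.pi * 700 * t)) * np.hamming(1024)

# Real cepstrum: inverse FFT of the log magnitude spectrum
ceps = np.fft.irfft(np.log(np.abs(np.fft.rfft(frame)) + 1e-10))

# Peak quefrency within a 400-1000 Hz pitch band gives the pitch period
qmin, qmax = sr // 1000, sr // 400            # lags of 8..20 samples
lag = qmin + np.argmax(ceps[qmin:qmax])
print(f"estimated pitch: {sr / lag:.0f} Hz")  # ~727 Hz for the 700 Hz input
```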


2019 ◽  
Vol 2019 ◽  
pp. 1-8
Author(s):  
Khaled Daqrouq ◽  
Abdel-Rahman Al-Qawasmi ◽  
Ahmed Balamesh ◽  
Ali S. Alghamdi ◽  
Mohamed A. Al-Amoudi

Speech parameters may include perturbation measurements, spectral and cepstral modeling, and the pathological effects of diseases, like influenza, that affect the vocal tract. The verification task is a good way to discriminate between different types of voice disorder. This study investigated the modeling of influenza's pathological effects on the speech signals of the Arabic vowels "A" and "O." For feature extraction, linear prediction coding (LPC) of discrete wavelet transform (DWT) subsignals, denoted LPCW, was used. k-nearest neighbor (KNN) and support vector machine (SVM) classifiers were used for classification. To study the pathological effects of influenza on the vowels "A" and "O," the power spectral density (PSD) and spectrogram were examined; the PSD of both "A" and "O" was suppressed as a result of the pathological effects. The obtained results showed that the verification parameters achieved for the vowel "A" were on average better than those for the vowel "O" for both KNN and SVM. The receiver operating characteristic curve was used for interpretation. Modeling with whole-word speech utterances was also investigated; words could model the influenza disease with good verification parameters, though with slightly lower performance than the vowel "A." A comparison with a state-of-the-art method was made; the best results were achieved by the LPCW method.
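A rough sketch of the LPCW idea, assuming a db4 wavelet, three decomposition levels, and LPC order 10 (none of which are confirmed by the abstract): decompose the vowel with a DWT and concatenate the LPC coefficients of each subsignal into one feature vector.

```python
# LPCW sketch: LPC coefficients of DWT subsignals as the feature vector.
import numpy as np
import pywt
import librosa

sr = 8000
t = np.arange(sr) / sr
vowel = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 450 * t)  # stand-in /A/

subsignals = pywt.wavedec(vowel, "db4", level=3)        # [cA3, cD3, cD2, cD1]
features = np.concatenate(
    [librosa.lpc(s, order=10)[1:] for s in subsignals])  # drop the leading 1
print(features.shape)                                    # (40,) LPCW vector
```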


2012 ◽  
Vol 2012 ◽  
pp. 1-8 ◽  
Author(s):  
Mousmita Sarma ◽  
Kandarpa Kumar Sarma

In spoken word recognition, one of the crucial tasks is to identify the vowel phonemes. This paper describes an Artificial Neural Network (ANN) based algorithm developed for the segmentation and recognition of the vowel phonemes of the Assamese language from words containing those vowels. A Self-Organizing Map (SOM), trained for varying numbers of iterations, is used to segment a word into its constituent phonemes. A Probabilistic Neural Network (PNN), trained with clean vowel phonemes, is then used to recognize the vowel segment among the six SOM-segmented phonemes. An important aspect of the proposed algorithm is that it validates the recognized vowel by checking its first formant frequency. The first formant frequency of every Assamese vowel is predetermined by estimating the pole or formant locations from a linear prediction (LP) model of the vocal tract. The proposed algorithm shows high recognition performance in comparison to conventional Discrete Wavelet Transform (DWT) based segmentation.
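The formant-based validation step can be sketched as follows: fit an LP model, solve for its pole locations, and take the lowest sharp resonance as F1. The LPC order, pole-magnitude threshold, and synthetic test frame below are assumptions, not details from the paper.

```python
# F1 validation sketch: LPC poles -> resonance frequencies.
import numpy as np
import librosa

def first_formant(frame, sr, order=8):
    a = librosa.lpc(frame, order=order)             # all-pole tract model
    roots = np.roots(a)
    # keep sharp, positive-frequency resonances only
    roots = roots[(np.imag(roots) > 0) & (np.abs(roots) > 0.9)]
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[0] if len(freqs) else None

sr = 8000
t = np.arange(512) / sr
frame = np.sin(2 * np.pi * 700 * t) * np.hamming(512)  # synthetic "vowel"
print(first_formant(frame, sr))                     # expected near 700 Hz
```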


2006 ◽  
Vol 12 (1) ◽  
pp. 50-55
Author(s):  
Povilas Treigys ◽  
Antanas Lipeika

The problem of speaker identification is investigated. Basic segments, i.e., pseudo-stationary intervals of voiced sounds, are used for identification. Identification is carried out by comparing average distances between an investigated recording and comparative ones. The coefficients of a linear prediction (LPC) model of the vocal tract are used as identification features. Such a problem arises in stenographic practice, where it is important to know who is speaking; identification must also be fast enough not to disturb the stenographer's work. The clustered parameter data are investigated by evaluating the performance of the speaker identification method with respect to computational time and the number of errors.
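A hedged sketch of the distance-based identification scheme, assuming Euclidean distances between per-frame LPC coefficient vectors (the abstract does not specify the metric, frame setup, or clustering, so all of those are placeholders):

```python
# Average-distance speaker identification over per-frame LPC vectors.
import numpy as np
import librosa

def lpc_frames(y, sr, order=12, frame=400, hop=200):
    """LPC coefficient vectors from (assumed pseudo-stationary) frames."""
    frames = librosa.util.frame(y, frame_length=frame, hop_length=hop)
    return np.array([librosa.lpc(f.copy(), order=order)[1:] for f in frames.T])

def avg_distance(investigated, comparative):
    """Mean distance from each investigated frame to its closest comparative."""
    d = np.linalg.norm(
        investigated[:, None, :] - comparative[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# The identified speaker is the comparative with the smallest average distance.
query = lpc_frames(np.random.randn(8000), 8000)     # stand-in recordings
ref = lpc_frames(np.random.randn(8000), 8000)
print(avg_distance(query, ref))
```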


2021 ◽  
Author(s):  
Puneet Bawa ◽  
Virender Kadyan ◽  
Vaibhav Kumar ◽  
Ghanshyam Raghuwanshi

Abstract In real-life applications, noise originating from different sound sources modifies the characteristics of the input signal, which hinders the development of an enhanced ASR system. This contamination degrades the quality and comprehension of speech variables while impacting the performance of human-machine communication systems. This paper aims to minimise noise challenges through a robust feature extraction methodology built on an optimised filtering technique. Initially, the input signals are enhanced by using a state transition matrix and minimising a mean square error, based on the linear time-variant techniques of Kalman and adaptive Wiener filtering. Mel-frequency cepstral coefficient (MFCC), Linear Predictive Cepstral Coefficient (LPCC), RelAtive SpecTrAl Perceptual Linear Prediction (RASTA-PLP) and Gammatone Frequency Cepstral Coefficient (GFCC) feature extraction methods are then compared for their efficiency in deriving adequate signal characteristics, which also addresses the large-scale training complexities that lie between the training and testing datasets. The acoustic mismatch and linguistic complexity arising from large-scale variations within a small set of speakers are handled by Vocal Tract Length Normalization (VTLN) based warping of the test utterances. Furthermore, a spectral warping approach is applied by time-reversing the samples inside a frame and passing them through the filter network corresponding to each frame. Finally, an overall Relative Improvement (RI) of 16.13% is achieved on the 5-way perturbed, spectrally warped, noise-augmented dataset through Wiener filtering in comparison to the other systems.
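As a hedged illustration of the Wiener-filter front end, the sketch below denoises a synthetic signal with SciPy's local adaptive Wiener filter before extracting MFCCs; this is a stand-in for the paper's Kalman/adaptive-Wiener formulation, and the signal, window size, and feature settings are assumptions.

```python
# Wiener-filter front end before MFCC extraction (stand-in formulation).
import numpy as np
from scipy.signal import wiener
import librosa

sr = 16000
clean = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # stand-in utterance
noisy = clean + 0.3 * np.random.randn(sr)              # additive noise

denoised = wiener(noisy, mysize=29)                    # local adaptive Wiener
mfcc = librosa.feature.mfcc(y=denoised.astype(np.float32), sr=sr, n_mfcc=13)
print(mfcc.shape)                                      # (13, n_frames)
```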

