Mixture linear prediction Gammatone Cepstral features for robust speaker verification under transmission channel noise

2020, Vol 79 (25-26), pp. 18679-18693
Author(s): Ahmed Krobba, Mohamed Debyeche, Sid-Ahmed Selouani
Sensors, 2021, Vol 21 (5), pp. 1888
Author(s): Juraj Kacur, Boris Puterka, Jarmila Pavlovicova, Milos Oravec

Many speech emotion recognition systems have been designed using different features and classification methods, yet there is still a lack of knowledge and reasoning about the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect accuracy, and to what extent. This study extends the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification regions, i.e., their lengths and overlaps), frequency ranges, frequency scales, processing of the whole speech signal (spectrograms), vocal tract signals (filter banks, linear prediction coefficient (LPC) modeling), excitation signals (inverse LPC filtering), magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross-validation, the paired t-test, and rank and Pearson correlations. The results revealed several settings reaching the 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Spectrograms, which carry both vocal tract and excitation information, also scored well. Even basic processing steps such as pre-emphasis, segmentation, and magnitude modifications can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
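As a concrete illustration of the best-scoring setup, the sketch below computes log mel filter-bank features over the 0–8 kHz band from short, windowed, pre-emphasized frames. It is a minimal sketch assuming typical values (25 ms Hamming frames with a 10 ms shift, 24 mel bands, 0.97 pre-emphasis); the study's exact settings are not given in the abstract.

```python
# Sketch of vocal-tract feature extraction of the kind the abstract describes:
# a psychoacoustic (mel) filter bank over 0-8 kHz applied to short, windowed
# frames. Frame length, shift, window type, band count, and pre-emphasis
# coefficient are assumptions, not the study's exact settings.
import numpy as np
import librosa

sr = 16000                                    # 16 kHz sampling covers the 0-8 kHz band
y = np.random.randn(sr)                       # stand-in for one second of speech
y = np.append(y[0], y[1:] - 0.97 * y[:-1])    # pre-emphasis (coefficient assumed)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=512, win_length=400, hop_length=160,  # 25 ms Hamming frames, 10 ms shift (assumed)
    window="hamming",
    n_mels=24, fmin=0.0, fmax=8000.0)           # mel bands spanning 0-8 kHz
log_mel = librosa.power_to_db(mel)              # log-magnitude filter-bank features
print(log_mel.shape)                            # (n_mels, n_frames)
```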


2018, Vol 29 (1), pp. 565-582
Author(s): T.R. Jayanthi Kumari, H.S. Jayanna

In many biometric applications, limited-data speaker verification plays a significant role in practice-oriented systems. The performance of such systems needs to be improved by applying techniques suited to the limited-data condition, in which both the training and test data last only a few seconds. This article demonstrates the importance of feature-level and score-level fusion for speaker verification under the limited-data condition. The baseline system uses vocal tract features, namely mel-frequency cepstral coefficients and linear predictive cepstral coefficients, and excitation source features, namely the linear prediction (LP) residual and the LP residual phase, together with i-vector modeling on the NIST 2003 data set. In feature-level fusion, the vocal tract features are fused with the excitation source features, which reduces the equal error rate (EER) by approximately 4% on average compared with the individual features. Two types of score-level fusion are then demonstrated. In the first, the scores of the vocal tract and excitation source features are fused while the modeling technique remains the same, which reduces the EER by approximately a further 2% compared with feature-level fusion. In the second, the scores of different modeling techniques are combined, which reduces the EER by approximately 4.5% compared with the score-level fusion of different features.
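The score-level fusion described above can be sketched as a weighted sum of normalized per-system scores, with the EER read off at the operating point where the false acceptance and false rejection rates coincide. The z-normalization, equal fusion weights, and synthetic scores below are assumptions for illustration, not the article's exact recipe.

```python
# Minimal sketch of score-level fusion and EER measurement. The z-norm
# normalization, equal fusion weights, and synthetic scores are assumptions.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)               # 1 = target trial, 0 = impostor
s_vocal = labels * 1.5 + rng.normal(size=1000)  # synthetic vocal tract scores
s_excit = labels * 1.0 + rng.normal(size=1000)  # synthetic excitation scores

def znorm(s):
    return (s - s.mean()) / s.std()

fused = 0.5 * znorm(s_vocal) + 0.5 * znorm(s_excit)  # equal-weight sum fusion

def eer(labels, scores):
    far, tpr, _ = roc_curve(labels, scores)     # far = false acceptance rate
    frr = 1.0 - tpr                             # frr = false rejection rate
    i = np.nanargmin(np.abs(far - frr))         # operating point where FAR = FRR
    return (far[i] + frr[i]) / 2.0

for name, s in [("vocal tract", s_vocal), ("excitation", s_excit), ("fused", fused)]:
    print(f"{name}: EER = {eer(labels, s):.3%}")
```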


2018, Vol 7 (2), pp. 123-127
Author(s): S. Sathiamoorthy, R. Ponnusamy, R. Visalakshi

In this paper, we present the performance of a speaker verification system based on features computed from speech recorded with a Close Speaking Microphone (CSM) and a Throat Microphone (TM) in clean and noisy environments. Noise is one of the most challenging problems in speaker verification, and background noise degrades the performance of systems using the CSM. To overcome this, the TM is used: its transducer is held against the throat, producing a clean signal that is unaffected by background noise. Acoustic features are computed by means of Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP). An autoassociative neural network (AANN) is used to model the extracted features and to verify speakers in both clean and noisy environments. A new method is presented for verifying speakers in clean conditions using the combined CSM and TM signals. The verification performance of the proposed combined system is significantly better than that of the CSM alone, owing to the complementary nature of the two microphones. Evaluating the false acceptance rate (FAR) and false rejection rate (FRR) yields an EER of about 1.0% for the combined devices (CSM+TM), corresponding to an overall verification performance of 99% on clean speech.
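The AANN verification idea can be sketched as an autoencoder trained only on a target speaker's feature vectors, with the reconstruction error of a test utterance serving as the verification score. The scikit-learn stand-in below, its layer sizes, and the random feature vectors are assumptions for illustration; the paper's RASTA-PLP front end and actual network topology are not reproduced.

```python
# Illustrative sketch of AANN-style verification: an autoencoder is trained
# on a target speaker's feature vectors, and the reconstruction error of a
# test utterance serves as the (negated) verification score. The MLPRegressor
# stand-in, layer sizes, and random features are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 13))            # stand-in for the target speaker's features
aann = MLPRegressor(hidden_layer_sizes=(20, 8, 20),  # compression layer in the middle
                    activation="tanh", max_iter=2000, random_state=0)
aann.fit(train, train)                        # autoassociative: input maps to itself

def score(features):
    # Lower reconstruction error -> more likely the claimed target speaker.
    err = np.mean((aann.predict(features) - features) ** 2)
    return -err

genuine = rng.normal(size=(50, 13))           # drawn from the training distribution
impostor = rng.normal(loc=1.0, size=(50, 13)) # shifted distribution = different speaker
print("target score:  ", score(genuine))
print("impostor score:", score(impostor))
```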


2011
Author(s): Marcel Kockmann, Luciana Ferrer, Lukáš Burget, Jan Černocký
