Improvement of Speaker Identification by Combining Prosodic Features with Acoustic Features

Author(s):  
Rong Zheng ◽  
Shuwu Zhang ◽  
Bo Xu
Author(s):  
Halim Sayoud ◽  
Siham Ouamour

Most existing speaker recognition systems use state-of-the-art acoustic features. Often, however, a speaker can be recognized only by his or her prosodic features, especially the accent. For this reason, the authors investigate several pertinent prosodic features that can be combined with classic acoustic features in order to improve recognition accuracy. The authors have developed a new prosodic model using a modified Learning Vector Quantization algorithm, called MLVQ (Modified LVQ). This model is composed of three reduced prosodic features: mean pitch, original duration, and low-frequency energy. Since these features are heterogeneous, a new optimized metric has been proposed, called the Optimized Distance for Heterogeneous Features (ODHEF). Speaker identification tests are performed on an Arabic corpus because the NIST evaluations showed that speaker verification scores depend on the spoken language, and that some of the worst scores were obtained for Arabic. Experimental results show good performance for the new prosodic approach.
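The abstract does not give the ODHEF formula itself; as a rough illustration of the underlying idea of a distance over heterogeneous prosodic features (mean pitch, duration, low-frequency energy), one can normalize each dimension by its own scale before combining, so that no single feature dominates. The function name, scales, and weights below are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def heterogeneous_distance(x, y, scales, weights=None):
    """Distance between two heterogeneous prosodic feature vectors.

    Each dimension (e.g. mean pitch, duration, low-frequency energy)
    is divided by its own scale so no single feature dominates.
    Illustrative only: the paper's exact ODHEF formula is not given here.
    """
    x, y, scales = map(np.asarray, (x, y, scales))
    w = np.ones_like(x, dtype=float) if weights is None else np.asarray(weights)
    diff = (x - y) / scales          # per-feature normalization
    return float(np.sqrt(np.sum(w * diff**2)))

# Two speakers' prosodic vectors: [mean pitch (Hz), duration (s), LF energy]
a = [180.0, 0.42, 0.30]
b = [120.0, 0.55, 0.25]
scales = [50.0, 0.2, 0.1]            # assumed typical spread of each feature
print(heterogeneous_distance(a, b, scales))
```

With the per-feature scaling, a 60 Hz pitch difference and a 0.13 s duration difference contribute comparably, which is the point of an optimized metric for heterogeneous features.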


2018 ◽  
Vol 7 (2.16) ◽  
pp. 98 ◽  
Author(s):  
Mahesh K. Singh ◽  
A K. Singh ◽  
Narendra Singh

This paper presents an algorithm based on acoustic analysis of electronically disguised voice. The proposed work gives a comparative analysis of the acoustic features and their statistical coefficients. Acoustic features are computed by the Mel-frequency cepstral coefficient (MFCC) method, and normal voices are compared with voices disguised by different semitone shifts. All acoustic features are passed through feature-based classifiers to determine the identification rate for each type of electronically disguised voice. Two classifiers, a support vector machine (SVM) and a decision tree (DT), are used for speaker identification and compared in terms of classification efficiency on voices electronically disguised by different semitone shifts.
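A hedged sketch of the classification setup described above: MFCC-style feature vectors fed to SVM and decision-tree classifiers. The features here are synthetic stand-ins (random vectors with shifted means), not real MFCCs, and the hyperparameters are arbitrary:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in MFCC statistics: 13 coefficients per utterance.
# Class 0 = normal voice, class 1 = electronically disguised (pitch-shifted);
# the shifted mean crudely mimics how semitone shifts move the cepstrum.
normal = rng.normal(0.0, 1.0, size=(40, 13))
disguised = rng.normal(1.5, 1.0, size=(40, 13))
X = np.vstack([normal, disguised])
y = np.array([0] * 40 + [1] * 40)

svm = SVC(kernel="rbf").fit(X, y)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

print("SVM training accuracy:", svm.score(X, y))
print("DT  training accuracy:", tree.score(X, y))
```

In practice the MFCC vectors would come from framed, windowed speech; a held-out test split (rather than training accuracy) would be used to compare the two classifiers' efficiency.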


2017 ◽  
Vol 29 (1) ◽  
pp. 59-71 ◽  
Author(s):  
Karim Youssef ◽  
Katsutoshi Itoyama ◽  
Kazuyoshi Yoshii

[Figure: Efficient mobile speaker tracking] This paper jointly addresses the tasks of speaker identification and localization with binaural signals. The proposed system operates in noisy and echoic environments and requires limited computation. It demonstrates that simultaneous identification and localization can benefit from a common signal processing front end for feature extraction. Moreover, joint exploitation of the identity and position estimates allows each to limit the other's errors. Equivalent rectangular bandwidth frequency cepstral coefficients (ERBFCC) and interaural level differences (ILD) are extracted. These acoustic features are used for speaker identity and azimuth estimation, respectively, through artificial neural networks (ANNs). The system was evaluated in simulated and real environments, with still and mobile speakers. Results demonstrate its ability to produce accurate estimates in the presence of noise and reflections. Moreover, the advantage of the binaural context over the monaural context for speaker identification is shown.


2021 ◽  
Vol 11 (10) ◽  
pp. 1344
Author(s):  
Viviana Mendoza Ramos ◽  
Anja Lowit ◽  
Leen Van den Steen ◽  
Hector Arturo Kairuz Hernandez-Diaz ◽  
Maria Esperanza Hernandez-Diaz Huici ◽  
...  

Dysprosody is a hallmark of dysarthria and can affect the intelligibility and naturalness of speech. This includes sentence accent, which helps draw listeners' attention to important information in the message. Although some studies have investigated this feature, we currently lack properly validated automated procedures that can distinguish between the subtle performance differences observed across speakers with dysarthria. This study aims for cross-population validation of a set of acoustic features that have previously been shown to correlate with sentence accent. In addition, the impact of dysarthria severity level on sentence accent production is investigated. Two groups of adults, Dutch and English speakers, were analysed. Fifty-eight participants with dysarthria and 30 healthy control participants (HCP) produced sentences with varying accent positions. All speech samples were evaluated perceptually and analysed acoustically with an algorithm that extracts ten meaningful prosodic features and allows classification of accented versus unaccented syllables based on a linear combination of these parameters. The data were statistically analysed using discriminant analysis. Within the Dutch and English dysarthric populations, the algorithm correctly identified 82.8% and 91.9% of the accented target syllables, respectively, indicating that its capacity to discriminate between accented and unaccented syllables in a sentence is consistent with perceptual impressions. Moreover, different strategies for accent production across dysarthria severity levels could be demonstrated, an important step toward a better understanding of the nature of the deficit and the automatic classification of dysarthria severity using prosodic features.
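The core idea above, separating accented from unaccented syllables by a linear combination of prosodic features evaluated with discriminant analysis, can be sketched with synthetic stand-in features. The three features and their distributions below are illustrative assumptions, not the paper's ten validated parameters:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Stand-in prosodic features per syllable: [F0 peak (Hz), duration (s), intensity (dB)].
# Accented syllables tend to show higher F0, longer duration, and more energy.
unaccented = rng.normal([180, 0.15, 60], [15, 0.03, 3], size=(60, 3))
accented = rng.normal([220, 0.25, 66], [15, 0.03, 3], size=(60, 3))
X = np.vstack([unaccented, accented])
y = np.array([0] * 60 + [1] * 60)

# LDA learns exactly a linear combination of the features that best
# separates the two classes, mirroring the study's analysis approach.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("accent classification accuracy:", lda.score(X, y))
```

In the study itself, the identification rates (82.8% and 91.9%) come from real dysarthric speech, where the class distributions overlap far more than in this toy setup.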


Sensors ◽  
2021 ◽  
Vol 21 (15) ◽  
pp. 5097
Author(s):  
Mohammad Al-Qaderi ◽  
Elfituri Lahamer ◽  
Ahmad Rad

We present a new architecture to address the challenges of speaker identification that arise in the interaction of humans with social robots. Although deep learning systems have achieved impressive performance in many speech applications, limited speech data at the training stage and short utterances with background noise at the test stage remain open problems, as no optimum solution has been reported to date. The proposed design employs a generative model, the Gaussian mixture model (GMM), and a discriminative model, the support vector machine (SVM), together with prosodic features and short-term spectral features, to concurrently classify a speaker's gender and identity. The architecture works semi-sequentially in two stages: the first classifier exploits the prosodic features to determine the speaker's gender, which is then used together with the short-term spectral features as input to the second classifier system, which identifies the speaker. The second classifier system employs two types of short-term spectral features, mel-frequency cepstral coefficients (MFCC) and gammatone frequency cepstral coefficients (GFCC), as well as the gender information, as inputs to two different classifiers (GMM and GMM-supervector-based SVM), leading to four classifiers in total. The outputs of the second-stage classifiers, namely the GMM-MFCC maximum likelihood classifier (MLC), the GMM-GFCC MLC, the GMM-MFCC supervector SVM, and the GMM-GFCC supervector SVM, are fused at the score level by the weighted Borda count approach. The weight factors are computed on the fly via a Mamdani fuzzy inference system whose inputs are the signal-to-noise ratio and the length of the utterance.
Experimental evaluations suggest that the proposed architecture and fusion framework are promising and can improve the recognition performance of the system in challenging environments where the signal-to-noise ratio is low and utterances are short; such scenarios often arise in social robots' interactions with humans.
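The score-level fusion by weighted Borda count can be sketched as follows. The speaker names, rankings, and weights are hypothetical; in the paper the weights are not fixed but computed on the fly from the SNR and utterance length via a Mamdani fuzzy inference system:

```python
def weighted_borda_fusion(rankings, weights):
    """Fuse classifier rankings by weighted Borda count.

    rankings: list of ranked candidate lists, best first, one per classifier.
    weights:  one reliability weight per classifier.
    Returns candidates sorted by descending fused score.
    """
    scores = {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, speaker in enumerate(ranking):
            # Borda points: n-1 for rank 1, n-2 for rank 2, ..., 0 for last.
            scores[speaker] = scores.get(speaker, 0.0) + w * (n - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

# Four second-stage classifiers rank three candidate speakers.
rankings = [
    ["spk2", "spk1", "spk3"],   # e.g. GMM-MFCC MLC
    ["spk2", "spk3", "spk1"],   # e.g. GMM-GFCC MLC
    ["spk1", "spk2", "spk3"],   # e.g. GMM-MFCC supervector SVM
    ["spk2", "spk1", "spk3"],   # e.g. GMM-GFCC supervector SVM
]
weights = [0.9, 0.7, 0.8, 0.6]  # hypothetical reliability weights
print(weighted_borda_fusion(rankings, weights))  # spk2 wins
```

Because three of the four classifiers rank spk2 first, it accumulates the highest weighted score even though the MFCC-supervector SVM disagrees, which is the robustness the fusion stage is designed to provide.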

