Improving Text-Dependent Speaker Recognition Performance

Background & Objective: Speaker Recognition (SR) techniques have been developed into a relatively mature status over the past few decades through development work. Existing methods typically use robust features extracted from clean speech signals, and therefore in idealized conditions can achieve very high recognition accuracy. For critical applications, such as security and forensics, robustness and reliability of the system are crucial. Methods: The background noise and reverberation as often occur in many real-world applications are known to compromise recognition performance. To improve the performance of speaker verification systems, an effective and robust technique is proposed to extract features for speech processing, capable of operating in the clean and noisy condition. Mel Frequency Cepstrum Coefficients (MFCCs) and Gammatone Frequency Cepstral Coefficients (GFCC) are the mature techniques and the most common features, which are used for speaker recognition. MFCCs are calculated from the log energies in frequency bands distributed over a mel scale. While GFCC has been acquired from a bank of Gammatone filters, which was originally suggested to model human cochlear filtering. This paper investigates the performance of GFCC and the conventional MFCC feature in clean and noisy conditions. The effects of the Signal-to-Noise Ratio (SNR) and language mismatch on the system performance have been taken into account in this work. Conclusion: Experimental results have shown significant improvement in system performance in terms of reduced equal error rate and detection error trade-off. Performance in terms of recognition rates under various types of noise, various Signal-to-Noise Ratios (SNRs) was quantified via simulation. Results of the study are also presented and discussed.

Download Full-text

Effects of the Carrier Phrase on Word Recognition Performances by Younger and Older Listeners Using Two Stimulus Paradigms

Journal of the American Academy of Audiology ◽

10.3766/jaaa.19061 ◽

2020 ◽

Vol 31 (06) ◽

pp. 412-441 ◽

Cited By ~ 1

Author(s):

Richard H. Wilson ◽

Victoria A. Sanchez

Keyword(s):

Word Recognition ◽

Target Word ◽

Speaker Recognition ◽

Repeated Measures ◽

Recognition Performance ◽

Test Word ◽

Root Mean Square Amplitude ◽

Repeated Measures Design ◽

The Mean ◽

The Individual

Abstract Background In the 1950s, with monitored live voice testing, the vu meter time constant and the short durations and amplitude modulation characteristics of monosyllabic words necessitated the use of the carrier phrase amplitude to monitor (indirectly) the presentation level of the words. This practice continues with recorded materials. To relieve the carrier phrase of this function, first the influence that the carrier phrase has on word recognition performance needs clarification, which is the topic of this study. Purpose Recordings of Northwestern University Auditory Test No. 6 by two female speakers were used to compare word recognition performances with and without the carrier phrases when the carrier phrase and test word were (1) in the same utterance stream with the words excised digitally from the carrier (VA-1 speaker) and (2) independent of one another (VA-2 speaker). The 50-msec segment of the vowel in the target word with the largest root mean square amplitude was used to equate the target word amplitudes. Research Design A quasi-experimental, repeated measures design was used. Study Sample Twenty-four young normal-hearing adults (YNH; M = 23.5 years; pure-tone average [PTA] = 1.3-dB HL) and 48 older hearing loss listeners (OHL; M = 71.4 years; PTA = 21.8-dB HL) participated in two, one-hour sessions. Data Collection and Analyses Each listener had 16 listening conditions (2 speakers × 2 carrier phrase conditions × 4 presentation levels) with 100 randomized words, 50 different words by each speaker. Each word was presented 8 times (2 carrier phrase conditions × 4 presentation levels [YNH, 0- to 24-dB SL; OHL, 6- to 30-dB SL]). The 200 recorded words for each condition were randomized as 8, 25-word tracks. In both test sessions, one practice track was followed by 16 tracks alternated between speakers and randomized by blocks of the four conditions. Central tendency and repeated measures analyses of variance statistics were used. Results With the VA-1 speaker, the overall mean recognition performances were 6.0% (YNH) and 8.3% (OHL) significantly better with the carrier phrase than without the carrier phrase. These differences were in part attributed to the distortion of some words caused by the excision of the words from the carrier phrases. With the VA-2 speaker, recognition performances on the with and without carrier phrase conditions by both listener groups were not significantly different, except for one condition (YNH listeners at 8-dB SL). The slopes of the mean functions were steeper for the YNH listeners (3.9%/dB to 4.8%/dB) than for the OHL listeners (2.4%/dB to 3.4%/dB) and were <1%/dB steeper for the VA-1 speaker than for the VA-2 speaker. Although the mean results were clear, the variability in performance differences between the two carrier phrase conditions for the individual participants and for the individual words was striking and was considered in detail. Conclusion The current data indicate that word recognition performances with and without the carrier phrase (1) were different when the carrier phrase and target word were produced in the same utterance with poorer performances when the target words were excised from their respective carrier phrases (VA-1 speaker), and (2) were the same when the carrier phrase and target word were produced as independent utterances (VA-2 speaker).

Download Full-text

Comparison of the automatic speaker recognition performance over standard features

2012 IEEE 10th Jubilee International Symposium on Intelligent Systems and Informatics ◽

10.1109/sisy.2012.6339541 ◽

2012 ◽

Cited By ~ 3

Author(s):

Milan M. Dobrovic ◽

Vlado D. Delic ◽

Niksa M. Jakovljevic ◽

Ivan D. Jokic

Keyword(s):

Speaker Recognition ◽

Recognition Performance ◽

Automatic Speaker Recognition

Download Full-text

Assessing the speaker recognition performance of naive listeners using mechanical turk

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2011.5947708 ◽

2011 ◽

Cited By ~ 4

Author(s):

Wade Shen ◽

Joseph Campbell ◽

Derek Straub ◽

Reva Schwartz

Keyword(s):

Speaker Recognition ◽

Recognition Performance ◽

Mechanical Turk

Download Full-text

Using redundant parallel architecture to improve speaker recognition performance

Journal of Control Theory and Applications ◽

10.1007/s11768-008-5222-3 ◽

2008 ◽

Vol 6 (2) ◽

pp. 221-223

Author(s):

Zhengquan Qiu ◽

Junxun Yin ◽

Caiyun Fan

Keyword(s):

Speaker Recognition ◽

Recognition Performance ◽

Parallel Architecture

Download Full-text

Gammachirp Filter Banks Applied in Roust Speaker Recognition Based GMM-UBM Classifier

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/2/4 ◽

2019 ◽

Vol 17 (2) ◽

pp. 170-177

Author(s):

Lei Deng ◽

Yong Gao

Keyword(s):

Feature Extraction ◽

Speaker Recognition ◽

Recognition Performance ◽

Recognition System ◽

Cube Root ◽

Mel Frequency Cepstral Coefficients ◽

Feature Extraction Algorithm ◽

Extraction Algorithm ◽

Auditory Feature ◽

Cepstral Coefficients

In this paper, authors propose an auditory feature extraction algorithm in order to improve the performance of the speaker recognition system in noisy environments. In this auditory feature extraction algorithm, the Gammachirp filter bank is adapted to simulate the auditory model of human cochlea. In addition, the following three techniques are applied: cube-root compression method, Relative Spectral Filtering Technique (RASTA), and Cepstral Mean and Variance Normalization algorithm (CMVN).Subsequently, based on the theory of Gaussian Mixes Model-Universal Background Model (GMM-UBM), the simulated experiment was conducted. The experimental results implied that speaker recognition systems with the new auditory feature has better robustness and recognition performance compared to Mel-Frequency Cepstral Coefficients(MFCC), Relative Spectral-Perceptual Linear Predictive (RASTA-PLP),Cochlear Filter Cepstral Coefficients (CFCC) and gammatone Frequency Cepstral Coefficeints (GFCC)

Download Full-text

When Speaker Recognition Meets Noisy Labels: Optimizations for Front-ends and Back-ends

10.36227/techrxiv.17121863.v2 ◽

2021 ◽

Author(s):

Lin Li ◽

Fuchuan Tong ◽

Qingyang Hong

Keyword(s):

Speaker Recognition ◽

Large Scale ◽

Recognition Performance ◽

Recognition System ◽

Correction Method ◽

Superior Performance ◽

Label Noise ◽

Practical Applications ◽

Front End ◽

Noisy Labels

A typical speaker recognition system often involves two modules: a feature extractor front-end and a speaker identity back-end. Despite the superior performance that deep neural networks have achieved for the front-end, their success benefits from the availability of large-scale and correctly labeled datasets. While label noise is unavoidable in speaker recognition datasets, both the front-end and back-end are affected by label noise, which degrades the speaker recognition performance. In this paper, we first conduct comprehensive experiments to help improve the understanding of the effects of label noise on both the front-end and back-end. Then, we propose a simple yet effective training paradigm and loss correction method to handle label noise for the front-end. We combine our proposed method with the recently proposed Bayesian estimation of PLDA for noisy labels, and the whole system shows strong robustness to label noise. Furthermore, we show two practical applications of the improved system: one application corrects noisy labels based on an utterance’s chunk-level predictions, and the other algorithmically filters out high-confidence noisy samples within a dataset. By applying the second application to the NIST SRE0410 dataset and verifying filtered utterances by human validation, we identify that approximately 1% of the SRE04-10 dataset is made up of label errors.<br>

Download Full-text

The effects of telephone transmission degradations on speaker recognition performance

1995 International Conference on Acoustics, Speech, and Signal Processing ◽

10.1109/icassp.1995.479540 ◽

2002 ◽

Cited By ~ 29

Author(s):

D.A. Reynolds ◽

M.A. Zissman ◽

T.F. Quatieri ◽

G.C. O'Leary ◽

B.A. Carlson

Keyword(s):

Speaker Recognition ◽

Recognition Performance

Download Full-text

Analysis of the Effects of Supraglottal Tract Surgical Procedures in Automatic Speaker Recognition Performance

IEEE/ACM Transactions on Audio Speech and Language Processing ◽

10.1109/taslp.2020.2967567 ◽

2020 ◽

Vol 28 ◽

pp. 798-812

Author(s):

Laureano Moro-Velaquez ◽

Estefania Hernandez-Garcia ◽

Jorge A. Gomez-Garcia ◽

Juan I. Godino-Llorente ◽

Najim Dehak

Keyword(s):

Surgical Procedures ◽

Speaker Recognition ◽

Recognition Performance ◽

Automatic Speaker Recognition

Download Full-text

TIME-WARPING NETWORK: A NEURAL APPROACH TO HIDDEN MARKOV MODEL BASED SPEECH RECOGNITION

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s021800149300039x ◽

1993 ◽

Vol 07 (04) ◽

pp. 783-799 ◽

Cited By ~ 3

Author(s):

ESTHER LEVIN ◽

ROBERTO PIERACCINI ◽

ENRICO BOCCHIERI

Keyword(s):

Speech Recognition ◽

Speaker Recognition ◽

Hidden Markov ◽

Recognition Performance ◽

Temporal Structure ◽

Recognition Task ◽

Neural Representation ◽

Discriminative Power ◽

Time Warping ◽

Gaussian Density

Recently, much interest has been generated regarding speech recognition systems based on Hidden Markov Models (HMMs) and neural network (NN) hybrids. Such systems attempt to combine the best features of both models: the temporal structure of HMMs and the discriminative power of neural networks. In this work we establish one more relation between the HMM and the NN paradigms by introducing the time-warping network (TWN) that is a generalization of both an HMM-based recognizer and a backpropagation net. The basic element of such a network, a time- warping neuron, extends the operation of the formal neuron of a backpropagation network by warping the input pattern to match it optimally to its weights. We show that a single-layer network of TW neurons is equivalent to a Gaussian density HMM-based recognition system. This equivalent neural representation suggests ways to improve the discriminative power of this system by using backpropagation discriminative training, and/or by generalizing the structure of the recognizer to a multi-layer net. The performance of the proposed network was evaluated on a highly confusable, isolated word, multi-speaker recognition task. The results indicate that not only does the recognition performance improve, but the separation between classes is enhanced, allowing us to set up a rejection criterion to improve the confidence of the system.

Download Full-text