Speaker Verification Employing Combinations of Self-Attention Mechanisms

Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2201
Author(s):  
Ara Bae ◽  
Wooil Kim

One of the most recent speaker recognition methods that demonstrates outstanding performance in noisy environments involves extracting the speaker embedding using an attention mechanism instead of average or statistics pooling. In the attention method, speaker recognition performance is improved by employing multiple heads rather than a single head. In this paper, we propose advanced methods to extract a new embedding by compensating for the disadvantages of the single-head and multi-head attention methods. The combination method comprising single-head and split-based multi-head attention shows a 5.39% Equal Error Rate (EER). When the single-head and projection-based multi-head attention methods are combined, the speaker recognition performance improves by 4.45%, which is the best performance in this work. Our experimental results demonstrate that the attention mechanism reflects the speaker’s properties more effectively than average or statistics pooling, and that the speaker verification system can be further improved by employing combinations of different attention techniques.
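
As a rough illustration of the pooling strategies this abstract contrasts, the sketch below combines a single-head attentive pooling layer with a split-based multi-head variant over dummy frame-level features. The layer sizes, number of heads, and the concatenation used to combine the two embeddings are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: single-head and split-based multi-head attentive pooling,
# combined by concatenating the two pooled embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttentivePooling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):                              # x: (batch, frames, dim)
        w = F.softmax(self.att(x), dim=1)              # attention weight per frame
        return (w * x).sum(dim=1)                      # (batch, dim)

class SplitMultiHeadAttentivePooling(nn.Module):
    """Split the feature dimension into H sub-vectors, one attention per head."""
    def __init__(self, dim, heads=4, hidden=64):
        super().__init__()
        assert dim % heads == 0
        self.sub = dim // heads
        self.att = nn.ModuleList([
            nn.Sequential(nn.Linear(self.sub, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(heads)])

    def forward(self, x):                              # x: (batch, frames, dim)
        pooled = []
        for head, chunk in zip(self.att, x.split(self.sub, dim=-1)):
            w = F.softmax(head(chunk), dim=1)
            pooled.append((w * chunk).sum(dim=1))
        return torch.cat(pooled, dim=-1)               # (batch, dim)

# One possible combination: concatenate the single-head and multi-head embeddings.
frames = torch.randn(8, 200, 256)                      # dummy frame-level features
single = SingleHeadAttentivePooling(256)(frames)
multi = SplitMultiHeadAttentivePooling(256, heads=4)(frames)
embedding = torch.cat([single, multi], dim=-1)         # (8, 512)
```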

Author(s):  
Khamis A. Al-Karawi

Background & Objective: Speaker Recognition (SR) techniques have matured considerably over the past few decades of development. Existing methods typically use robust features extracted from clean speech signals and can therefore achieve very high recognition accuracy under idealized conditions. For critical applications, such as security and forensics, the robustness and reliability of the system are crucial. Methods: Background noise and reverberation, which occur in many real-world applications, are known to compromise recognition performance. To improve the performance of speaker verification systems, an effective and robust technique is proposed to extract features for speech processing, capable of operating in both clean and noisy conditions. Mel Frequency Cepstral Coefficients (MFCCs) and Gammatone Frequency Cepstral Coefficients (GFCCs) are mature techniques and the most common features used for speaker recognition. MFCCs are calculated from the log energies in frequency bands distributed over a mel scale, while GFCCs are obtained from a bank of Gammatone filters originally proposed to model human cochlear filtering. This paper investigates the performance of GFCC and the conventional MFCC features in clean and noisy conditions. The effects of the Signal-to-Noise Ratio (SNR) and language mismatch on system performance are also taken into account in this work. Conclusion: Experimental results show significant improvement in system performance in terms of reduced equal error rate and detection error trade-off. Performance in terms of recognition rates under various types of noise and various SNRs was quantified via simulation. Results of the study are presented and discussed.
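
For reference, here is a minimal sketch of the conventional MFCC pipeline mentioned above (log energies on a mel scale followed by a DCT) using librosa; the frame sizes and filterbank settings are assumptions. The GFCC pipeline is analogous but replaces the mel filterbank with a gammatone filterbank, which is not sketched here.

```python
# Hedged sketch: MFCC extraction as log mel-band energies followed by a DCT.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    y, sr = librosa.load(wav_path, sr=sr)
    # Mel-scaled power spectrogram -> log energies -> DCT (cepstral coefficients)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
    log_mel = librosa.power_to_db(mel)
    return librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)

# Hypothetical usage (file path is a placeholder):
# mfcc = extract_mfcc("speaker_utterance.wav")
```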


2011 ◽  
Vol 1 (1) ◽  
pp. 41-53 ◽  
Author(s):  
Fudong Li ◽  
Nathan Clarke ◽  
Maria Papadaki ◽  
Paul Dowland

Mobile devices have become essential to modern society; however, as their popularity has grown, so has the requirement to ensure they remain secure. This paper proposes a behaviour-based profiling technique that uses a mobile user’s application usage to detect abnormal activities. By operating transparently to the user, the approach offers significant advantages over traditional point-of-entry authentication and can provide continuous protection. The experiment employed the MIT Reality dataset, comprising a total of 45,529 log entries. Four experiments were devised, based on an application-level dataset containing general application usage; two application-specific datasets combining telephony and text message data; and a combined dataset including both application-level and application-specific data. In each experiment, a user’s profile was built using either static or dynamic profiles, and the best experimental results for the application-level, telephony, text message, and multi-instance applications were EERs (Equal Error Rates) of 13.5%, 5.4%, 2.2%, and 10%, respectively.
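
Since the results above (and in most of the entries on this page) are reported as Equal Error Rates, a small sketch of how an EER can be computed from genuine and impostor scores may be helpful; the score distributions below are placeholders, not data from this study.

```python
# Hedged sketch: EER as the operating point where the false acceptance rate
# equals the false rejection rate.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(genuine_scores, impostor_scores):
    labels = np.concatenate([np.ones_like(genuine_scores), np.zeros_like(impostor_scores)])
    scores = np.concatenate([genuine_scores, impostor_scores])
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # threshold where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2

genuine = np.random.normal(1.0, 0.5, 1000)     # placeholder score distributions
impostor = np.random.normal(0.0, 0.5, 1000)
print(f"EER = {equal_error_rate(genuine, impostor):.3f}")
```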


Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1706
Author(s):  
Soonshin Seo ◽  
Ji-Hwan Kim

One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut-connection-based multi-layer aggregation improves the representational power of a speaker embedding system. However, the number of model parameters is relatively large, and unspecified variations increase with multi-layer aggregation. Therefore, in this study, we propose self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we use a ResNet with scaled channel width and layer depth as the baseline. To control variability during training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularization and batch normalization. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully connected layers and nonlinear activation functions. Finally, deep length normalization is applied to the recalibrated feature during training. Experimental results on the VoxCeleb1 evaluation dataset show that the performance of the proposed methods is comparable to that of state-of-the-art models (equal error rates of 4.95% and 2.86% using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
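
Below is a minimal sketch of the three components named in the abstract, assuming a list of per-layer ResNet outputs as input: self-attentive aggregation over layers with dropout and batch normalization, a feature recalibration layer built from fully connected layers and nonlinear activations, and length normalization of the final embedding. The dimensions and exact wiring are assumptions rather than the authors' architecture.

```python
# Hedged sketch: self-attentive multi-layer aggregation with feature
# recalibration and length normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveAggregation(nn.Module):
    def __init__(self, dim, hidden=128, emb_dim=256, dropout=0.2):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.drop = nn.Dropout(dropout)
        self.bn = nn.BatchNorm1d(dim)
        # Feature recalibration: fully connected layers with nonlinear activations
        # producing per-dimension scales (squeeze-and-excitation style assumption).
        self.recal = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                   nn.Linear(dim // 4, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, emb_dim)

    def forward(self, layer_feats):            # list of (batch, dim) layer outputs
        x = torch.stack(layer_feats, dim=1)    # (batch, layers, dim)
        w = F.softmax(self.att(self.drop(x)), dim=1)
        agg = self.bn((w * x).sum(dim=1))      # attention-weighted sum over layers
        agg = agg * self.recal(agg)            # feature recalibration
        emb = self.proj(agg)
        return F.normalize(emb, p=2, dim=-1)   # length normalization (unit norm)

feats = [torch.randn(8, 512) for _ in range(4)]    # dummy per-layer features
print(SelfAttentiveAggregation(512)(feats).shape)  # torch.Size([8, 256])
```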


2017 ◽  
Vol 13 (10) ◽  
pp. 6531-6542
Author(s):  
P Shanmugapriya ◽  
Y. Venkataramani

The integration of the GMM supervector and the Support Vector Machine (SVM) has become one of the most popular strategies in text-independent speaker verification. This paper describes the application of a Fuzzy Support Vector Machine (FSVM) to the classification of speakers using GMM supervectors. Supervectors are formed by stacking the mean vectors of GMMs adapted from a Universal Background Model (UBM) using maximum a posteriori (MAP) adaptation. GMM supervectors characterize a speaker’s acoustic characteristics and are used to develop a speaker-dependent fuzzy SVM model. Introducing fuzzy theory into the support vector machine yields better classification accuracy and requires fewer support vectors. Experiments were conducted on the 2001 NIST Speaker Recognition Evaluation corpus. The performance of the GMM-FSVM based speaker verification system is compared with that of the conventional GMM-UBM and GMM-SVM based systems. Experimental results indicate that the fuzzy SVM based speaker verification system with GMM supervectors achieves better performance than the GMM-UBM system.
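
The sketch below illustrates, under simplifying assumptions, how a GMM supervector can be formed by MAP-adapting UBM means to a speaker's features and stacking them, and how an SVM can then be trained on the supervectors. The fuzzy membership weights of the FSVM are only approximated here via per-sample weights, and the toy data stand in for real MFCC streams.

```python
# Hedged sketch: GMM supervectors (MAP-adapted UBM means) + weighted SVM.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def map_adapt_means(ubm, feats, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means only."""
    post = ubm.predict_proba(feats)                     # (frames, mixtures)
    n_k = post.sum(axis=0) + 1e-8                       # soft counts per mixture
    f_k = post.T @ feats                                # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * (f_k / n_k[:, None]) + (1 - alpha) * ubm.means_

def supervector(ubm, feats):
    return map_adapt_means(ubm, feats).ravel()          # stack adapted means

# Toy usage with random "feature" frames; real systems train the UBM on large data.
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(rng.normal(size=(2000, 20)))
X = np.stack([supervector(ubm, rng.normal(size=(300, 20))) for _ in range(40)])
y = np.repeat([0, 1], 20)                               # target vs. impostor labels
weights = np.ones(len(y))                               # fuzzy memberships would go here
SVC(kernel="linear").fit(X, y, sample_weight=weights)
```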


Author(s):  
Minho Jin ◽  
Chang D. Yoo

A speaker recognition system verifies or identifies a speaker’s identity based on his or her voice, and it is considered one of the most convenient biometric characteristics for human-machine communication. This chapter introduces several speaker recognition systems and examines their performance under various conditions. Speaker recognition can be classified into speaker verification and speaker identification. Speaker verification aims to verify whether an input speech corresponds to a claimed identity, while speaker identification aims to identify an input speech by selecting one model from a set of enrolled speaker models. Both verification and identification systems consist of three essential elements: feature extraction, speaker modeling, and matching. Feature extraction pertains to extracting essential features from the input speech for speaker recognition; speaker modeling pertains to probabilistically modeling the features of the enrolled speakers; and matching pertains to matching the input features to the various speaker models. Speaker modeling techniques including the Gaussian mixture model (GMM), the hidden Markov model (HMM), and phone n-grams are presented, and their performance is compared across various tasks in this chapter. Several verification and identification experiments presented here indicate that speaker recognition performance is highly dependent on the acoustic environment. A comparative study between human listeners and an automatic speaker verification system is also presented, indicating that an automatic speaker verification system can outperform human listeners. The applications of speaker recognition are summarized, and finally the various obstacles that must be overcome are discussed.
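
As one concrete instance of the matching step described above, the following sketch scores a test utterance against a claimed speaker's GMM and a universal background model and accepts when the average log-likelihood ratio exceeds a threshold. The model sizes, the threshold, and the random data are assumptions for illustration only.

```python
# Hedged sketch: GMM-UBM verification via an average log-likelihood ratio.
import numpy as np
from sklearn.mixture import GaussianMixture

def verify(speaker_gmm, ubm, test_feats, threshold=0.0):
    # score() returns the mean per-frame log-likelihood under each model
    llr = speaker_gmm.score(test_feats) - ubm.score(test_feats)
    return llr, llr > threshold

rng = np.random.default_rng(1)
ubm = GaussianMixture(n_components=16, covariance_type="diag").fit(rng.normal(size=(5000, 20)))
speaker = GaussianMixture(n_components=16, covariance_type="diag").fit(
    rng.normal(0.5, 1.0, size=(1500, 20)))
score, accepted = verify(speaker, ubm, rng.normal(0.5, 1.0, size=(300, 20)))
print(score, accepted)
```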


2018 ◽  
Vol 2018 ◽  
pp. 1-11 ◽  
Author(s):  
Robertas Damaševičius ◽  
Rytis Maskeliūnas ◽  
Egidijus Kazanavičius ◽  
Marcin Woźniak

Cryptographic frameworks depend on key sharing to ensure data security. While the keys in cryptographic frameworks must be exactly reproducible and are not unequivocally connected to the identity of a user, in biometric frameworks this is different. Combining cryptographic techniques with biometrics can address these issues. We present a biometric authentication method based on the discrete logarithm problem and Bose-Chaudhuri-Hocquenghem (BCH) codes, perform its security analysis, and demonstrate its security characteristics. We evaluate the biometric cryptosystem on our own dataset of electroencephalography (EEG) data collected from 42 subjects. The experimental results show that the described biometric user authentication system is effective, achieving an Equal Error Rate (EER) of 0.024.
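
A minimal sketch of the fuzzy-commitment idea that underlies this kind of biometric cryptosystem: a key is encoded with an error-correcting code, XORed with a binarized biometric template, and later recovered from a noisy fresh template by decoding. A simple repetition code stands in for the BCH code used in the paper, and the binarized EEG features are placeholders.

```python
# Hedged sketch: key binding/recovery with an error-correcting code
# (repetition code as a stand-in for BCH).
import numpy as np

def repetition_encode(key_bits, r=5):
    return np.repeat(key_bits, r)

def repetition_decode(code_bits, r=5):
    return (code_bits.reshape(-1, r).sum(axis=1) > r // 2).astype(np.uint8)

rng = np.random.default_rng(2)
key = rng.integers(0, 2, 32, dtype=np.uint8)
template = rng.integers(0, 2, 32 * 5, dtype=np.uint8)   # binarized enrollment template
commitment = repetition_encode(key) ^ template          # stored helper data

fresh = template.copy()
fresh[rng.choice(fresh.size, 8, replace=False)] ^= 1    # noisy verification template
recovered = repetition_decode(commitment ^ fresh)
print(np.array_equal(recovered, key))                   # True when errors are correctable
```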


2013 ◽  
Vol 710 ◽  
pp. 655-659
Author(s):  
Zhi Xian Jiu ◽  
Qiang Li

In this paper we report on a curvelet- and wavelet-based palm vein recognition algorithm. Using our palm vein image database, we employed a minimum-distance classifier to test the performance of the system. Experimental results show that the algorithm based on the curvelet transform reaches an equal error rate of 1.7%, while the algorithm based on the wavelet transform only reaches an equal error rate of 2.3%, indicating that the curvelet-based palm vein recognition system provides the better representation.
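
A sketch of the wavelet branch paired with a minimum-distance classifier, assuming the low-frequency wavelet approximation of each palm-vein image as the feature vector; the wavelet family, decomposition level, and image sizes are assumptions, and the curvelet branch would require a separate curvelet toolbox.

```python
# Hedged sketch: wavelet features + minimum-distance matching.
import numpy as np
import pywt

def wavelet_feature(img, wavelet="db2", level=3):
    approx = pywt.wavedec2(img, wavelet, level=level)[0]   # low-frequency sub-band
    return approx.ravel() / (np.linalg.norm(approx) + 1e-8)

def min_distance_match(probe_feat, templates):
    """templates: dict mapping identity -> enrolled feature vector."""
    return min(templates, key=lambda k: np.linalg.norm(probe_feat - templates[k]))

rng = np.random.default_rng(3)                             # random stand-ins for images
templates = {f"user{i}": wavelet_feature(rng.random((128, 128))) for i in range(5)}
probe = wavelet_feature(rng.random((128, 128)))
print(min_distance_match(probe, templates))
```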


Sensors ◽  
2020 ◽  
Vol 20 (23) ◽  
pp. 6784
Author(s):  
Xin Fang ◽  
Tian Gao ◽  
Liang Zou ◽  
Zhenhua Ling

Automatic speaker verification provides a flexible and effective way of performing biometric authentication. Previous deep learning-based methods have demonstrated promising results, but a few problems still require better solutions. In prior work on speaker-discriminative neural networks, the representation of the target speaker is regarded as fixed when comparing against utterances from different speakers, and the joint information between the enrollment and evaluation utterances is ignored. In this paper, we propose to combine CNN-based feature learning with a bidirectional attention mechanism to achieve better performance with only one enrollment utterance. The evaluation-enrollment joint information is exploited to provide interactive features through bidirectional attention. In addition, we introduce an individual cost function to identify the phonetic content, which helps compute the attention scores more specifically. These interactive features are complementary to the constant ones, which are extracted from individual speakers separately and do not vary with the evaluation utterances. The proposed method achieved a competitive equal error rate of 6.26% on the internal “DAN DAN NI HAO” benchmark dataset with 1250 utterances and outperformed various baseline methods, including the traditional i-vector/PLDA, d-vector, self-attention, and sequence-to-sequence attention models.
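
The following sketch conveys the bidirectional-attention idea at its simplest: enrollment and evaluation frame features attend to each other through a shared similarity matrix, and the attended summaries are compared by cosine similarity. The CNN front end, the phonetic-content loss, and the actual scoring head of the paper are omitted, and all shapes are assumptions.

```python
# Hedged sketch: bidirectional attention between enrollment and evaluation features.
import torch
import torch.nn.functional as F

def bidirectional_attention_score(enroll, evaluate):
    # enroll: (Te, D), evaluate: (Tv, D) frame-level features
    sim = enroll @ evaluate.t()                       # (Te, Tv) similarity matrix
    enroll_ctx = F.softmax(sim, dim=1) @ evaluate     # evaluation summary per enrollment frame
    eval_ctx = F.softmax(sim.t(), dim=1) @ enroll     # enrollment summary per evaluation frame
    e = torch.cat([enroll.mean(0), enroll_ctx.mean(0)])
    v = torch.cat([evaluate.mean(0), eval_ctx.mean(0)])
    return F.cosine_similarity(e, v, dim=0)

enroll = torch.randn(180, 128)                        # dummy enrollment features
evaluate = torch.randn(150, 128)                      # dummy evaluation features
print(bidirectional_attention_score(enroll, evaluate).item())
```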


2013 ◽  
Vol 284-287 ◽  
pp. 3270-3274 ◽  
Author(s):  
Chien Cheng Lin ◽  
Chin Chun Chang ◽  
De Ron Liang ◽  
Ching Han Yang

This paper proposes a non-intrusive authentication method based on two sensing components of smartphones, namely the orientation sensor and the touchscreen. We have found that these two sensors can capture the behavioral biometrics of a user while the user is engaged in relatively stationary activities. The experimental results for two types of flick operations show equal error rates of about 3.5% and 5%, respectively. To the best of our knowledge, this work is the first publicly reported study that simultaneously adopts the orientation sensor and the touchscreen to build an authentication model for smartphone users. Finally, we show that the proposed approach can be used together with existing intrusive mechanisms, such as passwords and/or fingerprints, to build a more robust authentication framework for smartphone users.
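
As a rough illustration, the sketch below fuses touchscreen and orientation-sensor readings from a flick gesture into one feature vector and trains a one-class model on the owner's gestures; the chosen features and the one-class SVM are assumptions, not the authors' classifier.

```python
# Hedged sketch: fusing touchscreen and orientation features for owner verification.
import numpy as np
from sklearn.svm import OneClassSVM

def flick_features(touch_xy, pressure, orientation_rpy):
    """touch_xy: (n,2) trace, pressure: (n,), orientation_rpy: (n,3) roll/pitch/yaw."""
    displacement = touch_xy[-1] - touch_xy[0]
    return np.concatenate([
        displacement,                          # flick direction and length
        [pressure.mean(), pressure.max()],     # touch pressure statistics
        orientation_rpy.mean(axis=0),          # average device attitude
        orientation_rpy.std(axis=0),           # hand tremor / stability
    ])

rng = np.random.default_rng(4)                 # random stand-ins for sensor logs
owner = np.stack([flick_features(rng.normal(0, 1, (30, 2)), rng.random(30),
                                 rng.normal(0, 0.1, (30, 3))) for _ in range(50)])
model = OneClassSVM(nu=0.05, gamma="scale").fit(owner)
probe = flick_features(rng.normal(0, 1, (30, 2)), rng.random(30), rng.normal(0, 0.1, (30, 3)))
print(model.predict(probe.reshape(1, -1)))     # +1 = accept as owner, -1 = reject
```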


Author(s):  
CINTHIA O. A. FREITAS ◽  
FLÁVIO BORTOLOZZI ◽  
ROBERT SABOURIN

The study investigates the perceptual feature similarity between different lexicons based on visual perception of the words and their representation through an observation sequence. We confirm that databases that are similar in terms of morphological/perceptual features can be used to improve recognition performance. In this work, we demonstrate through experimentation that it is possible to improve the recognition rate of handwritten Portuguese words by adding samples of French words to the training set. Experimental results show the effectiveness of this strategy in reducing the error rate.

