Text-Independent Speaker Recognition Based on Adaptive Course Learning Loss and Deep Residual Network

Author(s):  
Qinghua Zhong ◽  
Ruining Dai ◽  
Han Zhang ◽  
YongSheng Zhu ◽  
Guofu Zhou

Abstract Text-independent speaker recognition is widely used in identity recognition and has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. To improve the recognition ability of log filter bank feature vectors, a text-independent speaker recognition method based on a deep residual network model was proposed in this paper. First, 64-dimensional log filter bank features were extracted from the original audio. Second, a deep residual network, composed of a residual network (ResNet) and a Convolutional Attention Statistics Pooling (CASP) layer, extracted abstract features from them; the CASP layer aggregated the frame-level features from the ResNet into utterance-level features. Training the discriminative feature extraction network with a margin-based loss function was a straightforward solution, but such loss functions have limitations, such as the margins between different categories being set to the same fixed value. Thus, an Adaptive Curriculum Learning Loss (ACLL) was used to address this problem, building on two margin-based losses, AM-Softmax and AAM-Softmax, and the ACLL classifier completed the text-independent speaker recognition. The proposed method was applied to the large-scale VoxCeleb2 dataset for extensive text-independent speaker recognition experiments, and the average equal error rate (EER) reached 1.76% on the VoxCeleb1 test dataset, 1.91% on the VoxCeleb1-E test dataset, and 3.24% on the VoxCeleb1-H test dataset. Compared with related speaker recognition methods, EER was improved by 1.11%, 1.04%, and 1.69% on these datasets, respectively.
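For readers unfamiliar with the margin-based losses this abstract builds on, below is a minimal PyTorch sketch of AAM-Softmax (additive angular margin). The embedding size, scale, and margin are illustrative defaults, and ACLL's adaptive, difficulty-dependent margin is not reproduced; this is only the fixed-margin baseline it extends.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (AAM-Softmax) with a fixed margin."""
    def __init__(self, embed_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_speakers, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalised embeddings and class centres.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin to the target-class logit only.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)

# VoxCeleb2's development set has 5994 speakers; 192-dim embeddings are a common choice.
loss_fn = AAMSoftmax(embed_dim=192, num_speakers=5994)
print(loss_fn(torch.randn(8, 192), torch.randint(0, 5994, (8,))))
```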


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2201
Author(s):  
Ara Bae ◽  
Wooil Kim

One of the most recent speaker recognition methods that demonstrates outstanding performance in noisy environments extracts the speaker embedding using an attention mechanism instead of average or statistics pooling, and its performance improves further when multiple attention heads are employed rather than a single head. In this paper, we propose advanced methods to extract a new embedding that compensates for the disadvantages of the single-head and multi-head attention methods. The combination of single-head and split-based multi-head attention shows a 5.39% Equal Error Rate (EER). When the single-head and projection-based multi-head attention methods are combined, the performance improves to a 4.45% EER, the best result in this work. Our experimental results demonstrate that the attention mechanism reflects the speaker's properties more effectively than average or statistics pooling, and that the speaker verification system can be further improved by combining different attention techniques.
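As a point of reference, here is a minimal sketch of single-head attentive statistics pooling, the mechanism contrasted above with average and statistics pooling; the hidden size and tensor shapes are assumptions, and the paper's split-based and projection-based multi-head variants are not reproduced.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Pool frame-level features into an utterance-level embedding."""
    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1))

    def forward(self, frames):                        # frames: (batch, time, feat_dim)
        w = torch.softmax(self.att(frames), dim=1)    # per-frame attention weights
        mu = (w * frames).sum(dim=1)                  # attention-weighted mean
        var = (w * frames ** 2).sum(dim=1) - mu ** 2  # attention-weighted variance
        std = var.clamp(min=1e-9).sqrt()
        return torch.cat([mu, std], dim=-1)           # utterance-level embedding

pool = AttentiveStatsPool(feat_dim=64)
print(pool(torch.randn(2, 300, 64)).shape)  # torch.Size([2, 128])
```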


2012 ◽  
Vol 9 (4) ◽  
pp. 1407-1430 ◽  
Author(s):  
Nakhat Fatima ◽  
Xiaojun Wu ◽  
Fang Zheng

Information about speech units such as vowels, consonants, and syllables can serve as knowledge in text-independent Short Utterance Speaker Recognition (SUSR), in a similar way as in text-dependent speaker recognition. In such tasks, the data for each speech unit, especially at recognition time, are often insufficient, so it is not practical to use the full set of speech units because some of them might not be well trained. To solve this problem, a method using speech unit categories rather than individual phones is proposed for SUSR: similar speech units are grouped together, alleviating the data sparsity. We define Vowel, Consonant, and Syllable Categories (VC, CC, and SC) with Standard Chinese (Putonghua) as a reference. A speech utterance is recognized into VC, CC, and SC sequences, which are used to train a Universal Background Model (UBM) for each speech unit category in the training procedure and to perform speech-unit-category-dependent speaker recognition, respectively. Experimental results in a Gaussian Mixture Model–Universal Background Model (GMM-UBM) based system give relative equal error rate (EER) reductions of 54.50% and 40.95% from the minimum EERs of VCs and SCs, respectively, for 2 seconds of test utterance compared with existing SUSR systems.
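For context, a toy sketch of the GMM-UBM scoring such a system builds on follows; the features are random placeholders, a freshly trained speaker GMM stands in for MAP adaptation of the UBM, and the paper's per-category (VC/CC/SC) UBMs are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 20))       # pooled features from many speakers
speaker = rng.normal(loc=0.3, size=(200, 20))  # enrolment features for one speaker
test = rng.normal(loc=0.3, size=(50, 20))      # frames from a short test utterance

ubm = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(background)
spk = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(speaker)  # stand-in for MAP adaptation

# Average per-frame log-likelihood ratio: positive favours the claimed speaker.
llr = spk.score(test) - ubm.score(test)
print(f"LLR = {llr:.3f}")
```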


2020 ◽  
Vol 64 (4) ◽  
pp. 40404-1-40404-16
Author(s):  
I.-J. Ding ◽  
C.-M. Ruan

Abstract With rapid developments in techniques related to the Internet of Things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and potentially become a requirement in smart home or office environments. In such intelligent applications, identity recognition of a specific member in an indoor space is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed, in which visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations for constructing speaker classification trees, significantly enhancing the Gaussian mixture model (GMM)-based speaker recognition method. The approach also considered the privacy of the monitored person and reduced the degree of surveillance. The popular Kinect sensor device containing a microphone array was adopted to obtain acoustic voice data, and only two cameras are deployed in a specific indoor space to conveniently perform face detection and quickly determine the total number of people in that space. This head count obtained by face detection was used to regulate the design of an accurate GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively; both are higher than the rate of the conventional GMM approach (80.5%). Moreover, because the highly complex calculations of face recognition required in general audio-visual speaker recognition tasks are avoided, the proposed approach is rapid and efficient, adding only 0.051 s to the average recognition time.
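A hedged sketch of the underlying idea, using the face-detection head count to restrict which speaker GMMs are scored, appears below; the speaker names, models, and `present` list are hypothetical placeholders, and the GMM-BT/GMM-NBT tree constructions are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical enrolled speakers, each with a trained GMM over 13-dim features.
models = {}
for i, name in enumerate(["alice", "bob", "carol"]):
    gmm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0)
    models[name] = gmm.fit(rng.normal(loc=i, size=(300, 13)))

present = ["alice", "bob"]                   # e.g. two faces detected by the cameras
frames = rng.normal(loc=0.0, size=(40, 13))  # features from the microphone array

# Score only the candidates that face detection says are in the room.
scores = {name: models[name].score(frames) for name in present}
print(max(scores, key=scores.get))
```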


2011 ◽  
Vol 1 (1) ◽  
pp. 41-53 ◽  
Author(s):  
Fudong Li ◽  
Nathan Clarke ◽  
Maria Papadaki ◽  
Paul Dowland

Mobile devices have become essential to modern society; however, as their popularity has grown, so has the requirement to ensure that devices remain secure. This paper proposes a behaviour-based profiling technique that uses a mobile user's application usage to detect abnormal activities. By operating transparently to the user, the approach offers significant advantages over traditional point-of-entry authentication and can provide continuous protection. The experiment employed the MIT Reality dataset, with a total of 45,529 log entries. Four experiments were devised: one based on an application-level dataset covering general application usage; two based on application-specific datasets combining telephony and text message data; and one based on a combined dataset that included both application-level and application-specific data. In each experiment, a user's profile was built using either static or dynamic profiles, and the best results for the application-level, telephony, text message, and multi-instance applications were Equal Error Rates (EER) of 13.5%, 5.4%, 2.2%, and 10%, respectively.
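Since EER figures recur throughout these abstracts, a small sketch of how an Equal Error Rate is computed from genuine and impostor scores may help; the scores here are synthetic placeholders.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER: the operating point where false-accept and false-reject rates cross."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(2)
genuine = rng.normal(2.0, 1.0, 1000)   # scores for true-user attempts
impostor = rng.normal(0.0, 1.0, 1000)  # scores for impostor attempts
print(f"EER = {equal_error_rate(genuine, impostor):.3%}")
```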


The performance of the Mel scale and the Bark scale is evaluated for a text-independent speaker identification system. Both scales are designed according to the human auditory system: the Mel scale follows the human ear's interpretation of pitch, while the Bark scale is based on the critical-band selectivity at which loudness becomes significantly different. Filter bank structures are defined using the Mel and Bark scales to extract speaker-specific speech features for speech and speaker recognition systems. It is found that Bark scale centre frequencies are more effective than Mel scale centre frequencies in the case of Indian dialect speaker databases. The recognition rate achieved using the Bark scale filter bank is 96% for the AISSMSIOIT database and 95% for the Marathi database.
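The two auditory scales compared above have standard closed forms (O'Shaughnessy's Mel formula and Zwicker's Bark approximation), sketched here for reference:

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: perceived pitch relative to frequency in Hz."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    """Bark scale: critical-band rate approximation for frequency in Hz."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

for f in (100, 1000, 4000):
    print(f"{f} Hz -> {hz_to_mel(f):.1f} mel, {hz_to_bark(f):.2f} bark")
```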


2021 ◽  
Author(s):  
Srinivas Ramavath ◽  
Umesh Chandra Samal

Abstract In this paper, two new companders are designed to reduce the peak-to-average power ratio (PAPR) experienced by filter bank multicarrier (FBMC) signals. Specifically, the basic compander model is generalized to alter the amplitude peaks of the distributed FBMC signal. The proposed compander design provides better performance in terms of PAPR reduction, Bit Error Rate (BER), and phase error degradation than previously existing compander schemes. Many PAPR reduction approaches, such as the µ-law companding technique, are available, but µ-law companding produces spectral side lobes, whereas the proposed techniques result in a spectrum with fewer side lobes. Theoretical analyses of the linear compander and expander transforms are derived for several specific parameters, and simulations show that the suggested linear companding technique efficiently decreases the high peaks in the FBMC system.
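For reference, a short sketch of PAPR measurement and the classic µ-law compander mentioned as a baseline follows; the signal is a random placeholder rather than a real FBMC waveform, the per-component companding convention is an assumption, and the paper's two proposed companders are not reproduced.

```python
import numpy as np

def papr_db(x):
    """Peak-to-average power ratio of a signal, in dB."""
    p = np.abs(x) ** 2
    return 10.0 * np.log10(p.max() / p.mean())

def mu_law_compress(x, mu=255.0):
    """Classic mu-law compander: boosts small amplitudes, compressing the peak."""
    peak = np.max(np.abs(x))
    return np.sign(x) * peak * np.log1p(mu * np.abs(x) / peak) / np.log1p(mu)

rng = np.random.default_rng(3)
x = (rng.normal(size=4096) + 1j * rng.normal(size=4096)) / np.sqrt(2)
print(f"before: {papr_db(x):.2f} dB")
# Compand real and imaginary parts independently (one simple convention).
y = mu_law_compress(x.real) + 1j * mu_law_compress(x.imag)
print(f"after:  {papr_db(y):.2f} dB")
```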


2020 ◽  
Author(s):  
Anbiao Huang ◽  
Shuo Gao ◽  
Arokia Nathan

In Internet of Things (IoT) applications, among various authentication techniques, keystroke authentication methods based on a user's touch behavior have received increasing attention due to their unique benefits. In this paper, we present a technique for achieving high user authentication accuracy by utilizing a user's touch time and force information, obtained from an assembled piezoelectric touch panel. By combining artificial neural networks with the user's touch features, an equal error rate (EER) of 1.09% is achieved, advancing the development of security techniques in the field of IoT.
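A minimal sketch of the kind of classifier described, a small neural network over per-keystroke touch time and force features, is shown below; the two-feature layout and network shape are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Tiny binary classifier: genuine user vs. impostor, from two touch features.
model = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(),
    nn.Linear(16, 1))                        # single logit output

features = torch.tensor([[0.12, 0.83]])      # [touch duration (s), normalised force]
prob_genuine = torch.sigmoid(model(features))
print(prob_genuine.item())
```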

