When Speaker Recognition Meets Noisy Labels: Optimizations for Front-ends and Back-ends

2021 ◽
Author(s):  
Lin Li ◽  
Fuchuan Tong ◽  
Qingyang Hong

A typical speaker recognition system involves two modules: a feature extractor front-end and a speaker identity back-end. Although deep neural networks have achieved superior performance for the front-end, their success depends on the availability of large-scale, correctly labeled datasets. Because label noise is unavoidable in speaker recognition datasets, both the front-end and the back-end are affected by it, which degrades speaker recognition performance. In this paper, we first conduct comprehensive experiments to improve the understanding of the effects of label noise on both the front-end and the back-end. We then propose a simple yet effective training paradigm and loss correction method to handle label noise for the front-end. We combine our proposed method with the recently proposed Bayesian estimation of PLDA for noisy labels, and the whole system shows strong robustness to label noise. Furthermore, we present two practical applications of the improved system: one corrects noisy labels based on an utterance’s chunk-level predictions, and the other algorithmically filters out high-confidence noisy samples within a dataset. By applying the second application to the NIST SRE04-10 dataset and verifying the filtered utterances through human validation, we find that approximately 1% of the SRE04-10 dataset consists of label errors.
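As a rough illustration of the chunk-level label-correction idea mentioned in the abstract, the sketch below averages per-chunk softmax posteriors over an utterance and relabels it only on confident disagreement. It is not the authors' implementation; `front_end_posteriors`, the chunk length, and the confidence threshold are hypothetical placeholders.

```python
# Minimal sketch of chunk-level label correction (assumed pipeline, not the paper's code).
import numpy as np

def split_into_chunks(utterance, chunk_len):
    """Split a 1-D sample/feature sequence into fixed-length chunks (assumes >= 1 chunk fits)."""
    n = len(utterance) // chunk_len
    return [utterance[i * chunk_len:(i + 1) * chunk_len] for i in range(n)]

def correct_label(utterance, assigned_label, front_end_posteriors,
                  chunk_len=16000, threshold=0.9):
    """Average chunk-level posteriors; relabel only on confident disagreement."""
    chunks = split_into_chunks(utterance, chunk_len)
    post = np.mean([front_end_posteriors(c) for c in chunks], axis=0)
    predicted = int(np.argmax(post))
    if predicted != assigned_label and post[predicted] >= threshold:
        return predicted      # confident disagreement: correct the label
    return assigned_label     # otherwise keep the original label
```

The same averaged posteriors could be thresholded more aggressively to flag (rather than relabel) high-confidence noisy samples, as in the filtering application described above.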


2021 ◽  
Vol 11 (21) ◽  
pp. 10079
Author(s):  
Muhammad Firoz Mridha ◽  
Abu Quwsar Ohi ◽  
Muhammad Mostafa Monowar ◽  
Md. Abdul Hamid ◽  
Md. Rashedul Islam ◽  
...  

Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built in two stages: the first extracts low-dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system mainly depends on the extraction process of speech embeddings, which are primarily pre-trained on a large-scale dataset. Because the embedding systems are pre-trained, the performance of speaker recognition models depends heavily on the domain adaptation policy and may degrade if the models are trained with inadequate data. This paper introduces a speaker recognition strategy for unlabeled data that generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy relies on the assumption that a small speech segment contains a single speaker. Based on this assumption, pairwise constraints are constructed with noise augmentation policies and used to train the AutoEmbedder architecture, which generates speaker embeddings. Without relying on a domain adaptation policy, the process produces clusterable speaker embeddings in an unsupervised manner, termed unsupervised vectors (u-vectors). The evaluation is conducted on two popular English speaker recognition datasets, TIMIT and LibriSpeech, along with a Bengali dataset that illustrates the diversity of domain shifts faced by speaker recognition systems. Finally, we conclude that the proposed approach achieves satisfactory performance using pairwise architectures.
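A minimal sketch of the pairwise-constraint idea, assuming each short segment contains a single speaker: chunks cut from the same segment (plus a noise-augmented copy) form positive pairs, and chunks from different segments form negative pairs. The `add_noise` augmentation and the chunk length are placeholders, not the paper's policies.

```python
# Sketch of pairwise constraint construction for unsupervised embedding training.
import random
import numpy as np

def add_noise(chunk, snr_db=15.0):
    """Placeholder white-noise augmentation at a nominal SNR (chunk is a NumPy array)."""
    noise = np.random.randn(*chunk.shape)
    scale = np.sqrt(np.mean(chunk ** 2) /
                    (10 ** (snr_db / 10.0) * np.mean(noise ** 2) + 1e-12))
    return chunk + scale * noise

def make_pairs(segments, chunk_len, n_pairs):
    """Yield (chunk_a, chunk_b, same_speaker) pairs; segments are assumed longer than chunk_len."""
    pairs = []
    for _ in range(n_pairs):
        if random.random() < 0.5:                       # positive (same-segment) pair
            seg = random.choice(segments)
            start = random.randint(0, len(seg) - chunk_len)
            chunk = seg[start:start + chunk_len]
            pairs.append((chunk, add_noise(chunk), 1))
        else:                                           # negative (different-segment) pair
            seg_a, seg_b = random.sample(segments, 2)
            a = random.randint(0, len(seg_a) - chunk_len)
            b = random.randint(0, len(seg_b) - chunk_len)
            pairs.append((seg_a[a:a + chunk_len], seg_b[b:b + chunk_len], 0))
    return pairs
```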


2019 ◽  
Vol 17 (2) ◽  
pp. 170-177
Author(s):  
Lei Deng ◽  
Yong Gao

In this paper, the authors propose an auditory feature extraction algorithm to improve the performance of speaker recognition systems in noisy environments. In this algorithm, a Gammachirp filter bank is adopted to simulate the auditory model of the human cochlea. In addition, the following three techniques are applied: cube-root compression, the Relative Spectral filtering technique (RASTA), and Cepstral Mean and Variance Normalization (CMVN). Subsequently, simulated experiments were conducted based on the Gaussian Mixture Model-Universal Background Model (GMM-UBM) framework. The experimental results show that speaker recognition systems using the new auditory feature have better robustness and recognition performance than Mel-Frequency Cepstral Coefficients (MFCC), Relative Spectral-Perceptual Linear Prediction (RASTA-PLP), Cochlear Filter Cepstral Coefficients (CFCC), and Gammatone Frequency Cepstral Coefficients (GFCC).
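For concreteness, the following sketch shows two of the named post-processing steps, cube-root compression and per-utterance CMVN, applied to filterbank outputs. The Gammachirp filterbank itself is assumed to be computed elsewhere; `fbank_energies` (frames x channels) is a placeholder, not the authors' code.

```python
# Sketch of cube-root compression followed by CMVN on filterbank features.
import numpy as np

def cube_root_compression(fbank_energies):
    """Cube-root amplitude compression in place of the usual log compression."""
    return np.power(np.maximum(fbank_energies, 1e-10), 1.0 / 3.0)

def cmvn(features):
    """Per-utterance mean and variance normalization of each coefficient."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True) + 1e-10
    return (features - mean) / std

# Example: feats = cmvn(cube_root_compression(fbank_energies))
# The resulting features would then feed a GMM-UBM speaker recognition back-end.
```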


2013 ◽  
Vol 2013 ◽  
pp. 1-13 ◽  
Author(s):  
Gaohang Yu ◽  
Shanzhou Niu ◽  
Jianhua Ma ◽  
Yisheng Song

Combining the multivariate spectral gradient method with a projection scheme, this paper presents an adaptive prediction-correction method for solving large-scale nonlinear systems of monotone equations. The proposed method possesses some favorable properties: (1) it makes progress step by step, that is, the distance between the iterates and the solution set decreases monotonically; (2) the global convergence result is independent of the merit function and its Lipschitz continuity; (3) it is a derivative-free method and, owing to its low storage requirement, can be applied to large-scale nonsmooth equations. Preliminary numerical results show that the proposed method is very effective. Some practical applications of the proposed method are demonstrated and tested on sparse signal reconstruction, compressed sensing, and image deconvolution problems.
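To illustrate the prediction-correction (projection) framework for monotone equations F(x) = 0, here is a generic derivative-free sketch in the Solodov-Svaiter style. The plain direction d = -F(x) stands in for the paper's multivariate spectral gradient direction, and the line-search parameters are illustrative, so this is not the authors' algorithm.

```python
# Sketch of a hyperplane-projection method for monotone equations F(x) = 0.
import numpy as np

def projection_method(F, x0, sigma=1e-4, beta=0.5, tol=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        d = -Fx                                    # prediction direction (placeholder)
        t = 1.0
        while True:                                # derivative-free line search
            z = x + t * d
            if -F(z) @ d >= sigma * t * (d @ d) or t < 1e-12:
                break
            t *= beta
        Fz = F(z)
        # Correction: project x onto the hyperplane {y : F(z)^T (y - z) = 0}.
        x = x - (Fz @ (x - z)) / (Fz @ Fz + 1e-16) * Fz
    return x

# Example: F(x) = x + sin(x) is monotone, so the iterates should drive F toward 0.
# root = projection_method(lambda x: x + np.sin(x), np.ones(5))
```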


2021 ◽  
Vol 12 (3) ◽  
pp. 1-23
Author(s):  
Yan Liu ◽  
Bin Guo ◽  
Daqing Zhang ◽  
Djamal Zeghlache ◽  
Jingmin Chen ◽  
...  

Optimal store placement aims to identify the optimal location for a new brick-and-mortar store that can maximize its sales by analyzing and mining users’ preferences from large-scale urban data. In recent years, the expansion of chain enterprises into new cities has brought challenges in two aspects: (1) data scarcity in new cities, so most existing models tend not to work well (i.e., they overfit), because their superior performance is conditioned on large-scale training samples; (2) data distribution discrepancy among different cities, so knowledge learned from other cities cannot be utilized directly in new cities. In this article, we propose a task-adaptive model-agnostic meta-learning framework, namely, MetaStore, to tackle these two challenges and improve the prediction performance for optimal store placement in new cities with insufficient data, by transferring prior knowledge learned from multiple data-rich cities. Specifically, we develop a task-adaptive meta-learning algorithm to learn city-specific prior initializations from multiple cities, which is capable of handling multimodal data distributions and accelerates adaptation to new cities compared with other methods. In addition, we design an effective learning strategy for MetaStore that promotes faster convergence and optimization by sampling high-quality data for each training batch, in view of the noisy data encountered in practical applications. The extensive experimental results demonstrate that our proposed method leads to state-of-the-art performance compared with various baselines.
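As background for the model-agnostic meta-learning framework the abstract builds on, the following toy sketch shows a generic first-order MAML inner/outer loop on linear regression tasks. It illustrates only the meta-learning scaffolding, not MetaStore's task-adaptive initializations or its data-sampling strategy; the task format and learning rates are assumptions.

```python
# Toy first-order MAML sketch: learn an initialization that adapts quickly per task.
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of 0.5 * mean squared error for a linear model."""
    return X.T @ (X @ theta - y) / len(y)

def meta_train(tasks, dim, inner_lr=0.01, outer_lr=0.1, steps=100):
    """Each task is a (X_support, y_support, X_query, y_query) tuple."""
    theta = np.zeros(dim)
    for _ in range(steps):
        outer_grad = np.zeros(dim)
        for X_s, y_s, X_q, y_q in tasks:
            adapted = theta - inner_lr * mse_grad(theta, X_s, y_s)   # inner adaptation step
            outer_grad += mse_grad(adapted, X_q, y_q)                # first-order outer gradient
        theta -= outer_lr * outer_grad / len(tasks)                  # meta-update
    return theta
```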


Author(s):  
Tai D. Nguyen ◽  
Ronald Gronsky ◽  
Jeffrey B. Kortright

Nanometer-period Ru/C multilayers are among the prime candidates for normal-incidence reflecting mirrors at wavelengths < 10 nm. Superior performance, which requires uniform layers and smooth interfaces, and high stability of the layered structure under thermal loading are some of the demands in practical applications. Previous studies, however, show that the Ru layers in the 2 nm period Ru/C multilayer agglomerate upon moderate annealing, and the layered structure is no longer retained. This agglomeration and crystallization of the Ru layers upon annealing, forming almost spherical crystallites, results from the reduction of surface or interfacial energy from the amorphous, high-energy, non-equilibrium state of the as-prepared sample through diffusive rearrangement of the atoms. Proposed models for the mechanism of thin-film agglomeration include one analogous to the Rayleigh instability and another based on grain boundary grooving in polycrystalline films. These models, however, are not necessarily appropriate for explaining the agglomeration of the sub-nanometer amorphous Ru layers in Ru/C multilayers. The Ru-C phase diagram shows a wide miscibility gap, which indicates a preference for phase separation between these two materials and provides an additional driving force for agglomeration. In this paper, we study the evolution of the microstructure and layered structure via in-situ Transmission Electron Microscopy (TEM) and attempt to determine the order of occurrence of agglomeration and crystallization in the Ru layers by observing the diffraction patterns.


2018 ◽  
Vol 1 (2) ◽  
pp. 34-44
Author(s):  
Faris E Mohammed ◽  
Dr. Eman M ALdaidamony ◽  
Prof. A. M Raid

Individual identification is a significant process that occupies a large portion of day-to-day activities. Identification is required in workplaces, private zones, banks, etc. Individuals possess many characteristics that can be used for recognition, such as the finger vein, iris, and face. Finger vein and iris key-points are considered among the most promising biometric authentication techniques for their security and convenience. SIFT is a new and promising technique for pattern recognition. However, many related techniques suffer from shortcomings such as feature loss, difficulty in feature key-point extraction, and the introduction of noise points. In this manuscript, a new technique named SIFT-based iris and SIFT-based finger vein identification with normalization and enhancement is proposed to achieve better performance. In comparison with other SIFT-based iris or SIFT-based finger vein recognition algorithms, the suggested technique can overcome the difficulties of key-point extraction and exclude noise points without feature loss. Experimental results demonstrate that the normalization and enhancement steps are critical for SIFT-based recognition of the iris and finger vein, and the proposed technique can accomplish satisfactory recognition performance. Keywords: SIFT, Iris Recognition, Finger Vein Identification, Biometric Systems.
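For readers unfamiliar with SIFT-based matching, here is a minimal sketch using OpenCV's SIFT detector and a ratio-test matcher. It only illustrates the generic key-point pipeline and deliberately omits the paper's normalization and enhancement steps; the images, ratio threshold, and match-count scoring are assumptions.

```python
# Sketch of SIFT key-point extraction and ratio-test matching with OpenCV.
import cv2

def sift_match(image_a, image_b, ratio=0.75):
    """Return the number of ratio-test matches between two grayscale images."""
    sift = cv2.SIFT_create()
    kp_a, desc_a = sift.detectAndCompute(image_a, None)
    kp_b, desc_b = sift.detectAndCompute(image_b, None)
    if desc_a is None or desc_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc_a, desc_b, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good)

# A higher match count between an enrollment image and a probe image suggests
# the same iris or finger-vein pattern; a threshold on this count gives a decision.
```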

