Speaker Recognition Based on Fusion of a Deep and Shallow Recombination Gaussian Supervector

Electronics ◽  
2020 ◽  
Vol 10 (1) ◽  
pp. 20
Author(s):  
Linhui Sun ◽  
Yunyi Bu ◽  
Bo Zou ◽  
Sheng Fu ◽  
Pingan Li

Extracting a speaker's personalized feature parameters is vital for speaker recognition, and a single kind of feature cannot fully reflect the speaker's identity information. To represent the speaker's identity more comprehensively and improve the speaker recognition rate, we propose a speaker recognition method based on the fused feature of a deep and shallow recombination Gaussian supervector. In this method, deep bottleneck features are first extracted by a Deep Neural Network (DNN) and used as the input of a Gaussian Mixture Model (GMM) to obtain the deep Gaussian supervector. In parallel, Mel-Frequency Cepstral Coefficients (MFCCs) are fed to the GMM directly to extract the traditional (shallow) Gaussian supervector. Finally, the two categories of features are combined by horizontal dimension augmentation. In addition, to prevent the system recognition rate from falling sharply as the number of speakers to be recognized increases, we introduce an optimization algorithm to find the optimal weight applied before feature fusion. The experimental results indicate that the recognition rate based on the directly fused feature reaches 98.75%, which is 5% and 0.62% higher than that of the traditional feature and the deep bottleneck feature, respectively. When the number of speakers increases, the fusion feature based on optimized weight coefficients improves the recognition rate by a further 0.81%. This validates that the proposed fusion method effectively exploits the complementarity of the different feature types and improves the speaker recognition rate.
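
A minimal sketch of the supervector construction and the weighted fusion described above, assuming a universal background model (UBM) already fitted with scikit-learn; the relevance factor and the fusion weight w are illustrative choices, not values from the paper. The same routine yields the shallow supervector from MFCC frames and the deep supervector from DNN bottleneck frames.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_supervector(ubm: GaussianMixture, frames: np.ndarray, relevance=16.0):
        """MAP-adapt the UBM means to one utterance and stack them into a
        supervector (a common construction; the paper's exact adaptation
        settings are not given in the abstract)."""
        post = ubm.predict_proba(frames)            # (T, M) responsibilities
        n = post.sum(axis=0)                        # soft frame counts per mixture
        f = post.T @ frames                         # first-order statistics, (M, D)
        alpha = (n / (n + relevance))[:, None]
        means = alpha * (f / np.maximum(n[:, None], 1e-8)) + (1 - alpha) * ubm.means_
        return means.ravel()

    def fuse(shallow_sv, deep_sv, w=1.0):
        """Horizontal dimension augmentation; w = 1 is direct fusion, and the
        paper's optimization algorithm would search for the best w instead."""
        return np.concatenate([shallow_sv, w * deep_sv])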


Author(s):  
Teddy Surya Gunawan ◽  
Nur Atikah Muhamat Saleh ◽  
Mira Kartiwi

Nowadays, many beautiful recitations of the Al-Quran are available. Quranic recitation has its own characteristics, and identifying the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop a Quran reciter identification system using Mel-Frequency Cepstral Coefficients (MFCC) and a Gaussian Mixture Model (GMM). A database of five Quranic reciters was developed and used in the training and testing phases. We carefully randomized the database across various surahs of the Quran so that the proposed system is sensitive only to the reciter, not to the recited verses. Around 15 Quranic audio samples from the 5 reciters were collected and randomized, of which 10 samples were used for training the GMM and 5 for testing. Results showed that the proposed system achieves a 100% recognition rate for the five reciters tested, and it is able to reject unknown samples.
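
A minimal sketch of this MFCC + GMM identification recipe, assuming librosa and scikit-learn; the mixture count, MFCC order, and rejection threshold below are illustrative, not the paper's settings.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def train_reciter_models(train_sets, n_mix=32):
        """Fit one GMM per reciter on pooled MFCC frames."""
        models = {}
        for name, wav_paths in train_sets.items():
            frames = []
            for path in wav_paths:
                y, sr = librosa.load(path, sr=16000)
                frames.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T)
            models[name] = GaussianMixture(n_mix, covariance_type='diag').fit(np.vstack(frames))
        return models

    def identify(models, wav_path, reject_threshold=-60.0):
        """Score a test sample against every reciter model; a low best
        likelihood triggers rejection, as for the unknown samples above."""
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
        scores = {n: m.score(mfcc) for n, m in models.items()}  # mean log-likelihood
        best = max(scores, key=scores.get)
        return best if scores[best] > reject_threshold else 'unknown'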



2020 ◽  
Vol 64 (4) ◽  
pp. 40404-1-40404-16
Author(s):  
I.-J. Ding ◽  
C.-M. Ruan

Abstract With rapid developments in techniques related to the Internet of Things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and potentially become a requirement in smart home or office environments. In such intelligent applications, identity recognition of a specific member in an indoor space is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed, in which visual information obtained from face detection is incorporated into acoustic Gaussian likelihood calculations to construct speaker classification trees that significantly enhance the Gaussian mixture model (GMM)-based speaker recognition method. The approach also respects the privacy of the monitored person by reducing the degree of surveillance: only face detection is performed, and the popular Kinect sensor device, which contains a microphone array, is adopted to obtain the person's voice data. The proposed approach deploys only two cameras in the indoor space to conveniently perform face detection and quickly determine the total number of people present, and this head count is used to regulate the design of the GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively; both are higher than the rate of the conventional GMM approach (80.5%). Moreover, because the extremely complex calculations of face recognition required in general audio-visual speaker recognition tasks are avoided, the proposed approach is fast and efficient, adding only 0.051 s to the average recognition time.
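
A rough sketch of the two ingredients, assuming OpenCV face detection and per-speaker scikit-learn GMMs; the two-stage pruning below is only a stand-in for the paper's GMM-BT/GMM-NBT tree construction, whose exact design is not given in the abstract.

    import cv2
    import numpy as np

    def count_people(frames):
        """Estimate how many people are present from the two camera views
        (face detection only; no face recognition, as in the paper)."""
        path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
        detector = cv2.CascadeClassifier(path)
        counts = [len(detector.detectMultiScale(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)))
                  for f in frames]
        return max(counts)

    def prune_and_identify(gmm_models, features, n_present):
        """Face-count-regulated pruning: a coarse pass over a frame subsample
        keeps only as many candidates as faces were seen, then a full
        likelihood pass decides among them."""
        coarse = {n: m.score(features[::4]) for n, m in gmm_models.items()}
        kept = sorted(coarse, key=coarse.get, reverse=True)[:max(n_present, 1)]
        return max(kept, key=lambda n: gmm_models[n].score(features))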



2014 ◽  
Vol 23 (4) ◽  
pp. 359-378
Author(s):  
M. S. Rudramurthy ◽  
V. Kamakshi Prasad ◽  
R. Kumaraswamy

Abstract The performance of most state-of-the-art speaker recognition (SR) systems deteriorates under degraded conditions, owing to the mismatch between training and testing sessions. This study focuses on the front end of the speaker verification (SV) system to reduce this mismatch. An adaptive voice activity detection (VAD) algorithm using a zero-frequency filter assisted peaking resonator (ZFFPR) was integrated into the front end of the SV system, and the system's performance was studied under degraded conditions with 50 selected speakers from the NIST 2003 database. Degraded conditions were simulated by adding different types of noise, chosen from the NOISEX-92 database, to the original speech utterances at signal-to-noise ratios from 0 to 20 dB. Widely used 39-dimensional Mel-frequency cepstral coefficient (MFCC) features (13 MFCCs augmented with 13 velocity and 13 acceleration coefficients) were used, and a Gaussian mixture model–universal background model (GMM–UBM) was used for speaker modeling. The proposed system's performance was compared against an energy-based VAD used as the front end of the SV system, and the proposed SV system showed encouraging results when the proposed VAD was used at its front end.
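
For reference, a minimal energy-based VAD of the kind used here as the baseline front end; the frame sizes and the threshold relative to the utterance peak are illustrative.

    import numpy as np

    def energy_vad(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
        """Keep frames whose log energy is within threshold_db of the
        utterance peak; returns a boolean speech mask per frame."""
        frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
        n = 1 + (len(signal) - frame) // hop
        energy = np.array([np.sum(signal[i * hop:i * hop + frame] ** 2)
                           for i in range(n)])
        log_e = 10 * np.log10(energy + 1e-12)
        return log_e > (log_e.max() + threshold_db)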



Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Zhong Wang ◽  
Peibei Shi

To distinguish between computers and humans, CAPTCHAs are widely used in contexts such as website login and registration. Traditional CAPTCHA recognition methods have poor recognition ability and robustness across different types of verification codes. For this reason, this paper proposes a CAPTCHA recognition method based on a convolutional neural network with a focal loss function. The method improves the traditional VGG network structure and introduces the focal loss to produce a new CAPTCHA recognition model. First, we perform preprocessing steps such as grayscale conversion, binarization, denoising, segmentation, and annotation, and then use the Keras library to build a simple neural network model. In addition, we build an end-to-end neural network model to recognize complex CAPTCHAs with high character adhesion and many interference pixels. Testing on CNKI CAPTCHAs, Zhengfang CAPTCHAs, and randomly generated CAPTCHAs shows that the proposed method achieves better recognition performance and robustness on all three datasets, with recognition rates of 99%, 98.5%, and 97.84%, respectively, and has certain advantages over traditional deep learning methods.
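
A minimal Keras-style sketch of the categorical focal loss the method introduces, in its usual form FL(p_t) = -alpha (1 - p_t)^gamma log(p_t); the gamma and alpha defaults follow Lin et al. rather than values reported in this abstract.

    import tensorflow as tf

    def focal_loss(gamma=2.0, alpha=0.25):
        """Down-weights easy examples so training focuses on hard,
        easily-confused CAPTCHA characters."""
        def loss(y_true, y_pred):
            y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
            ce = -y_true * tf.math.log(y_pred)              # per-class cross entropy
            weight = alpha * tf.pow(1.0 - y_pred, gamma)    # (1 - p_t)^gamma modulation
            return tf.reduce_sum(weight * ce, axis=-1)
        return loss

    # Usage with a VGG-style Keras model (hypothetical model object):
    # model.compile(optimizer='adam', loss=focal_loss(), metrics=['accuracy'])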



2014 ◽  
Vol 536-537 ◽  
pp. 115-120
Author(s):  
Ting Gong ◽  
Yu Biao Liu

The Gabor wavelet is an important technique widely used in image recognition areas such as facial expression recognition. It effectively extracts the most important texture features of a facial expression, but it does not account for relative changes in the important characteristics at each feature-point location. To recognize facial expression information, we fuse the Gabor features with geometric features based on angle changes at key parts of the face, and then design a radial basis function (RBF) neural network as the classifier to perform recognition. Experimental results on a facial expression database indicate that the recognition rate with the fused features is clearly superior to that of the traditional method.
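
A minimal sketch of an RBF-network classifier over the fused features, assuming Gaussian units at k-means centres feeding a linear output layer; the unit count and width are illustrative, and the paper's exact training scheme is not specified.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import RidgeClassifier

    class RBFNetwork:
        def __init__(self, n_centers=20, width=1.0):
            self.km = KMeans(n_clusters=n_centers, n_init=10)
            self.out = RidgeClassifier()
            self.width = width

        def _hidden(self, X):
            # Gaussian activation of each sample against each centre.
            d2 = ((X[:, None, :] - self.km.cluster_centers_[None]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * self.width ** 2))

        def fit(self, X, y):
            self.km.fit(X)
            self.out.fit(self._hidden(X), y)
            return self

        def predict(self, X):
            return self.out.predict(self._hidden(X))

    # Feature fusion: concatenate Gabor texture features with the geometric
    # angle features before training, e.g. X = np.hstack([gabor, angles]).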



Sensors ◽  
2019 ◽  
Vol 19 (21) ◽  
pp. 4709
Author(s):  
Woo Hyun Kang ◽  
Nam Soo Kim

In recent years, various studies have investigated methods for verifying users with a short, randomized pass-phrase, owing to the increasing demand for voice-based authentication systems. In this paper, we propose a novel technique for extracting an i-vector-like feature based on an adversarially learned inference (ALI) model, which summarizes the variability within the Gaussian mixture model (GMM) distribution through a nonlinear process. Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the proposed ALI-based model is trained to generate the GMM supervector according to the maximum likelihood criterion given the Baum–Welch statistics of the input utterance. However, to prevent the potential loss of information caused by the Kullback–Leibler (KL) divergence regularization adopted in VAE training, the proposed ALI-based feature extractor exploits a joint discriminator to ensure that the generated latent variable and GMM supervector are more realistic. The proposed framework is compared with the conventional i-vector and VAE-based methods on the TIDIGITS dataset. Experimental results show that the proposed method represents the uncertainty caused by short duration better than the VAE-based method, and that it performs strongly when applied in association with the standard i-vector framework.
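
For context, a short sketch of the Baum–Welch statistics that both the i-vector extractor and the ALI/VAE models described above consume, assuming a diagonal-covariance UBM fitted with scikit-learn.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def baum_welch_stats(ubm: GaussianMixture, frames: np.ndarray):
        """Zeroth- and first-order statistics of one utterance against the
        UBM, with the first-order statistics centred on the UBM means as
        is standard."""
        gamma = ubm.predict_proba(frames)   # (T, M) frame posteriors
        N = gamma.sum(axis=0)               # zeroth order, per mixture
        F = gamma.T @ frames                # first order, (M, D)
        return N, F - N[:, None] * ubm.means_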



2015 ◽  
Vol 738-739 ◽  
pp. 397-400
Author(s):  
Xin Yan Feng ◽  
Xiao Li Hu ◽  
Jun Yong ◽  
Bo Yang ◽  
Xiao Bin Sun ◽  
...  

To study the different types of partial discharge induced by defects in GIS and to increase the correct identification rate of those defects, four physical models of typical insulation defects are designed based on the insulation defects of 110 kV GIS and their partial discharge characteristics. Ten feature parameters, including the signal peak and kurtosis, are extracted from 222 groups of partial discharge signal data and recognized by a BP neural network whose inputs are optimized by a genetic algorithm. The recognition results show that this method works well, achieving a higher recognition rate than an adaptive-momentum BP neural network.
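
A small sketch of the feature-extraction and classifier stage, assuming NumPy/SciPy/scikit-learn; only the signal peak and kurtosis are named in the abstract, so the remaining descriptors below are placeholders, and the genetic-algorithm optimization of the network inputs is omitted.

    import numpy as np
    from scipy.stats import kurtosis, skew
    from sklearn.neural_network import MLPClassifier

    def pd_features(signal):
        """Statistical descriptors of one partial-discharge signal; the
        paper uses ten such parameters, of which only peak and kurtosis
        are named."""
        return np.array([
            np.max(np.abs(signal)),   # signal peak
            kurtosis(signal),         # kurtosis
            skew(signal),             # placeholder extra descriptor
            np.std(signal),           # placeholder extra descriptor
        ])

    # A plain back-propagation (MLP) classifier stands in for the paper's
    # GA-optimized BP network.
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)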



Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Jie Kang ◽  
Xiao Ying Chen ◽  
Qi Yuan Liu ◽  
Si Han Jin ◽  
Cheng Han Yang ◽  
...  

Microexpressions have extremely high application value in national security, public safety, medical, and other fields. However, microexpressions differ markedly from macroexpressions in characteristics such as short duration and weak intensity changes, which greatly increase the difficulty of microexpression recognition. In this paper, we propose a microexpression recognition method based on multimodal fusion, following a comparative study of traditional microexpression recognition algorithms such as the LBP algorithm and the CNN and LSTM deep learning algorithms. The method couples microexpression image information with corresponding body temperature information to build a multimodal-fusion microexpression database. This paper first describes how to build such a database in a laboratory environment, then compares the recognition accuracy of LBP, LSTM, and CNN + LSTM networks for microexpressions, and finally selects the better-performing CNN + LSTM network for model training and testing on the test set under both the image-only microexpression database and the multimodal fusion database. The experimental results show that the proposed multimodal-fusion method is more accurate after feature fusion than unimodal recognition, reaching a recognition rate of 75.1%, which demonstrates that the method is feasible and effective in improving the microexpression recognition rate and has good practical value.
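
A minimal Keras sketch of a CNN + LSTM model with feature-level fusion of frame images and per-frame body temperature, as described; all shapes, layer sizes, and the class count are illustrative assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # Illustrative shapes: 16-frame clips of 64x64 grayscale faces, plus one
    # temperature reading per frame.
    frames_in = layers.Input(shape=(16, 64, 64, 1))
    temp_in = layers.Input(shape=(16, 1))

    x = layers.TimeDistributed(layers.Conv2D(32, 3, activation='relu'))(frames_in)
    x = layers.TimeDistributed(layers.MaxPooling2D())(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.Concatenate()([x, temp_in])          # multimodal feature fusion
    x = layers.LSTM(64)(x)                          # temporal dynamics
    out = layers.Dense(5, activation='softmax')(x)  # 5 expression classes (assumed)

    model = Model([frames_in, temp_in], out)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])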


