Speaker Identity Recognition by Acoustic and Visual Data Fusion through Personal Privacy for Smart Care and Service Applications

2020 ◽  
Vol 64 (4) ◽  
pp. 40404-1-40404-16
Author(s):  
I.-J. Ding ◽  
C.-M. Ruan

Abstract With rapid developments in techniques related to the Internet of Things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition are gaining much attention and may become a requirement in smart home or office environments. In such intelligent applications, identity recognition of a specific member in an indoor space is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed. In this approach, visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations for constructing speaker classification trees, significantly enhancing the Gaussian mixture model (GMM)-based speaker recognition method. The study considered the privacy of the monitored person and reduced the degree of surveillance. Moreover, the popular Kinect sensor, which contains a microphone array, was adopted to acquire the person's voice data. The proposed audio-visual identity recognition approach deploys only two cameras in a specific indoor space to conveniently perform face detection and quickly determine the total number of people in that space. This people-count information obtained from face detection was then used to regulate the design of an accurate GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively; both are higher than the rate of the conventional GMM approach (80.5%).
Moreover, because the extremely complex face recognition calculations required in general audio-visual speaker recognition tasks are avoided, the proposed approach is rapid and efficient, with only a slight increment of 0.051 s in the average recognition time.
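The acoustic core of such a system is a per-speaker Gaussian likelihood comparison, with the face-detected people count restricting the candidate set. A minimal sketch, not the paper's implementation: the diagonal-covariance GMMs, function names, and candidate-restriction logic are all illustrative assumptions.

```python
import numpy as np

def diag_gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of `frames` (T x D) under a
    diagonal-covariance GMM with K components."""
    diff = frames[:, None, :] - means[None, :, :]                    # (T, K, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_weighted = np.log(weights) + log_comp                        # (T, K)
    m = log_weighted.max(axis=1, keepdims=True)                      # log-sum-exp
    per_frame = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return per_frame.mean()

def identify(frames, speaker_models, candidates):
    """Pick, among the face-detection-restricted `candidates`, the speaker
    whose GMM assigns the test frames the highest average log-likelihood."""
    scores = {s: diag_gmm_loglik(frames, *speaker_models[s]) for s in candidates}
    return max(scores, key=scores.get)
```

Restricting `candidates` to the number of people actually seen by the cameras is what lets the classification tree stay small and accurate.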

Electronics ◽  
2020 ◽  
Vol 10 (1) ◽  
pp. 20
Author(s):  
Linhui Sun ◽  
Yunyi Bu ◽  
Bo Zou ◽  
Sheng Fu ◽  
Pingan Li

Extracting a speaker’s personalized feature parameters is vital for speaker recognition, and a single kind of feature cannot fully reflect the speaker’s personality information. In order to represent the speaker’s identity more comprehensively and improve the speaker recognition rate, we propose a speaker recognition method based on the fusion of deep and shallow recombination Gaussian supervectors. In this method, deep bottleneck features are first extracted by a Deep Neural Network (DNN) and used as the input of a Gaussian Mixture Model (GMM) to obtain the deep Gaussian supervector. On the other hand, we input the Mel-Frequency Cepstral Coefficients (MFCCs) to the GMM directly to extract the traditional Gaussian supervector. Finally, the two categories of features are combined in the form of horizontal dimension augmentation. In addition, to prevent the system recognition rate from falling sharply when the number of speakers to be recognized increases, we introduce an optimization algorithm to find the optimal weights before feature fusion. The experimental results indicate that the speaker recognition rate based on the directly fused feature can reach 98.75%, which is 5% and 0.62% higher than that of the traditional feature and the deep bottleneck feature, respectively. When the number of speakers increases, the fusion feature based on optimized weight coefficients can improve the recognition rate by a further 0.81%. This validates that our proposed fusion method effectively exploits the complementarity of the different types of features and improves the speaker recognition rate.
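The fusion step described above amounts to a weighted horizontal concatenation of the two supervectors. A hedged sketch follows; the function name, the per-stream length normalization, and the single scalar weight `w` are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def fuse_supervectors(deep_sv, mfcc_sv, w=0.5):
    """Fuse a deep bottleneck Gaussian supervector and a traditional MFCC
    Gaussian supervector by weighted horizontal dimension augmentation."""
    deep_sv = deep_sv / np.linalg.norm(deep_sv)    # per-stream length normalization
    mfcc_sv = mfcc_sv / np.linalg.norm(mfcc_sv)
    # Concatenate the weighted streams into one longer fused vector.
    return np.concatenate([w * deep_sv, (1.0 - w) * mfcc_sv])
```

The weight `w` is what the abstract's optimization algorithm would tune as the speaker population grows; `w=0.5` corresponds to direct, unweighted fusion.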


2021 ◽  
Vol 39 (1B) ◽  
pp. 1-10
Author(s):  
Iman H. Hadi ◽  
Alia K. Abdul-Hassan

Speaker recognition depends on specific predefined steps, the most important of which are feature extraction and feature matching. In addition, the category of the speaker's voice features has an impact on the recognition process. The proposed speaker recognition system makes use of biometric (voice) attributes to recognize the identity of the speaker. Long-term features were used, namely maximum frequency, pitch, and zero-crossing rate (ZCR). In the feature matching step, a fuzzy inner product between feature vectors was used to compute the matching value between a claimed speaker's voice utterance and test voice utterances. The experiments were implemented using the ELSDSR dataset and showed a recognition accuracy of 100% for text-dependent speaker recognition.
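Two of the pieces above can be sketched briefly. The ZCR is standard; the fuzzy inner product is shown here in a common min-sum/max-sum form, which is an assumption on my part, since the abstract does not give the exact definition used.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs in the waveform whose signs differ."""
    signs = np.sign(signal)
    signs[signs == 0] = 1                 # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def fuzzy_inner_product(a, b):
    """Fuzzy similarity of two non-negative feature vectors; 1.0 means identical.
    Uses the min-sum / max-sum (Jaccard-like) fuzzy form -- an assumption."""
    return float(np.sum(np.minimum(a, b)) / np.sum(np.maximum(a, b)))
```

In a matching step like the one described, the claimed speaker's stored feature vector would be compared against each test utterance's vector and the highest fuzzy matching value accepted.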


Author(s):  
Musab T. S. Al-Kaltakchi ◽  
Haithem Abd Al-Raheem Taha ◽  
Mohanad Abd Shehab ◽  
Mohamed A.M. Abdullah

In this paper, different feature extraction and feature normalization methods are investigated for speaker recognition. To obtain a good representation of acoustic speech signals, Power Normalized Cepstral Coefficients (PNCCs) and Mel Frequency Cepstral Coefficients (MFCCs) are employed for feature extraction. Then, to mitigate the effect of the linear channel, Cepstral Mean-Variance Normalization (CMVN) and feature warping are utilized. The paper investigates a text-independent speaker identification system using 16 coefficients from both the MFCC and PNCC features. Eight speakers, two female and six male, are selected from the GRID audiovisual database. The speakers are modeled by coupling a Universal Background Model with Gaussian Mixture Models (GMM-UBM) to obtain a fast scoring technique and better performance. The system achieves 100% speaker identification accuracy. The results illustrate that PNCC features perform better than MFCC features when identifying female speakers compared with male speakers. Furthermore, feature warping achieved better performance than the CMVN method.
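CMVN, one of the two normalization methods compared above, standardizes each cepstral coefficient across the frames of an utterance so that channel-induced shifts and scalings are removed. A minimal sketch, assuming per-utterance statistics; the `eps` guard against silent channels is an implementation assumption.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean-variance normalization.
    features: (num_frames, num_coeffs) matrix of MFCC or PNCC vectors.
    Each coefficient ends up with zero mean and unit variance over time."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)
```

Feature warping, which the paper finds stronger, goes further by mapping each coefficient's short-term distribution onto a target (typically Gaussian) shape rather than just matching its first two moments.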


2017 ◽  
Vol 10 (13) ◽  
pp. 140
Author(s):  
Kumari Piu Gorai ◽  
Thomas Abraham

A human being has many unique features, and one of them is voice. Speaker recognition is the use of a system to distinguish and identify a person from his or her vocal sound. A speaker recognition system (SRS) can be used as an authentication technique in addition to conventional authentication methods. This paper presents an overview of voice signal characteristics and speaker recognition techniques. It also discusses the advantages and problems of current SRSs. Since voice-based SRS is the only biometric system that allows users to authenticate remotely, a robust SRS is needed.

