A novel i-vector framework using multiple features and PCA for speaker recognition in short speech condition

Author(s):  
Chi Zhang ◽  
Xiaoqiang Li ◽  
Wei Li ◽  
Peizhong Lu ◽  
Wenqiang Zhang
1995 ◽  
Vol 38 (5) ◽  
pp. 1014-1024 ◽  
Author(s):  
Robert L. Whitehead ◽  
Nicholas Schiavetti ◽  
Brenda H. Whitehead ◽  
Dale Evan Metz

The purpose of this investigation was twofold: (a) to determine if there are changes in specific temporal characteristics of speech that occur during simultaneous communication, and (b) to determine if known temporal rules of spoken English are disrupted during simultaneous communication. Ten speakers uttered sentences consisting of a carrier phrase and experimental CVC words under conditions of: (a) speech, (b) speech combined with signed English, and (c) speech combined with signed English for every word except the CVC word that was fingerspelled. The temporal features investigated included: (a) sentence duration, (b) experimental CVC word duration, (c) vowel duration in experimental CVC words, (d) pause duration before and after experimental CVC words, and (e) consonantal effects on vowel duration. Results indicated that for all durational measures, the speech/sign/fingerspelling condition was longest, followed by the speech/sign condition, with the speech condition being shortest. It was also found that for all three speaking conditions, vowels were longer in duration when preceding voiced consonants than vowels preceding their voiceless cognates, and that a low vowel was longer in duration than a high vowel. These findings indicate that speakers consistently reduced their rate of speech when using simultaneous communication, but did not violate these specific temporal rules of English important for consonant and vowel perception.
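A minimal sketch of how durational measures like these could be aggregated and compared across the three speaking conditions; the table layout, column names, and values below are hypothetical placeholders, not the study's data.

```python
# Minimal sketch (not the authors' analysis): aggregating durational measures
# across the three speaking conditions. Columns and values are hypothetical.
import pandas as pd

# Each row: one utterance of an experimental CVC word by one speaker.
df = pd.DataFrame({
    "speaker": [1, 1, 1, 2, 2, 2],
    "condition": ["speech", "speech+sign", "speech+sign+fingerspelling"] * 2,
    "sentence_ms": [2100, 2600, 3100, 2050, 2550, 3200],
    "vowel_ms": [120, 150, 170, 115, 145, 175],
})

# Mean durations per condition; the abstract reports the ordering
# speech < speech/sign < speech/sign/fingerspelling for all measures.
print(df.groupby("condition")[["sentence_ms", "vowel_ms"]].mean())
```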


1998 ◽  
Vol 14 (3) ◽  
pp. 202-210 ◽  
Author(s):  
Suzanne Skiffington ◽  
Ephrem Fernandez ◽  
Ken McFarland

This study extends previous attempts to assess emotion with single adjective descriptors, by examining semantic as well as cognitive, motivational, and intensity features of emotions. The focus was on seven negative emotions common to several emotion typologies: anger, fear, sadness, shame, pity, jealousy, and contempt. For each of these emotions, seven items were generated corresponding to cognitive appraisal about the self, cognitive appraisal about the environment, action tendency, action fantasy, synonym, antonym, and intensity range of the emotion, respectively. A pilot study established that 48 of the 49 items were linked predominantly to the specific emotions as predicted. The main data set comprising 700 subjects' ratings of relatedness between items and emotions was subjected to a series of factor analyses, which revealed that 44 of the 49 items loaded on the emotion constructs as predicted. A final factor analysis of these items uncovered seven factors accounting for 39% of the variance. These emergent factors corresponded to the hypothesized emotion constructs, with the exception of anger and fear, which were somewhat confounded. These findings lay the groundwork for the construction of an instrument to assess emotions multicomponentially.
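As a rough illustration of the kind of analysis described, a seven-factor exploratory factor analysis over 49 item ratings can be sketched as follows; this uses scikit-learn's FactorAnalysis as a stand-in for the study's actual procedure, and the ratings are random placeholders rather than the real data.

```python
# Minimal sketch (not the study's exact procedure): a seven-factor exploratory
# factor analysis over 49 item ratings from 700 subjects. Data are placeholders.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.normal(size=(700, 49))      # 700 subjects x 49 items (placeholder)

fa = FactorAnalysis(n_components=7, rotation="varimax", random_state=0)
scores = fa.fit_transform(ratings)        # subject scores on the 7 factors
loadings = fa.components_.T               # 49 x 7 item-factor loading matrix

# Inspect which items load most strongly on each factor.
for k in range(7):
    top_items = np.argsort(-np.abs(loadings[:, k]))[:3]
    print(f"factor {k}: strongest items {top_items}")
```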


2020 ◽  
Vol 64 (4) ◽  
pp. 40404-1-40404-16
Author(s):  
I.-J. Ding ◽  
C.-M. Ruan

Abstract With rapid developments in techniques related to the Internet of Things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and potentially become a requirement in smart home or office environments. In such intelligent applications, recognizing the identity of a specific member in an indoor space is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed, in which visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations for constructing speaker classification trees, significantly enhancing the Gaussian mixture model (GMM)-based speaker recognition method. The approach also considers the privacy of the monitored person and reduces the degree of surveillance. The popular Kinect sensor device, which contains a microphone array, was adopted to obtain acoustic voice data from the person. The proposed audio-visual identity recognition approach deploys only two cameras in a specific indoor space to conveniently perform face detection and quickly determine the total number of people in that space. The number of people in the indoor space obtained from face detection was then used to regulate the design of the GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively; both values are higher than the rate of the conventional GMM approach (80.5%). Moreover, because the extremely complex calculations of face recognition required in general audio-visual speaker recognition tasks are not needed, the proposed approach is rapid and efficient, with only a slight increment of 0.051 s in the average recognition time.
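A minimal sketch of the general idea, not the paper's GMM-BT/GMM-NBT tree algorithms: per-speaker GMMs score a test utterance, and the face-detection head count narrows the candidate set before the likelihood comparison. All data, model settings, and the list of present speakers below are hypothetical.

```python
# Minimal sketch of the general idea (not the paper's GMM-BT/GMM-NBT trees):
# score a test utterance against per-speaker GMMs, but use the face-detection
# result to restrict the candidate set before comparing likelihoods.
# All data and parameters are hypothetical placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(features_by_speaker, n_components=8):
    """Fit one GMM per enrolled speaker on acoustic feature frames."""
    gmms = {}
    for spk, feats in features_by_speaker.items():
        gmms[spk] = GaussianMixture(n_components=n_components,
                                    covariance_type="diag",
                                    random_state=0).fit(feats)
    return gmms

def identify(test_frames, gmms, present_speakers):
    """Pick the most likely speaker among those the cameras say are present."""
    candidates = present_speakers or list(gmms)   # fall back to all speakers
    scores = {spk: gmms[spk].score(test_frames) for spk in candidates}
    return max(scores, key=scores.get)

# Usage (placeholder data): 3 enrolled speakers, 2 detected in the room.
rng = np.random.default_rng(0)
enrolled = {f"spk{i}": rng.normal(i, 1.0, size=(200, 13)) for i in range(3)}
gmms = train_speaker_gmms(enrolled)
test = rng.normal(1, 1.0, size=(50, 13))          # frames from an unknown talker
print(identify(test, gmms, present_speakers=["spk0", "spk1"]))
```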


Author(s):  
A. Nagesh

The feature vectors of a speaker identification (SID) system play a crucial role in its overall performance. Many new feature extraction methods based on MFCC have been proposed, but the ultimate goal is to maximize the performance of the SID system. The objective of this paper is to derive a new set of feature vectors based on Gammatone Frequency Cepstral Coefficients (GFCC), modeled with a Gaussian Mixture Model (GMM), for speaker identification. MFCC are the default feature vectors for speaker recognition, but they are not very robust in the presence of additive noise. In recent studies, GFCC features have shown very good robustness against noise and acoustic change. The main idea is that GMM-based modeling of GFCC features improves overall speaker identification performance in low signal-to-noise ratio (SNR) conditions.
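A minimal sketch, under assumptions not stated in the abstract, of a GFCC-style front end (gammatone filterbank energies followed by a DCT) feeding per-speaker GMMs; the filter count, frame sizes, and number of cepstral coefficients are illustrative choices, and the audio is a random placeholder.

```python
# Minimal sketch (an approximation, not the paper's exact pipeline): compute
# GFCC-style features by passing a signal through a bank of gammatone filters,
# taking log energies per frame, applying a DCT, then fitting per-speaker GMMs.
import numpy as np
from scipy.signal import gammatone, lfilter
from scipy.fft import dct
from sklearn.mixture import GaussianMixture

def gfcc(signal, fs, n_filters=32, n_ceps=13, frame_len=400, hop=160):
    # Log-spaced centre frequencies between 50 Hz and just below the Nyquist rate.
    centres = np.geomspace(50, 0.9 * fs / 2, n_filters)
    outputs = []
    for fc in centres:
        b, a = gammatone(fc, "fir", fs=fs)       # 4th-order gammatone FIR filter
        outputs.append(lfilter(b, a, signal))
    outputs = np.stack(outputs)                  # (n_filters, n_samples)

    # Frame the filter outputs and take log energy per band and frame.
    n_frames = 1 + (outputs.shape[1] - frame_len) // hop
    feats = np.empty((n_frames, n_filters))
    for t in range(n_frames):
        seg = outputs[:, t * hop: t * hop + frame_len]
        feats[t] = np.log(np.mean(seg ** 2, axis=1) + 1e-10)

    # Decorrelate band energies with a DCT; keep the first n_ceps coefficients.
    return dct(feats, type=2, norm="ortho", axis=1)[:, :n_ceps]

# Usage (placeholder audio): fit one GMM per speaker on GFCC frames and label a
# test utterance with the speaker whose model gives the highest likelihood.
fs = 16000
rng = np.random.default_rng(0)
train = {spk: gfcc(rng.normal(size=fs * 3), fs) for spk in ("A", "B")}
models = {spk: GaussianMixture(8, covariance_type="diag", random_state=0).fit(f)
          for spk, f in train.items()}
test = gfcc(rng.normal(size=fs), fs)
print(max(models, key=lambda spk: models[spk].score(test)))
```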

