Simultaneous Flexible Keyword Detection and Text-dependent Speaker Recognition for Low-resource Devices

Author(s): Hiroshi Fujimura, Ning Ding, Daichi Hayakawa, Takehiko Kagoshima
2019, Vol 8 (3), pp. 3290-3294

Speech and speaker recognition have not yet reached a fully mature state of the art. Keyword detection in audio clips is gaining importance because it contributes to audio recognition and detection systems, yet relatively little work has been carried out in this area. In this paper, we present an experiment on keyword detection within recorded news clips in the Assamese language, spoken by native Assamese speakers. The audio clips were collected from local TV news debates, while the keywords were recorded by randomly chosen speakers. Keywords were selected for recording on the condition that each appears within the audio clips a finite number of times. Mel-frequency cepstral coefficients (MFCCs) serve as the features, and an auto-associative neural network (AANN) serves as the classifier. With this detection model, an average accuracy of 87% is achieved.
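The pipeline described above can be illustrated with a short sketch: an auto-associative network is trained to reconstruct MFCC frames of a keyword, and low reconstruction error over a sliding window of the clip suggests the keyword is present. This is a minimal sketch only, assuming the librosa and scikit-learn libraries; the file names, window step, and hyperparameters are hypothetical placeholders, not the authors' actual configuration.

```python
# Minimal sketch of the MFCC + AANN keyword-detection pipeline.
# Assumptions: librosa and scikit-learn are available; file paths and
# hyperparameters are illustrative placeholders only.
import numpy as np
import librosa
from sklearn.neural_network import MLPRegressor

def mfcc_frames(path, n_mfcc=13):
    """Load audio and return per-frame MFCC vectors (frames x n_mfcc)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Train one auto-associative network (autoencoder) per keyword: it learns
# to reconstruct MFCC frames of that keyword, so low reconstruction error
# on unseen frames indicates a likely keyword occurrence.
keyword_feats = mfcc_frames("keyword_recording.wav")  # hypothetical file
aann = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000)
aann.fit(keyword_feats, keyword_feats)  # auto-associative: input == target

def detection_score(window_feats, model):
    """Mean MFCC reconstruction error; lower means a better keyword match."""
    recon = model.predict(window_feats)
    return np.mean(np.sum((window_feats - recon) ** 2, axis=1))

# Slide a keyword-sized window across the news clip and score each position.
clip_feats = mfcc_frames("news_clip.wav")  # hypothetical file
win = keyword_feats.shape[0]
scores = [detection_score(clip_feats[i:i + win], aann)
          for i in range(0, len(clip_feats) - win, max(1, win // 2))]
print("best match score:", min(scores))
```

In practice a threshold on the reconstruction error, tuned on held-out recordings, would turn these scores into detect/no-detect decisions.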


2016, Vol 03 (02), pp. 079-083
Author(s): Lawrence Mbuagbaw, Francisca Monebenimp, Bolaji Obadeyi, Grace Bissohong, Marie-Thérèse Obama, et al.

2018, Vol 4 (1), pp. 295-313
Author(s): Karley A Riffe

Faculty work now includes market-like behaviors that create research, teaching, and service opportunities. This study employs an embedded case study design to evaluate the extent to which faculty members interact with external organizations to mitigate financial constraints and how those relationships vary by academic discipline. The findings show a similar number of ties among faculty members in high- and low-resource disciplines, reciprocity between faculty members and external organizations, and an expanded conceptualization of faculty work.


Diabetes, 2018, Vol 67 (Supplement 1), pp. 93-LB
Author(s): Eddy Jean Baptiste, Philippe Larco, Marie-Nancy Charles Larco, Julia E. von Oettingen, Eddlys Dubois, et al.

2020, Vol 64 (4), pp. 40404-1-40404-16
Author(s): I.-J. Ding, C.-M. Ruan

Abstract: With rapid developments in internet-of-things techniques, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition are gaining attention and may become a requirement in smart home or office environments. In such intelligent applications, recognizing the identity of a specific member in an indoor space is a crucial issue. This study developed a combined audio-visual identity recognition approach in which visual information obtained from face detection is incorporated into the acoustic Gaussian likelihood calculations used to construct speaker classification trees, significantly enhancing Gaussian mixture model (GMM)-based speaker recognition. The approach respects the privacy of the monitored person and reduces the degree of surveillance: only two cameras are deployed in the indoor space, used solely to perform face detection and quickly determine the total number of people present, while the popular Kinect sensor, which contains a microphone array, captures the acoustic voice data. The head count obtained from face detection is then used to regulate the design of an accurate GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method: a binary speaker classification tree (GMM-BT) and a non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively, both higher than the 80.5% of the conventional GMM approach. Moreover, because the highly complex face recognition computations required in general audio-visual speaker recognition tasks are avoided, the proposed approach is fast and efficient, adding only 0.051 s to the average recognition time.
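The core idea, scoring an utterance against per-speaker GMMs while letting the face-detection head count prune the candidate set, can be sketched briefly. This is a simplified illustration in the spirit of the abstract, not the paper's actual GMM-BT/GMM-NBT tree construction, which is more involved; the feature arrays, speaker IDs, and the way the head count selects candidates are hypothetical assumptions, using only scikit-learn's GaussianMixture API.

```python
# Illustrative sketch: GMM speaker identification with the candidate set
# restricted by a face-detection head count. Simplified stand-in for the
# paper's face-detection-regulated classification trees (GMM-BT/GMM-NBT).
# Assumptions: scikit-learn is available; features and candidate pruning
# are hypothetical placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(train_data, n_components=16):
    """train_data: dict of speaker_id -> (frames x dims) feature array."""
    models = {}
    for spk, feats in train_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        models[spk] = gmm.fit(feats)
    return models

def recognize(test_feats, models, candidate_speakers):
    """Return the candidate whose GMM yields the highest average
    log-likelihood over the test frames."""
    scores = {spk: models[spk].score(test_feats)  # mean log-likelihood
              for spk in candidate_speakers}
    return max(scores, key=scores.get)

# Toy example: synthetic per-speaker features stand in for real MFCCs.
rng = np.random.default_rng(0)
train = {f"spk{i}": rng.normal(i, 1.0, size=(200, 13)) for i in range(4)}
models = train_speaker_gmms(train)

test = rng.normal(2, 1.0, size=(50, 13))      # frames resembling "spk2"
candidates = ["spk1", "spk2", "spk3"]         # e.g., 3 faces detected
print("identified:", recognize(test, models, candidates))
```

Pruning the candidate list before likelihood scoring is what lets the visual head count speed up and sharpen the acoustic decision, mirroring the efficiency argument the abstract makes for avoiding full face recognition.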

