Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

Author(s): Xiong Xiao, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, et al.
Information, 2020, Vol. 11 (4), pp. 205
Author(s): Nikolaos Vryzas, Nikolaos Tsipas, Charalampos Dimoulas

Radio is evolving in a changing digital media ecosystem. Audio-on-demand has shaped the landscape of big unstructured audio data available online. In this paper, a framework for knowledge extraction is introduced to improve the discoverability and enrichment of the provided content. A web application for live radio production and streaming is developed. The application offers typical live mixing and broadcasting functionality while performing real-time annotation as a background process by logging user operation events. For the needs of a typical radio station, a supervised speaker classification model is trained to recognize 24 known speakers. The model is based on a convolutional neural network (CNN) architecture. Since not all speakers in radio shows are known, a CNN-based speaker diarization method is also proposed. The trained model is used to extract fixed-size identity d-vectors, and several clustering algorithms are evaluated with the d-vectors as input. The supervised speaker recognition model for 24 speakers achieves an accuracy of 88.34%, while unsupervised speaker diarization achieves a maximum accuracy of 87.22%, tested on an audio file with speech segments from three unknown speakers. The results are encouraging regarding the applicability of the proposed methodology.
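As a rough illustration of the unsupervised stage described in this abstract, the sketch below clusters fixed-size speaker embeddings with plain k-means. The "d-vectors" here are synthetic stand-ins (three artificial speakers, 20 segments each), and all function names and parameter choices are hypothetical, not taken from the paper, which evaluates several clustering algorithms.

```python
import numpy as np

def l2_normalize(x):
    """Length-normalise embeddings so cosine distance reduces to Euclidean."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def farthest_point_init(x, k):
    """Deterministic seeding: start from the first point, then repeatedly
    pick the point farthest from all chosen centres."""
    centers = [x[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[dists.argmax()])
    return np.array(centers)

def cluster_dvectors(dvecs, k, iters=30):
    """Plain Lloyd k-means over length-normalised d-vectors; returns a
    per-segment speaker label."""
    x = l2_normalize(dvecs)
    centers = farthest_point_init(x, k)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

# Synthetic stand-ins for d-vectors: 20 segments from each of 3 speakers,
# each speaker clustered around a distinct direction in embedding space.
rng = np.random.default_rng(0)
means = np.eye(3, 8) * 2.0          # 3 well-separated speaker centroids
dvecs = np.vstack([rng.normal(mu, 0.05, size=(20, 8)) for mu in means])
labels = cluster_dvectors(dvecs, k=3)
```

In a real system, the number of speakers is usually unknown, which is why the paper compares clustering algorithms rather than fixing k in advance.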


2019, Vol. 8 (4), pp. 5753-5759

Two common disciplines of speech processing are speaker recognition ("identification and verification of the speaker") and speaker diarization ("who spoke when"). Motivated by various applications in automatic speaker recognition, speaker indexing, word counting, and audio transcription, speaker diarization (SD) has become a significant area of signal processing. The basic design steps of SD are feature extraction, voice activity detection (VAD), segmentation, and clustering. The VAD process is accomplished with a Daubechies 40 discrete wavelet transform (DWT). Initially, the DWT was used for compression, scaling, and denoising of the audio stream, which was then partitioned into small frames of 0.12 seconds. Next, the features of each frame were extracted by applying a nonlinear energy operator (NEO)-based pyknogram. To measure the similarity between frames, a sliding window over the delta-BIC distance metric was applied: a negative output indicates that two adjacent segments belong to the same speaker, while a positive output indicates a speaker change. To improve the output of the segmentation process, resegmentation was applied using the information change rate method. Finally, hierarchical clustering groups the homogeneous segments corresponding to a particular speaker, represented graphically by a dendrogram. The performance of SD was evaluated by F-measure and speaker diarization error rate (SER), and the results were compared with a traditional speaker diarization system that uses MFCC and BIC for segmentation and clustering. The comparison reveals a significant reduction of 12.3% in SER for the proposed diarization system.
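The ΔBIC decision described in this abstract can be sketched as follows, modelling each window (and their union) as a full-covariance Gaussian. The feature windows below are synthetic Gaussian stand-ins rather than NEO-pyknogram features, and the penalty weight `lam` is an assumed tuning parameter.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """ΔBIC between two windows of frame features, each of shape (n_i, d).
    A negative value suggests both windows come from the same speaker;
    a positive value suggests a speaker-change point between them."""
    n1, d = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    # log-determinant of the sample covariance, regularised for stability
    logdet = lambda a: np.linalg.slogdet(
        np.cov(a, rowvar=False) + 1e-6 * np.eye(d))[1]
    # BIC model-complexity penalty for a full-covariance Gaussian
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(np.vstack([x, y]))
                  - n1 * logdet(x) - n2 * logdet(y)) - penalty

# Toy frame features standing in for real acoustic features:
rng = np.random.default_rng(0)
same_a = rng.normal(0.0, 1.0, size=(120, 4))   # window from speaker 1
same_b = rng.normal(0.0, 1.0, size=(120, 4))   # another window, same speaker
other  = rng.normal(4.0, 1.0, size=(120, 4))   # window from a different speaker
```

Sliding this test across adjacent windows and marking the positive peaks yields candidate speaker-change points, which resegmentation can then refine.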


2020, Vol. 175 (31), pp. 1-6
Author(s): Arthav Mane, Janhavi Bhopale, Ria Motghare, Priya Chimurkar

2020, Vol. 64 (4), pp. 40404-1-40404-16
Author(s): I.-J. Ding, C.-M. Ruan

Abstract With rapid developments in techniques related to the Internet of Things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware emotion recognition will gain much attention and potentially become a requirement in smart home or office environments. In such intelligent applications, identity recognition of specific members in indoor spaces is a crucial issue. In this study, a combined audio-visual identity recognition approach was developed. In this approach, visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations to construct speaker classification trees, significantly enhancing the Gaussian mixture model (GMM)-based speaker recognition method. The design considers the privacy of the monitored person and reduces the degree of surveillance. The popular Kinect sensor, which contains a microphone array, was adopted to obtain voice data, and only two cameras are deployed in the indoor space to perform face detection and quickly determine the total number of people present. The head count obtained from face detection is then used to regulate the design of the GMM speaker classification tree. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method: the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve identity recognition rates of 84.28% and 83%, respectively; both are higher than the rate of the conventional GMM approach (80.5%). Moreover, as the extremely complex face recognition calculations of general audio-visual speaker recognition tasks are not required, the proposed approach is fast and efficient, with only a slight increase of 0.051 s in average recognition time.
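A minimal sketch of the core idea, restricting GMM speaker scoring to the candidates confirmed present by face detection, might look like the following. The speaker names, model parameters, and `identify` helper are hypothetical, and a hand-set diagonal-covariance GMM stands in for the paper's trained models; no classification tree is built here.

```python
import numpy as np

def gmm_avg_loglik(x, weights, means, variances):
    """Average per-frame log-likelihood of frames x (n, d) under a
    diagonal-covariance GMM with k components."""
    d = x.shape[1]
    diff2 = ((x[:, None, :] - means[None]) ** 2) / variances[None]  # (n, k, d)
    log_comp = -0.5 * (diff2.sum(-1) + np.log(variances).sum(-1)
                       + d * np.log(2 * np.pi))                     # (n, k)
    log_joint = np.log(weights)[None] + log_comp
    m = log_joint.max(axis=1, keepdims=True)                        # log-sum-exp
    return float((m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))).mean())

def identify(frames, models, candidates):
    """Score only the candidate speakers (e.g. those confirmed present by
    face detection) and return the best-scoring identity."""
    scores = {s: gmm_avg_loglik(frames, *models[s]) for s in candidates}
    return max(scores, key=scores.get)

# Two toy speaker models (weights, component means, component variances):
rng = np.random.default_rng(0)
d, k = 4, 2
models = {
    "alice": (np.array([0.5, 0.5]), rng.normal(0.0, 0.3, (k, d)), np.ones((k, d))),
    "bob":   (np.array([0.5, 0.5]), rng.normal(3.0, 0.3, (k, d)), np.ones((k, d))),
}
test_frames = rng.normal(3.0, 1.0, size=(200, d))  # utterance resembling "bob"
who = identify(test_frames, models, candidates=["alice", "bob"])
```

The benefit claimed by the paper comes from pruning: when face detection reports only two people in the room, only those two models need to be scored, rather than the full enrolled population.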


Author(s): A. Nagesh

The feature vectors of a speaker identification (SID) system play a crucial role in its overall performance. There are many new feature extraction methods based on MFCC, but ultimately the goal is to maximize the performance of the SID system. The objective of this paper is to derive a new set of Gammatone Frequency Cepstral Coefficient (GFCC) feature vectors using a Gaussian mixture model (GMM) for speaker identification. MFCCs are the default feature vectors for speaker recognition, but they are not very robust in the presence of additive noise. In recent studies, GFCC features have shown very good robustness against noise and acoustic change. The main idea is to use GMM-based GFCC feature extraction to improve overall speaker identification performance in low signal-to-noise ratio (SNR) conditions.
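A bare-bones sketch of GFCC-style feature extraction, assuming a 4th-order gammatone filterbank with ERB bandwidths, cubic-root compression, and a DCT across bands. The band count, frequency range, and single-frame handling here are illustrative choices, not the paper's settings.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, sr, dur=0.03, order=4, b=1.019):
    """Finite impulse response of a gammatone filter centred at fc Hz."""
    t = np.arange(int(dur * sr)) / sr
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) \
           * np.cos(2 * np.pi * fc * t)

def gfcc(frame, sr, n_bands=16, n_ceps=8):
    """GFCC-style coefficients for one frame: gammatone band energies,
    cubic-root compression, then a DCT-II across bands."""
    fcs = np.geomspace(100, sr / 2 * 0.9, n_bands)  # log-spaced centres
    energies = []
    for fc in fcs:
        y = np.convolve(frame, gammatone_ir(fc, sr), mode="same")
        energies.append(np.mean(y ** 2))
    loudness = np.cbrt(np.array(energies))          # cubic-root compression
    n = np.arange(n_bands)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_bands))
    return dct @ loudness

# One second of a 440 Hz tone as a stand-in for a speech frame:
sr = 8000
t = np.arange(sr) / sr
feats = gfcc(np.sin(2 * np.pi * 440 * t), sr)
```

The cubic-root compression (instead of the log used in MFCC) is one of the properties usually credited for GFCC's robustness at low SNR, since it limits the influence of low-energy, noise-dominated bands.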

