Mean-Delta Features for Telephone Speech Endpoint Detection

2014 ◽  
Vol 12 (3-4) ◽  
pp. 36-44
Author(s):  
A. Ouzounov

Abstract In this paper, a brief summary of the author's research in the field of contour-based telephone speech Endpoint Detection (ED) is presented. This research includes the development of new robust features for ED (the Mean-Delta feature and the Group Delay Mean-Delta feature) and an estimation of the effect of the analyzed ED features and two additional features in a Dynamic Time Warping fixed-text speaker verification task with short noisy telephone phrases in the Bulgarian language.

2014 ◽  
Vol 14 (2) ◽  
pp. 127-139
Author(s):  
Atanas Ouzounov

Abstract In this study, the efficiency of three features for trajectory-based endpoint detection is experimentally evaluated in a fixed-text Dynamic Time Warping (DTW)-based speaker verification task with short phrases of telephone speech. The employed features are the Modified Teager Energy (MTE) feature, the Energy-Entropy (EE) feature and the Mean-Delta (MD) feature. The utterance boundaries in the endpoint detector are determined by means of a state automaton and a set of thresholds based only on trajectory characteristics. Training and testing were done with noisy telephone speech (short phrases in the Bulgarian language with a length of about 2 s) selected from the BG-SRDat corpus. The experimental results show that the MD feature demonstrates the best performance in the endpoint detection tests in terms of the verification rate.
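
For readers unfamiliar with the DTW matching mentioned above, the following is a minimal sketch of the classic DTW accumulated-cost recursion. It is not the author's verification system: it assumes 1-D sequences and an absolute-difference local cost for brevity, whereas a speaker verification front end would compare sequences of feature vectors. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences.

    Assumption: local cost is the absolute difference of scalar samples;
    real systems would use a vector distance between feature frames.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

Because the warping path may repeat frames, a sequence and a time-stretched copy of it align at zero cost, which is why accurate endpoints matter: leading or trailing noise frames would otherwise be forced into the alignment.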


2017 ◽  
Vol 17 (4) ◽  
pp. 114-133
Author(s):  
Atanas Ouzounov

Abstract This paper proposes a new contour-based speech endpoint detector which combines the log-Group Delay Mean-Delta (log-GDMD) feature, an adaptive two-threshold scheme and an eight-state automaton. The adaptive threshold scheme uses two pairs of thresholds, one for the starting point and one for the ending point. Each pair of thresholds is calculated using the contour characteristics in the corresponding region of the utterance. The experimental results show that the proposed detector demonstrates better performance than the Long-Term Spectral Divergence (LTSD) detector in terms of endpoint accuracy. Additional fixed-text speaker verification tests with short phrases of telephone speech based on the Dynamic Time Warping (DTW) and left-to-right Hidden Markov Model (HMM) frameworks confirm the improvements in the verification rate due to the better endpoint accuracy.
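
The general idea of locating endpoints on a feature contour with thresholds and a state machine can be sketched as follows. This is a deliberately simplified illustration, not the paper's detector: it uses a two-state automaton instead of eight states, fixed thresholds instead of adaptive ones, and a generic per-frame contour instead of the log-GDMD feature. The names `detect_endpoints`, `high`, `low` and `min_frames` are assumptions introduced for the example.

```python
def detect_endpoints(contour, high, low, min_frames=3):
    """Return (start, end) frame indices of the first utterance, or None.

    Hypothetical simplified automaton:
      SILENCE -> SPEECH when the contour stays at or above `high`
                 for `min_frames` consecutive frames;
      SPEECH  -> end of utterance when the contour drops below `low`.
    """
    state = "SILENCE"
    start = None
    run = 0  # consecutive frames at or above `high` while in SILENCE
    for i, value in enumerate(contour):
        if state == "SILENCE":
            if value >= high:
                run += 1
                if run >= min_frames:
                    state = "SPEECH"
                    start = i - min_frames + 1
            else:
                run = 0  # a short burst is treated as noise
        else:  # state == "SPEECH"
            if value < low:
                return start, i - 1
    if state == "SPEECH":
        # The contour never fell below `low`; the utterance runs to the end
        return start, len(contour) - 1
    return None
```

Using a lower threshold to leave the speech state than to enter it gives hysteresis, so small dips inside an utterance do not end it prematurely.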


2019 ◽  
Vol 9 (13) ◽  
pp. 2636 ◽  
Author(s):  
Yan Shi ◽  
Juanjuan Zhou ◽  
Yanhua Long ◽  
Yijie Li ◽  
Hongwei Mao

Automatic speaker verification (ASV) has achieved significant progress in recent years. However, it is still very challenging to generalize ASV technologies to new, unknown and spoofing conditions. Most previous studies focused on extracting speaker information from natural speech. This paper attempts to address speaker verification from another perspective: speaker identity information was exploited from singing speech. We first designed and released a new corpus for speaker verification based on singing and normal reading speech. Then, speaker discrimination was compared and analyzed between natural and singing speech in different feature spaces. Furthermore, the conventional Gaussian mixture model, dynamic time warping and a state-of-the-art deep neural network were investigated. They were used to build text-dependent ASV systems with different training-test conditions. Experimental results show that the voiceprint information in singing speech was more distinguishable than that in normal speech. A relative reduction in equal error rate of more than 20% was obtained on both the gender-dependent and gender-independent 1 s-1 s evaluation tasks.


Author(s):  
Vincent Wan

This chapter describes the adaptation and application of kernel methods for speech processing. It is divided into two sections dealing with speaker verification and isolated-word speech recognition applications. Significant advances in kernel methods have been realised in the field of speaker verification, particularly relating to the direct scoring of variable-length speech utterances by sequence kernel SVMs. The improvements are so substantial that most state-of-the-art speaker recognition systems now incorporate SVMs. We describe the architecture of some of these sequence kernels. Speech recognition presents additional challenges to kernel methods and their application in this area is not as straightforward as for speaker verification. We describe a sequence kernel that uses dynamic time warping to capture temporal information within the kernel directly. The formulation also extends the standard dynamic time-warping algorithm by enabling the dynamic alignment to be computed in a high-dimensional space induced by a kernel function. This kernel is shown to work well in an application for recognising low-intelligibility speech of severely dysarthric individuals.


2021 ◽  
Vol 2021 ◽  
pp. 1-19
Author(s):  
Chekhaprabha Priyadarshanee Waduge ◽  
Naleen Chaminda Ganegoda ◽  
Darshana Chitraka Wickramarachchi ◽  
Ravindra Shanthakumar Lokupitiya

Summarizing or averaging a sequential data set (i.e., a set of time series) can be comprehensively approached with sophisticated computational tools. Averaging under Dynamic Time Warping (DTW) is one such tool that captures consensus patterns. DTW acts as a similarity measure between time series, and an averaging method must then be built upon the behaviour of DTW. However, averaging under DTW somewhat neglects the temporal aspect, since it searches for similar appearances rather than staying with corresponding time-points. By contrast, the mean series carrying point-wise averages provides only a weak consensus pattern, as it may over-smooth important temporal variations. As a compromise, a pool of consensus series termed Ultimate Tamed Series (UTS) is studied here that adheres to temporal decomposition supported by the discrete Haar wavelet. We claim that UTS summarizes localized patterns, which would not be reachable via the series under DTW or the mean series. The neighbourhood of localization can be altered, as a user can customize different levels of decomposition. In validation, comparisons are carried out with the series under DTW and the mean series via Euclidean distance and the distance resulting from DTW itself. Two sequential data sets are selected for this purpose from a standard repository.
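
As background for the Haar-based decomposition mentioned above, one level of the standard orthonormal discrete Haar transform can be sketched as below. This illustrates only the wavelet step, not the UTS construction itself, whose details are not given in the abstract. The function name `haar_step` is hypothetical, and an even-length input is assumed.

```python
import math


def haar_step(series):
    """One level of the orthonormal discrete Haar transform.

    Assumption: the input has even length. Returns (approx, detail), where
    approx holds scaled pairwise sums (a coarse view of the series) and
    detail holds scaled pairwise differences (the local variation).
    """
    if len(series) % 2 != 0:
        raise ValueError("series length must be even")
    s = math.sqrt(2.0)
    approx = [(series[i] + series[i + 1]) / s for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / s for i in range(0, len(series), 2)]
    return approx, detail
```

Applying the step again to the approximation coefficients yields the next, coarser level, which is the sense in which a user can choose different levels of decomposition.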

