speech segment
Recently Published Documents

TOTAL DOCUMENTS: 72 (five years: 14)
H-INDEX: 11 (five years: 1)

2021 ◽  
Vol 11 (21) ◽  
pp. 10079
Author(s):  
Muhammad Firoz Mridha ◽  
Abu Quwsar Ohi ◽  
Muhammad Mostafa Monowar ◽  
Md. Abdul Hamid ◽  
Md. Rashedul Islam ◽  
...  

Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built in two stages: the first extracts low-dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system depends mainly on the process used to extract the speech embeddings, which are typically pre-trained on a large-scale dataset. Because the embedding systems are pre-trained, the performance of speaker recognition models depends heavily on the domain adaptation policy and may degrade if the adaptation data are inadequate. This paper introduces a speaker recognition strategy for unlabeled data that generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy rests on the assumption that a small speech segment contains a single speaker. Based on this assumption, pairwise constraints are constructed with noise augmentation policies and used to train an AutoEmbedder architecture that generates speaker embeddings. Without relying on a domain adaptation policy, the process produces clusterable speaker embeddings in an unsupervised manner, termed unsupervised vectors (u-vectors). The evaluation is conducted on two popular English speaker recognition datasets, TIMIT and LibriSpeech. A Bengali dataset is also included to illustrate the diversity of domain shifts faced by speaker recognition systems. We conclude that the proposed approach achieves satisfactory performance using pairwise architectures.
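The single-speaker-per-frame assumption above can be turned into training pairs in a few lines. The following is a minimal sketch, not the authors' implementation: `augment` stands in for whatever noise-augmentation transform is used, and the cannot-link rule (two different short frames belong to different speakers) is the approximation the abstract describes.

```python
import random

def make_pairwise_constraints(frames, augment):
    """Build (anchor, other, must_link) pairs under the assumption that
    each short frame contains exactly one speaker.

    `augment` is a hypothetical identity-preserving noise augmentation;
    must_link=1 pairs a frame with its augmented copy, must_link=0
    pairs it with a different frame (assumed different speaker).
    """
    pairs = []
    for i, frame in enumerate(frames):
        # Must-link: augmentation keeps the speaker identity.
        pairs.append((frame, augment(frame), 1))
        # Cannot-link: another frame, assumed to be another speaker.
        j = random.choice([k for k in range(len(frames)) if k != i])
        pairs.append((frame, frames[j], 0))
    return pairs
```

Pairs of this shape are exactly what a pairwise (Siamese-style) embedding loss consumes: pull must-link embeddings together, push cannot-link ones apart.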


2021 ◽  
Vol 263 (2) ◽  
pp. 4570-4580
Author(s):  
Liu Ting ◽  
Luo Xinwei

The recognition accuracy of speech and noise signals degrades greatly at low signal-to-noise ratios. A neural network whose parameters are obtained from a training set can achieve good results on the existing data, but performs poorly on samples with different environmental noises. The proposed method first extracts features based on the physical characteristics of the speech signal, which are highly robust. It takes 3-second segments as samples, judges whether a speech component is present under low signal-to-noise ratios, and assigns a decision tag to each segment. If a reasonable trajectory resembling that of speech is found, the 3-second segment is judged to contain speech. Dynamic double-threshold processing is then used for preliminary detection, after which a global double threshold is obtained by K-means clustering. Finally, the detection results are obtained by sequential decision. The method has the advantages of low complexity, strong robustness, and adaptability to many languages. Experimental results show that it outperforms traditional methods at various signal-to-noise ratios and adapts well to multiple languages.
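The global double-threshold step can be sketched as 1-D 2-means clustering over frame energies. This is an illustrative reconstruction, not the authors' code: the feature (short-time energy) and the way thresholds are derived from the two cluster centres are assumptions.

```python
def short_time_energy(signal, frame_len=160):
    """Frame-wise energy of a 1-D signal (list of floats)."""
    return [sum(x * x for x in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def kmeans_thresholds(energies, iters=20):
    """1-D 2-means on frame energies; returns (low, high) thresholds
    placed between each cluster centre and their midpoint, a simple
    stand-in for the paper's global double-threshold step."""
    lo, hi = min(energies), max(energies)  # initial centres
    for _ in range(iters):
        a = [e for e in energies if abs(e - lo) <= abs(e - hi)]
        b = [e for e in energies if abs(e - lo) > abs(e - hi)]
        if a:
            lo = sum(a) / len(a)
        if b:
            hi = sum(b) / len(b)
    mid = (lo + hi) / 2
    return (lo + mid) / 2, (hi + mid) / 2
```

Frames above the high threshold would be marked speech outright; frames between the two thresholds would only count when adjacent to a high-energy run, which is the usual double-threshold hysteresis.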


Author(s):  
Mohd Hamid Raza

This paper provides basic information on the phonological networks and social identity of heritage languages. The phonological networks convey the classification of the sound systems, while social identity marks the differences among native speakers of the heritage languages. The problem investigated is how a particular speech segment creates variation among speakers of different languages in speech communities. The objective of this paper is to identify the unique segments of the heritage languages and to show how these segments reveal the social identity of speakers in a particular speech community. The researcher collected primary and secondary data from recording devices and from speakers of the heritage languages. The sample reflects the social characteristics of respondents, both male and female, aged between twenty and forty. The data were collected through observation, interviews, and the available literature on the heritage languages. For the primary data, a high-quality tape recorder was placed near the respondents' mouths during the interviews. After collection, the data were analysed on phonetic and phonological grounds to determine the social identity of the respondents. The results show that one particular speech segment represents the social identity of the speakers. In conclusion, Urdu has different types of speech segments covering all the processes of production, transmission, and perception.


Author(s):  
Olexandra Popova ◽  
Oleg Bolgar ◽  
Tomashevska Anastasiia

The work is devoted to the study of the peculiarities of translating the stylistic features of a TV interviewer's English speech into Ukrainian. Correct translation of the content of a speech segment requires good knowledge of vocabulary, the ability to recognize the stylistic features of a foreign language in communication, and command of the techniques used to render them in the native language. Translating metaphors, metonymy, and comparisons is particularly difficult, yet it is precisely the ability to apply various techniques that helps the translator convey the meaning of a statement to the listener adequately. The television interview is characterized as an independent journalistic genre: a kind of television communication, a purposeful individual-social speech phenomenon that consists of organized interaction between the speakers and finds its expression in a specific dialogically constructed form. The results of this study show that each translated stylistic unit is characterized by a set of transformations typical for it. In most cases, these transformations involve lexical and syntactic stylistic devices. At the level of vocabulary, the speech of a TV interviewer is characterized by a significant number of colloquialisms, colloquial cliches, and phraseological units. At the level of syntax, the typical indicators of the TV interviewer's conversational style are parallel constructions, repeated requests, elliptical sentences, repetitions, unfinished phrases, and the absence of inversion in interrogative sentences. The information obtained through the convergence of stylistic devices, as a set of components that together with other linguistic units shape expressiveness, emotiveness, and evaluation, is one of the important sources of the pragmatic function of language.


2020 ◽  
Vol 10 (20) ◽  
pp. 7385
Author(s):  
Seung-Jun Lee ◽  
Hyuk-Yoon Kwon

In this paper, we propose a preprocessing strategy for denoising speech data based on speech segment detection. A computationally efficient denoising design is necessary for scalable processing of large-scale datasets, and it has become more important with the development of deep-learning-based methods, which show high performance but incur significant costs. The basic idea of the proposed method is to use speech segment detection to exclude non-speech segments before denoising. Speech segment detection can exclude, at negligible cost, the non-speech segments that would otherwise be removed during the much more expensive denoising process, while maintaining denoising accuracy. First, we devise a framework that chooses the best detection-based preprocessing method for a target environment. For this, we simulate denoising environments using different levels of signal-to-noise ratio (SNR) and multiple evaluation metrics; the framework then selects the speech segment detection method best tailored to the target environment according to a performance evaluation of the candidate methods. Next, we investigate the accuracy of the speech segment detection methods extensively: we evaluate five methods across SNR levels and evaluation metrics, and show in particular that the precision-recall trade-off of each method can be adjusted by controlling a parameter. Finally, we incorporate the best speech segment detection method for a target environment into the denoising process.
Through extensive experiments, we show that the accuracy of the proposed scheme is comparable to, or even better than, that of WaveNet-based denoising, a recent advanced denoising method based on deep neural networks, in terms of multiple denoising metrics (SNR, STOI, and PESQ), while reducing the denoising time of the WaveNet-based method by approximately 40-50% depending on the speech segment detection method used.
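The core of the strategy is a cheap gate in front of an expensive denoiser. A minimal sketch, with `is_speech` and `denoise` as hypothetical stand-ins for the detector and the WaveNet-style denoiser:

```python
def denoise_pipeline(segments, is_speech, denoise):
    """Run the costly `denoise` only on segments the cheap detector
    flags as speech; non-speech segments pass through untouched,
    which is where the 40-50% time saving comes from."""
    return [denoise(seg) if is_speech(seg) else seg for seg in segments]
```

Any detector that keeps recall high on true speech preserves denoising accuracy, since skipped segments would have been discarded by the denoiser anyway.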


2020 ◽  
Vol 30 (01) ◽  
pp. 2050003
Author(s):  
Wenjie Peng ◽  
Kaiqi Fu ◽  
Wei Zhang ◽  
Yanlu Xie ◽  
Jinsong Zhang

Pitch-range estimation from brief speech segments could benefit many tasks, such as automatic speech recognition and speaker recognition. To estimate pitch range, previous studies proposed deep-learning-based models with spectral information as input, and demonstrated that such methods achieve reliable estimates even when the speech segment is as brief as 300 ms. In this study, we evaluated the robustness of this method under the following scenarios: (1) a large number of training speakers; (2) different language backgrounds; and (3) monosyllabic utterances with different tones. Experimental results showed that: (1) using a large number of training speakers improved estimation accuracy; (2) the mean absolute percentage error (MAPE) on L2 speakers is similar to that on native speakers; and (3) tonal information affects the LSTM-based model, but its influence is limited compared with the baseline method, which calculates pitch-range targets from the distribution of F0 values. These results verify the efficiency of the LSTM-based pitch-range estimation method.
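The MAPE metric used above is standard and worth spelling out; targets and predictions here would be pitch-range values (e.g. in Hz or semitones):

```python
def mape(targets, predictions):
    """Mean absolute percentage error: the average of |t - p| / |t|,
    expressed as a percentage. Assumes no target is zero."""
    return 100.0 * sum(abs((t - p) / t)
                       for t, p in zip(targets, predictions)) / len(targets)
```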


2020 ◽  
Vol 10 (4) ◽  
pp. 1273 ◽  
Author(s):  
Özlem BATUR DİNLER ◽  
Nizamettin AYDIN

Speech segment detection based on gated recurrent unit (GRU) recurrent neural networks for the Kurdish language was investigated in the present study. The novelties of this research are the use of a GRU for Kurdish speech segment detection, the creation of a unique Kurdish-language database, and the optimization of processing parameters for Kurdish speech segmentation. This study is the first attempt to find the optimal feature parameters of the model and to build a large Kurdish vocabulary dataset for speech segment detection based on consonant, vowel, and silence (C/V/S) discrimination. For this purpose, four window sizes and three window types, combined with three hybrid feature vector techniques, were used to describe the phoneme boundaries. Identification of the phoneme boundaries with a GRU recurrent neural network was performed using six different classification algorithms for C/V/S discrimination. We demonstrate that the GRU model achieves outstanding speech segmentation performance in characterizing Kurdish acoustic signals. The experimental findings show the significance of speech segment detection that effectively exploits hybrid features, window sizes, window types, and classification models for Kurdish speech.
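The window size and window type tuned in the study enter at the framing stage, before any features are computed. A generic sketch of that stage (the specific sizes, hop, and window names here are illustrative, not taken from the paper):

```python
import math

def frame_signal(signal, win_len, hop, window="hamming"):
    """Split a signal into overlapping frames, each multiplied by the
    chosen analysis window; window size and type are exactly the
    parameters a segmentation study sweeps over."""
    if window == "hamming":
        w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win_len - 1))
             for n in range(win_len)]
    elif window == "hanning":
        w = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1))
             for n in range(win_len)]
    else:  # rectangular
        w = [1.0] * win_len
    return [[s * wn for s, wn in zip(signal[i:i + win_len], w)]
            for i in range(0, len(signal) - win_len + 1, hop)]
```

Each resulting frame would then be mapped to a feature vector and fed to the GRU, which labels the frame sequence for C/V/S boundaries.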


Author(s):  
Muhammad Fahreza Alghifari ◽  
Teddy Surya Gunawan ◽  
Mimi Aminah Wan Nordin ◽  
Mira Kartiwi ◽  
Lihanna Borhan

2019 ◽  
Vol 53 (2) ◽  
pp. 241-264
Author(s):  
Pierre Arbajian ◽  
Ayman Hajja ◽  
Zbigniew W. Raś ◽  
Alicja A. Wieczorkowska
