speech segment
Recently Published Documents

TOTAL DOCUMENTS: 72 (five years: 14)
H-INDEX: 11 (five years: 1)

2021 ◽  
Vol 11 (21) ◽  
pp. 10079
Author(s):  
Muhammad Firoz Mridha ◽  
Abu Quwsar Ohi ◽  
Muhammad Mostafa Monowar ◽  
Md. Abdul Hamid ◽  
Md. Rashedul Islam ◽  
...  

Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built in two stages: the first extracts low-dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system depends mainly on the process used to extract the speech embeddings, which are typically pre-trained on a large-scale dataset. Because the embedding systems are pre-trained, the performance of speaker recognition models depends heavily on the domain adaptation policy and may degrade if the adaptation data are inadequate. This paper introduces a speaker recognition strategy for unlabeled data that generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy rests on the assumption that a small speech segment contains a single speaker. Based on this assumption, pairwise constraints are constructed with noise augmentation policies and used to train an AutoEmbedder architecture that generates speaker embeddings. Without relying on a domain adaptation policy, the process produces clusterable speaker embeddings in an unsupervised manner, termed unsupervised vectors (u-vectors). The evaluation is conducted on two popular English speaker recognition datasets, TIMIT and LibriSpeech. A Bengali dataset is also included to illustrate the diversity of domain shifts faced by speaker recognition systems. We conclude that the proposed approach achieves satisfactory performance using pairwise architectures.
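The single-speaker-per-frame assumption above can be turned into training pairs in a few lines. The following is a minimal sketch, not the authors' implementation: `augment` stands in for whatever noise-augmentation transform is used, and the cannot-link rule (two different short frames belong to different speakers) is the approximation the abstract describes.

```python
import random

def make_pairwise_constraints(frames, augment):
    """Build (anchor, other, must_link) pairs under the assumption that
    each short frame contains exactly one speaker.

    `augment` is a hypothetical identity-preserving noise augmentation;
    must_link=1 pairs a frame with its augmented copy, must_link=0
    pairs it with a different frame (assumed different speaker).
    """
    pairs = []
    for i, frame in enumerate(frames):
        # Must-link: augmentation keeps the speaker identity.
        pairs.append((frame, augment(frame), 1))
        # Cannot-link: another frame, assumed to be another speaker.
        j = random.choice([k for k in range(len(frames)) if k != i])
        pairs.append((frame, frames[j], 0))
    return pairs
```

Pairs of this shape are exactly what a pairwise (Siamese-style) embedding loss consumes: pull must-link embeddings together, push cannot-link ones apart.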


2021 ◽  
Vol 263 (2) ◽  
pp. 4570-4580
Author(s):  
Liu Ting ◽  
Luo Xinwei

The recognition accuracy of speech and noise signals degrades greatly at low signal-to-noise ratios. A neural network whose parameters are obtained from a training set can achieve good results on the existing data, but performs poorly on samples with different environmental noises. The proposed method first extracts features based on the physical characteristics of the speech signal, which are highly robust. It takes 3-second segments as samples, judges whether a speech component is present under low signal-to-noise ratios, and assigns a decision tag to each segment. If a reasonable trajectory resembling that of speech is found, the 3-second segment is judged to contain speech. Dynamic double-threshold processing is then used for preliminary detection, after which a global double threshold is obtained by K-means clustering. Finally, the detection results are obtained by sequential decision. The method has the advantages of low complexity, strong robustness, and adaptability to many languages. Experimental results show that it outperforms traditional methods at various signal-to-noise ratios and adapts well to multiple languages.
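The global double-threshold step can be sketched as 1-D 2-means clustering over frame energies. This is an illustrative reconstruction, not the authors' code: the feature (short-time energy) and the way thresholds are derived from the two cluster centres are assumptions.

```python
def short_time_energy(signal, frame_len=160):
    """Frame-wise energy of a 1-D signal (list of floats)."""
    return [sum(x * x for x in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def kmeans_thresholds(energies, iters=20):
    """1-D 2-means on frame energies; returns (low, high) thresholds
    placed between each cluster centre and their midpoint, a simple
    stand-in for the paper's global double-threshold step."""
    lo, hi = min(energies), max(energies)  # initial centres
    for _ in range(iters):
        a = [e for e in energies if abs(e - lo) <= abs(e - hi)]
        b = [e for e in energies if abs(e - lo) > abs(e - hi)]
        if a:
            lo = sum(a) / len(a)
        if b:
            hi = sum(b) / len(b)
    mid = (lo + hi) / 2
    return (lo + mid) / 2, (hi + mid) / 2
```

Frames above the high threshold would be marked speech outright; frames between the two thresholds would only count when adjacent to a high-energy run, which is the usual double-threshold hysteresis.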


Author(s):  
Mohd Hamid Raza

This paper provides basic information on the phonological networks and social identity of heritage languages. The phonological networks convey the classification of the sound systems, while social identity marks the differences among native speakers of the heritage languages. The problem investigated is how a particular speech segment creates variation among speakers of different languages in speech communities. The objective of this paper is to identify the unique segments of the heritage languages and to show how these segments reveal the social identity of speakers in a particular speech community. The researcher collected primary and secondary data from recording devices and from speakers of the heritage languages. The sample reflects the social characteristics of respondents, both male and female, aged between twenty and forty. The data were collected through observation, interviews, and the available literature on the heritage languages. For the primary data, a high-quality tape recorder was placed near the respondents' mouths during the interviews. After collection, the data were analysed on phonetic and phonological grounds to determine the social identity of the respondents. The results show that one particular speech segment represents the social identity of the speakers. In conclusion, Urdu has different types of speech segments covering all the processes of production, transmission, and perception.


Author(s):  
Olexandra Popova ◽  
Oleg Bolgar ◽  
Tomashevska Anastasiia

The work is devoted to the study of the peculiarities of translating the stylistic features of a TV interviewer's English speech into Ukrainian. Correct translation of the content of a speech segment requires good knowledge of vocabulary, the ability to recognize the stylistic features of a foreign language in communication, and command of the techniques used to render them in the native language. Translating metaphors, metonymy, and comparisons is particularly difficult, yet it is precisely the ability to apply various techniques that helps the translator convey the meaning of a statement to the listener adequately. The television interview is characterized as an independent journalistic genre: a kind of television communication, a purposeful individual-social speech phenomenon that consists of organized interaction between the speakers and finds its expression in a specific dialogically constructed form. The results of this study show that each translated stylistic unit is characterized by a set of transformations typical for it. In most cases, these transformations involve lexical and syntactic stylistic devices. At the level of vocabulary, the speech of a TV interviewer is characterized by a significant number of colloquialisms, colloquial cliches, and phraseological units. At the level of syntax, the typical indicators of the TV interviewer's conversational style are parallel constructions, repeated requests, elliptical sentences, repetitions, unfinished phrases, and the absence of inversion in interrogative sentences. The information obtained through the convergence of stylistic devices, as a set of components that together with other linguistic units shape expressiveness, emotiveness, and evaluation, is one of the important sources of the pragmatic function of language.


2020 ◽  
Vol 10 (20) ◽  
pp. 7385
Author(s):  
Seung-Jun Lee ◽  
Hyuk-Yoon Kwon

In this paper, we propose a preprocessing strategy for denoising speech data based on speech segment detection. A computationally efficient denoising design is necessary for scalable processing of large-scale datasets, and it has become more important with the development of deep-learning-based methods, which show high performance but incur significant costs. The basic idea of the proposed method is to use speech segment detection to exclude non-speech segments before denoising. Speech segment detection can exclude, at negligible cost, the non-speech segments that would otherwise be removed during the much more expensive denoising process, while maintaining denoising accuracy. First, we devise a framework that chooses the best detection-based preprocessing method for a target environment. For this, we simulate denoising environments using different levels of signal-to-noise ratio (SNR) and multiple evaluation metrics; the framework then selects the speech segment detection method best tailored to the target environment according to a performance evaluation of the candidate methods. Next, we investigate the accuracy of the speech segment detection methods extensively: we evaluate five methods across SNR levels and evaluation metrics, and show in particular that the precision-recall trade-off of each method can be adjusted by controlling a parameter. Finally, we incorporate the best speech segment detection method for a target environment into the denoising process.
Through extensive experiments, we show that the accuracy of the proposed scheme is comparable to, or even better than, that of WaveNet-based denoising, a recent advanced denoising method based on deep neural networks, in terms of multiple denoising metrics (SNR, STOI, and PESQ), while reducing the denoising time of the WaveNet-based method by approximately 40-50% depending on the speech segment detection method used.
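The core of the strategy is a cheap gate in front of an expensive denoiser. A minimal sketch, with `is_speech` and `denoise` as hypothetical stand-ins for the detector and the WaveNet-style denoiser:

```python
def denoise_pipeline(segments, is_speech, denoise):
    """Run the costly `denoise` only on segments the cheap detector
    flags as speech; non-speech segments pass through untouched,
    which is where the 40-50% time saving comes from."""
    return [denoise(seg) if is_speech(seg) else seg for seg in segments]
```

Any detector that keeps recall high on true speech preserves denoising accuracy, since skipped segments would have been discarded by the denoiser anyway.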


2020 ◽  
Vol 30 (01) ◽  
pp. 2050003
Author(s):  
Wenjie Peng ◽  
Kaiqi Fu ◽  
Wei Zhang ◽  
Yanlu Xie ◽  
Jinsong Zhang

Pitch-range estimation from brief speech segments could benefit many tasks, such as automatic speech recognition and speaker recognition. To estimate pitch range, previous studies proposed deep-learning-based models with spectral information as input, and demonstrated that such methods achieve reliable estimates even when the speech segment is as brief as 300 ms. In this study, we evaluated the robustness of this method under the following scenarios: (1) a large number of training speakers; (2) different language backgrounds; and (3) monosyllabic utterances with different tones. Experimental results showed that: (1) using a large number of training speakers improved estimation accuracy; (2) the mean absolute percentage error (MAPE) on L2 speakers is similar to that on native speakers; and (3) tonal information affects the LSTM-based model, but its influence is limited compared with the baseline method, which calculates pitch-range targets from the distribution of F0 values. These results verify the efficiency of the LSTM-based pitch-range estimation method.
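The MAPE metric used above is standard and worth spelling out; targets and predictions here would be pitch-range values (e.g. in Hz or semitones):

```python
def mape(targets, predictions):
    """Mean absolute percentage error: the average of |t - p| / |t|,
    expressed as a percentage. Assumes no target is zero."""
    return 100.0 * sum(abs((t - p) / t)
                       for t, p in zip(targets, predictions)) / len(targets)
```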


2020 ◽  
Vol 10 (4) ◽  
pp. 1273 ◽  
Author(s):  
Özlem BATUR DİNLER ◽  
Nizamettin AYDIN

Speech segment detection based on gated recurrent unit (GRU) recurrent neural networks for the Kurdish language was investigated in the present study. The novelties of this research are the use of a GRU for Kurdish speech segment detection, the creation of a unique Kurdish-language database, and the optimization of processing parameters for Kurdish speech segmentation. This study is the first attempt to find the optimal feature parameters of the model and to build a large Kurdish vocabulary dataset for speech segment detection based on consonant, vowel, and silence (C/V/S) discrimination. For this purpose, four window sizes and three window types, combined with three hybrid feature vector techniques, were used to describe the phoneme boundaries. Identification of the phoneme boundaries with a GRU recurrent neural network was performed using six different classification algorithms for C/V/S discrimination. We demonstrate that the GRU model achieves outstanding speech segmentation performance in characterizing Kurdish acoustic signals. The experimental findings show the significance of speech segment detection that effectively exploits hybrid features, window sizes, window types, and classification models for Kurdish speech.
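The window size and window type tuned in the study enter at the framing stage, before any features are computed. A generic sketch of that stage (the specific sizes, hop, and window names here are illustrative, not taken from the paper):

```python
import math

def frame_signal(signal, win_len, hop, window="hamming"):
    """Split a signal into overlapping frames, each multiplied by the
    chosen analysis window; window size and type are exactly the
    parameters a segmentation study sweeps over."""
    if window == "hamming":
        w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win_len - 1))
             for n in range(win_len)]
    elif window == "hanning":
        w = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1))
             for n in range(win_len)]
    else:  # rectangular
        w = [1.0] * win_len
    return [[s * wn for s, wn in zip(signal[i:i + win_len], w)]
            for i in range(0, len(signal) - win_len + 1, hop)]
```

Each resulting frame would then be mapped to a feature vector and fed to the GRU, which labels the frame sequence for C/V/S boundaries.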


Author(s):  
Muhammad Fahreza Alghifari ◽  
Teddy Surya Gunawan ◽  
Mimi Aminah Wan Nordin ◽  
Mira Kartiwi ◽  
Lihanna Borhan

2019 ◽  
Vol 53 (2) ◽  
pp. 241-264
Author(s):  
Pierre Arbajian ◽  
Ayman Hajja ◽  
Zbigniew W. Raś ◽  
Alicja A. Wieczorkowska
