Synchronising speech segments with musical beats in Mandarin and English singing

2021 ◽  
Author(s):  
Cong Zhang ◽  
Jian Zhu

Generating synthesised singing voices with models trained on speech data has many advantages owing to the models' flexibility and controllability. However, since information about the temporal relationship between segments and beats is lacking in speech training data, the synthesised singing may sound off-beat at times. Information on the temporal relationship between speech segments and musical beats is therefore crucial. The current study investigated segment-beat synchronisation in singing data, with hypotheses formed on the basis of the linguistic theories of the P-centre and the sonority hierarchy. A Mandarin corpus and an English corpus of professional singing data were manually annotated and analysed. The results showed that the presence of musical beats depended more on segment duration than on sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variation despite exhibiting common patterns.
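To make the annotation-based analysis concrete, here is a minimal sketch (in Python, not the authors' code) of how segment-beat synchronisation might be checked: each annotated segment interval is tested for whether a musical beat falls inside it. The segment labels, timings, and beat times are hypothetical placeholders.

```python
# Minimal sketch: locate each musical beat within annotated speech segments.
# Segments are (label, start, end) tuples in seconds; beats are timestamps in
# seconds. All values below are hypothetical placeholders.

def beats_per_segment(segments, beats):
    """Return, for each segment, the beat timestamps that fall inside it."""
    hits = {label: [] for label, _, _ in segments}
    for t in beats:
        for label, start, end in segments:
            if start <= t < end:
                hits[label].append(t)
                break
    return hits

# Hypothetical annotation of one sung syllable "ma" split into onset and rhyme.
segments = [("m (onset)", 0.00, 0.08), ("a (rhyme)", 0.08, 0.42)]
beats = [0.10, 0.40]  # hypothetical beat times

print(beats_per_segment(segments, beats))
# -> {'m (onset)': [], 'a (rhyme)': [0.1, 0.4]}
```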

Author(s):  
Yuewen Cao ◽  
Songxiang Liu ◽  
Shiyin Kang ◽  
Na Hu ◽  
Peng Liu ◽  
...  

Author(s):  
Kazuo Ueda ◽  
Valter Ciocca

Intelligibility of temporally degraded speech was investigated with locally time-reversed speech (LTR) and its interrupted version (ILTR). Control stimuli comprising interrupted speech (I) were also included. Speech stimuli consisted of 200 Japanese meaningful sentences. In interrupted stimuli, speech segments were alternated with either silent gaps or pink noise bursts. The noise bursts had a level of −10, 0, or +10 dB relative to the speech level. Segment duration varied from 20 to 160 ms for ILTR sentences, but was fixed at 160 ms for I sentences. At segment durations between 40 and 80 ms, severe reductions in intelligibility were observed for ILTR sentences compared with LTR sentences. A substantial improvement in intelligibility (30–33%) was observed when 40-ms silent gaps in ILTR were replaced with 0 and +10 dB noise. Noise with a level of −10 dB had no effect on intelligibility. These findings show that the combined effects of interruptions and temporal reversal of speech segments on intelligibility are greater than the sum of each individual effect. The results also support the idea that illusory continuity induced by high-level noise bursts improves the intelligibility of ILTR and I sentences.
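A minimal sketch, not the authors' stimulus-generation code, of how LTR and ILTR stimuli can be produced from a mono waveform: the signal is reversed within fixed-length segments, and every other segment is then replaced with silence. The function names, segment length, and sine-wave "utterance" are illustrative assumptions; real stimuli would use recorded sentences and, for the noise conditions, pink-noise bursts scaled to the stated relative levels.

```python
import numpy as np

def locally_time_reverse(signal, sr, segment_ms):
    """Reverse the waveform within consecutive fixed-length segments (LTR)."""
    seg_len = int(sr * segment_ms / 1000)
    out = signal.copy()
    for start in range(0, len(signal), seg_len):
        out[start:start + seg_len] = signal[start:start + seg_len][::-1]
    return out

def interrupt(signal, sr, segment_ms, filler=0.0):
    """Replace every other segment with a filler value (0.0 gives silent gaps;
    a noise burst of matching length would give the noise conditions)."""
    seg_len = int(sr * segment_ms / 1000)
    out = signal.copy()
    for i, start in enumerate(range(0, len(signal), seg_len)):
        if i % 2 == 1:  # every second segment is interrupted
            out[start:start + seg_len] = filler
    return out

# Hypothetical usage with a 1-second sine "utterance" at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 220 * t)
ltr = locally_time_reverse(speech, sr, segment_ms=40)
iltr = interrupt(ltr, sr, segment_ms=40)  # ILTR with silent gaps
```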


2014 ◽  
Vol 5 (2) ◽  
pp. 231-251 ◽  
Author(s):  
Shu-Chuan Tseng

This paper presents a study of segment duration in Chinese disyllabic words. The study accounts for boundary-related factors at the levels of syllable, word, prosodic unit, and discourse unit. Face-to-face conversational speech data annotated with signal-aligned, multi-layer linguistic information were used for the analysis. A series of quantitative results show that Chinese disyllabic words have a long first-syllable onset and a long second-syllable rhyme, suggesting an edge effect of disyllabic words. This is in line with disyllabic merger in Chinese, which preserves the onset of the first syllable and the rhyme of the second syllable. A shortening effect at the initial positions of prosodic and discourse units is due to a duration reduction of the second-syllable onset, whereas the common phenomenon of pre-boundary lengthening is mainly a result of prolongation of the second-syllable rhyme, including the glide, nucleus, and coda. Morphologically inseparable disyllabic words in principle follow the "long first onset and long second rhyme" duration pattern, but diverse duration patterns were found in words with head-complement and stem-suffix constructions, suggesting that word morphology may also play a role in determining the duration pattern of Chinese disyllabic words in conversational speech.
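As an illustration of the kind of measurement involved (not the paper's actual pipeline), the sketch below averages constituent durations from hypothetical time-aligned annotations of disyllabic words; all record values are invented placeholders.

```python
from collections import defaultdict
from statistics import mean

# Minimal sketch: average duration (ms) per syllable constituent from
# hypothetical time-aligned annotations of two disyllabic words.
# Each record: (word_index, constituent, start_s, end_s).
annotations = [
    (1, "onset1", 0.00, 0.09), (1, "rhyme1", 0.09, 0.20),
    (1, "onset2", 0.20, 0.26), (1, "rhyme2", 0.26, 0.41),
    (2, "onset1", 0.50, 0.61), (2, "rhyme1", 0.61, 0.70),
    (2, "onset2", 0.70, 0.75), (2, "rhyme2", 0.75, 0.93),
]

durations = defaultdict(list)
for _, constituent, start, end in annotations:
    durations[constituent].append((end - start) * 1000)

for constituent, values in durations.items():
    print(f"{constituent}: {mean(values):.0f} ms")
```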


10.14311/1105 ◽  
2009 ◽  
Vol 49 (2) ◽  
Author(s):  
J. Rajnoha

Automatic speech recognition (ASR) systems frequently work in noisy environments. As they are often trained on clean speech data, noise reduction or adaptation techniques are applied to decrease the influence of background disturbance, even under unknown conditions. Speech data mixed with noise recordings from a particular environment are often used for model adaptation. This paper analyses the improvement in recognition performance within such adaptation when multi-condition training data from a real environment are used to train the initial models. Although the quality of such models can decrease when noise is present in the training material, they are assumed to include initial information about noise and consequently to support the adaptation procedure. Experimental results show that the proposed training method yields a significant improvement in a robust ASR task under unknown noisy conditions. Word error rate decreased by 29% and 14% relative to clean-speech training data for the non-adapted and adapted systems, respectively.
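A minimal sketch of how multi-condition or adaptation data of this kind is commonly prepared: noise is scaled and added to clean speech at a target signal-to-noise ratio. The function, signals, and SNR value below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it."""
    noise = np.resize(noise, clean.shape)          # loop/trim noise to speech length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical usage: 1 s of "speech" and a background noise recording at 16 kHz.
sr = 16000
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 150 * np.arange(sr) / sr)
background = rng.normal(size=sr // 2)              # stands in for a real noise recording
noisy = mix_at_snr(speech, background, snr_db=10)  # one multi-condition training example
```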


2020 ◽  
Vol 10 (18) ◽  
pp. 6155
Author(s):  
Byung Ok Kang ◽  
Hyeong Bae Jeon ◽  
Jeon Gue Park

We propose two approaches to speech recognition for task domains with sparse matched training data. One is an active learning method that selects training data for the target domain from another, general domain that already has a significant amount of labeled speech data, using attribute-disentangled latent variables. For the active learning process, we designed an integrated system consisting of a variational autoencoder with an encoder that infers latent variables with disentangled attributes from the input speech, and a classifier that selects training data with attributes matching the target domain. The other approach combines data augmentation for generating matched target-domain speech data with transfer learning based on teacher/student learning. To evaluate the proposed methods, we experimented with various task domains with sparse matched training data. The experimental results show that the proposed method has qualitative characteristics suitable for the desired purpose; it outperforms random selection and is comparable to using an equal amount of additional target-domain data.
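The selection step of such an active-learning scheme might look like the sketch below, which ranks a general-domain pool by a classifier's probability of matching the target domain. The paper's system uses a variational autoencoder with attribute-disentangled latent variables; here the encoder is omitted, and random vectors plus a logistic-regression classifier stand in for it, so everything shown is a hypothetical placeholder for the selection logic only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical latent attribute vectors for utterances in a general-domain pool,
# standing in for the output of a (not shown) disentangling encoder.
rng = np.random.default_rng(0)
pool_latents = rng.normal(size=(1000, 16))

# A small labelled seed set: 1 = "matches target domain", 0 = otherwise.
seed_latents = rng.normal(size=(100, 16))
seed_labels = rng.integers(0, 2, size=100)

# Train the attribute classifier on the seed set, then rank the pool.
clf = LogisticRegression(max_iter=1000).fit(seed_latents, seed_labels)
scores = clf.predict_proba(pool_latents)[:, 1]

# Select the utterances most likely to match the target domain.
n_select = 200
selected = np.argsort(scores)[::-1][:n_select]
print("indices of selected pool utterances:", selected[:10])
```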


Corpora ◽  
2016 ◽  
Vol 11 (3) ◽  
pp. 401-431 ◽  
Author(s):  
Robert Fromont ◽  
Kevin Watson

Automatically time-aligning utterances at the segmental level is increasingly common practice in phonetic and sociophonetic work because of the obvious benefits it brings in allowing the efficient scaling up of the amount of speech data that can be analysed. The field is arriving at a set of recommended practices for improving alignment accuracy, but methodological differences across studies (e.g., the use of different languages and different measures of accuracy) often mean that direct comparison of the factors which facilitate or hinder alignment can be difficult. In this paper, following a review of the state of the art in automatic segmental alignment, we test the effects of a number of factors on its accuracy. Namely, we test the effects of: (1) the presence or absence of pause markers in the training data, (2) the presence of overlapping speech or other noise, (3) using training data from single or multiple speakers, (4) using different sampling rates, (5) using pre-trained acoustic models versus models trained ‘from scratch’, and (6) using different amounts of training data. For each test, we examine three different varieties of English, from New Zealand, the USA and the UK. The paper concludes with some recommendations for automatic segmental alignment in general.
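Alignment accuracy in studies like this is often quantified as the offset between automatic and manual segment boundaries; the sketch below computes the mean absolute offset and the proportion of boundaries within a tolerance. The tolerance, boundary values, and function name are illustrative assumptions rather than the paper's exact metric.

```python
import numpy as np

def boundary_accuracy(auto_bounds, manual_bounds, tolerance_ms=20):
    """Mean absolute offset (ms) and % of boundaries within a tolerance."""
    auto = np.asarray(auto_bounds)
    manual = np.asarray(manual_bounds)
    offsets_ms = np.abs(auto - manual) * 1000
    within = np.mean(offsets_ms <= tolerance_ms) * 100
    return offsets_ms.mean(), within

# Hypothetical boundaries (seconds) from an aligner vs. a human annotator.
auto = [0.11, 0.27, 0.43, 0.80]
manual = [0.10, 0.25, 0.46, 0.79]
mean_offset, pct_within = boundary_accuracy(auto, manual)
print(f"mean offset: {mean_offset:.1f} ms, within 20 ms: {pct_within:.0f}%")
```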


2020 ◽  
Vol 2 (1) ◽  
pp. 53
Author(s):  
Julisah Izar ◽  
Muhammad Muslim Nasution ◽  
Mie Ratnasari

The main objective of this study is to identify the types and functions of the assertive speech acts that appear in the Mata Najwa programme, episode Gara-Gara Corona. The method used is descriptive and qualitative, analysing and explaining the data obtained. The data were speech segments in the Mata Najwa episode Gara-Gara Corona that indicate assertive speech acts. The data source is the video of the episode entitled "Gara-Gara Corona" taken from YouTube, published by Narration Newsroom on March 13, 2020. Data were collected in three steps: first, downloading the video from YouTube; second, listening to the utterances; and third, transcribing the utterances into written language. The speech data obtained were then selected according to the research questions on the types and functions of assertive speech acts. The results of the analysis are presented informally, that is, in ordinary words. The results show that 23 pairs of utterances containing assertive speech acts appeared in the programme: 11 pairs of telling, 6 pairs of stating, 3 pairs of suggesting, and 3 pairs of boasting assertive speech acts.


Author(s):  
Wening Mustikarini ◽  
Risanuri Hidayat ◽  
Agus Bejo

Automatic Speech Recognition (ASR) is a technology that uses machines to process and recognize human speech. One way to increase the recognition rate is to use a model of the language to be recognized. In this paper, a speech recognition application is introduced to recognize the words "atas" (up), "bawah" (down), "kanan" (right), and "kiri" (left). This research used 400 samples of speech data: 75 samples per word for training and 25 samples per word for testing. The speech recognition system was designed using 13 Mel Frequency Cepstral Coefficients (MFCC) as features and a Support Vector Machine (SVM) as the classifier. The system was tested with linear and RBF kernels, various cost values, and three sample sizes (n = 25, 75, 50). The best average accuracy was obtained with the SVM using a linear kernel, a cost value of 100, and a data set of 75 samples per class. During the training phase, the system achieved an F1-score (the trade-off between precision and recall) of 80% for the word "atas", 86% for "bawah", 81% for "kanan", and 100% for "kiri". Using 25 new samples per class in the testing phase, the F1-score was 76% for the "atas" class, 54% for "bawah", 44% for "kanan", and 100% for "kiri".
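A minimal sketch of the described pipeline (13 MFCCs per clip, linear-kernel SVM with cost 100), assuming librosa and scikit-learn are available; the file paths are hypothetical placeholders, and averaging MFCCs over time is a simplification that may differ from the paper's exact feature handling.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def mfcc_features(path, n_mfcc=13):
    """Load a clip and average 13 MFCCs over time into one feature vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical file lists; the paper uses 75 training and 25 test clips per word.
train_files = {"atas": ["atas_01.wav"], "bawah": ["bawah_01.wav"],
               "kanan": ["kanan_01.wav"], "kiri": ["kiri_01.wav"]}
test_files = {"atas": ["atas_76.wav"], "bawah": ["bawah_76.wav"],
              "kanan": ["kanan_76.wav"], "kiri": ["kiri_76.wav"]}

def build(dataset):
    """Turn a {label: [paths]} mapping into feature and label arrays."""
    X, y = [], []
    for label, paths in dataset.items():
        for path in paths:
            X.append(mfcc_features(path))
            y.append(label)
    return np.array(X), np.array(y)

X_train, y_train = build(train_files)
X_test, y_test = build(test_files)

clf = SVC(kernel="linear", C=100).fit(X_train, y_train)  # best setting reported
pred = clf.predict(X_test)
print(f1_score(y_test, pred, average=None, labels=["atas", "bawah", "kanan", "kiri"]))
```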

