Hypo and Hyperarticulated Speech Data Augmentation for Spontaneous Speech Recognition

Author(s):  
Sung Joo Lee ◽  
Byung-Ok Kang ◽  
Hoon Chung ◽  
Jeon Gue Park ◽  
Yun Keun Lee
2020 ◽  
Vol 10 (18) ◽  
pp. 6155
Author(s):  
Byung Ok Kang ◽  
Hyeong Bae Jeon ◽  
Jeon Gue Park

We propose two approaches to handle speech recognition for task domains with sparse matched training data. One is an active learning method that selects training data for the target domain from another general domain that already has a significant amount of labeled speech data. This method uses attribute-disentangled latent variables. For the active learning process, we designed an integrated system consisting of a variational autoencoder with an encoder that infers latent variables with disentangled attributes from the input speech, and a classifier that selects training data with attributes matching the target domain. The other method combines data augmentation methods for generating matched target domain speech data and transfer learning methods based on teacher/student learning. To evaluate the proposed method, we experimented with various task domains with sparse matched training data. The experimental results show that the proposed method has qualitative characteristics suited to the desired purpose, outperforms random selection, and is comparable to using an equal amount of additional target domain data.
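The selection step described above can be illustrated with a minimal sketch. It assumes each utterance in the general-domain pool already has a target-domain score from the classifier; the scores, utterance IDs, and function names below are hypothetical and not taken from the paper. The sketch only contrasts score-based selection with the random-selection baseline under a fixed budget.

```python
import random
from typing import List, Sequence


def select_by_score(pool: Sequence[str], scores: Sequence[float], budget: int) -> List[str]:
    """Pick the `budget` utterances with the highest target-domain score."""
    if len(pool) != len(scores):
        raise ValueError("pool and scores must have the same length")
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return [utt for utt, _ in ranked[:budget]]


def select_random(pool: Sequence[str], budget: int, seed: int = 0) -> List[str]:
    """Random-selection baseline with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return rng.sample(list(pool), budget)


# Hypothetical pool and classifier scores (assumptions for illustration only).
pool = ["utt_a", "utt_b", "utt_c", "utt_d"]
scores = [0.10, 0.92, 0.55, 0.31]

print(select_by_score(pool, scores, budget=2))  # ['utt_b', 'utt_c']
print(select_random(pool, budget=2))
```

In a real system the scores would be produced by the classifier operating on the encoder's latent variables; here they are fixed numbers so the ranking logic can be read in isolation.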


2008 ◽  
Vol 155 ◽  
pp. 23-52
Author(s):  
Elma Nap-Kolhoff ◽  
Peter Broeder

This study compares pronominal possessive constructions in Dutch first language (L1) acquisition, second language (L2) acquisition by young children, and untutored L2 acquisition by adults. The L2 learners all have Turkish as L1. In longitudinal spontaneous speech data for four L1 learners, seven child L2 learners, and two adult learners, remarkable differences and similarities between the three learner groups were found. In some respects, the child L2 learners develop in a way that is similar to child L1 learners, for instance in the kind of overgeneralisations that they make. However, the child L2 learners also behave like adult L2 learners, namely in the pace of the acquisition process, the frequency and persistence of non-target constructions, and the difficulty in acquiring reduced pronouns. The similarities between the child and adult L2 learners are remarkable, because the child L2 learners were only two years old when they started learning Dutch. L2 acquisition before the age of three is often considered to be similar to L1 acquisition. The findings might be attributable to the relatively small amount of Dutch language input the L2 children received.


2022 ◽  
Vol 14 (2) ◽  
pp. 614
Author(s):  
Taniya Hasija ◽  
Virender Kadyan ◽  
Kalpna Guleria ◽  
Abdullah Alharbi ◽  
Hashem Alyami ◽  
...  

Speech recognition has been an active field of research in the last few decades since it facilitates better human–computer interaction. Native language automatic speech recognition (ASR) systems are still underdeveloped. Punjabi ASR systems are in their infancy because most research has been conducted only on adult speech systems, and less work has been performed on Punjabi children’s ASR systems. This research aimed to build a prosodic feature-based automatic children’s speech recognition system using discriminative modeling techniques. The corpus of Punjabi children’s speech has various runtime challenges, such as acoustic variations with varying speakers’ ages. Efforts were made to implement out-domain data augmentation to overcome such issues using a Tacotron-based text-to-speech synthesizer. The prosodic features were extracted from the Punjabi children’s speech corpus, then particular prosodic features were coupled with Mel Frequency Cepstral Coefficient (MFCC) features before being submitted to an ASR framework. The system modeling process investigated various approaches, which included Maximum Mutual Information (MMI), Boosted Maximum Mutual Information (bMMI), and feature-based Maximum Mutual Information (fMMI). The out-domain data augmentation was performed to enhance the corpus. After that, prosodic features were also extracted from the extended corpus, and experiments were conducted on both individual and integrated prosodic-based acoustic features. It was observed that the fMMI technique exhibited 20% to 25% relative improvement in word error rate compared with MMI and bMMI techniques. Further, it was enhanced using an augmented dataset and hybrid front-end features (MFCC + POV + Fo + Voice quality) with a relative improvement of 13% compared with the earlier baseline system.
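Two ideas in the abstract above lend themselves to a short sketch: coupling prosodic features with MFCC features frame by frame, and reporting a relative (rather than absolute) improvement in word error rate. The function names, feature dimensions, and WER values below are hypothetical and chosen only for illustration; the abstract does not give the underlying absolute WERs or the exact feature layout.

```python
from typing import List, Sequence


def concat_frame_features(
    mfcc: Sequence[Sequence[float]], prosodic: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Append each frame's prosodic values to that frame's MFCC vector."""
    if len(mfcc) != len(prosodic):
        raise ValueError("both feature streams must have the same number of frames")
    return [list(m) + list(p) for m, p in zip(mfcc, prosodic)]


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, expressed as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical two-frame example: 3 MFCC values plus 2 prosodic values per frame.
mfcc_frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
prosodic_frames = [[0.7, 120.0], [0.9, 125.0]]
print(concat_frame_features(mfcc_frames, prosodic_frames))

# Hypothetical WERs (not from the paper): 20.0% down to 17.4%.
print(f"{relative_improvement(20.0, 17.4):.1%}")  # 13.0%
```

Note that a 13% relative improvement from a 20.0% baseline corresponds to only 2.6 absolute points, which is why the two kinds of figure should not be compared directly.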


Author(s):  
Conghui Tan ◽  
Di Jiang ◽  
Jinhua Peng ◽  
Xueyang Wu ◽  
Qian Xu ◽  
...  

Due to the rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model with complete data as before. In this paper, we propose a novel Divide-and-Merge paradigm to solve salient problems plaguing the ASR field. In the Divide phase, multiple acoustic models are trained based upon different subsets of the complete speech data, while in the Merge phase two novel algorithms are utilized to generate a high-quality acoustic model based upon those trained on data subsets. We first propose the Genetic Merge Algorithm (GMA), which is a highly specialized algorithm for optimizing acoustic models but suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior performance. Extensive experiments on public data show that the proposed methods can significantly outperform the state-of-the-art.


2020 ◽  
Vol 8 (2) ◽  
pp. 117-141
Author(s):  
Alberto Rodríguez Márquez

The objective of this paper is to describe the prosodic features of the final intonation contour of minor intonational phrases (ip) and the tonemes of major intonational phrases (IP) in Mexico City’s Spanish variety. The speech data was taken from a spontaneous speech corpus of speakers from two social networks: neighborhood and labor. Final intonation contours of ip show a predominantly rising movement. These contours are generally produced with greater length in the last syllable of the ip, which represents the most significant difference between both networks in the case of oxytone endings. On the other hand, tonemes are predominantly descending, although the circumflex accent has an important number of cases within the data set. Tonemes produced by the neighborhood network are produced with greater length than those from the labor network.


Author(s):  
Kate Broome ◽  
Patricia McCabe ◽  
Kimberley Docking ◽  
Maree Doble ◽  
Bronwyn Carrigg

Purpose: This study aimed to provide detailed descriptive information about the speech of a heterogeneous cohort of children with autism spectrum disorder (ASD) and to explore whether subgroups exist based on this detailed speech data. High rates of delayed and disordered speech in both low-verbal and high-functioning children with ASD have been reported. There is limited information regarding the speech abilities of young children across a range of functional levels. Method: Participants were 23 children aged 2;0–6;11 (years;months) with a diagnosis of ASD. Comprehensive speech and language assessments were administered. Independent and relational speech analyses were conducted from single-word naming tasks and spontaneous speech samples. Hierarchical clustering based on language, nonverbal communication, and spontaneous speech descriptive data was completed. Results: Independent and relational speech analyses are reported. These variables are used in the cluster analyses, which identified three distinct subgroups: (a) children with high language and high speech ability (n = 10), (b) children with low expressive language and low speech ability but higher receptive language and use of gestures (n = 3), and (c) children with low language and low speech development (n = 10). Conclusions: This is the first study to provide detailed descriptive speech data of a heterogeneous cohort of children with ASD and use this information to statistically explore potential subgroups. Clustering suggests a small number of children present with low levels of speech and expressive language in the presence of better receptive language and gestures. This communication profile warrants further exploration. Replicating these findings with a larger cohort of children is needed. Supplemental Material: https://doi.org/10.23641/asha.16906978
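As a rough illustration of the hierarchical clustering step mentioned in the Method, the sketch below runs a small agglomerative clustering with average linkage and Euclidean distance. Both choices are assumptions, since the abstract does not state the linkage or distance measure, and the two-dimensional points are hypothetical scores rather than study data.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple


def agglomerative(points: Sequence[Tuple[float, ...]], n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    Returns clusters as lists of indices into `points`.
    """
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]

    def average_link(a: List[int], b: List[int]) -> float:
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations() yields index pairs with i < j, so deleting j is safe.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical (language score, speech score) pairs, for illustration only.
scores = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
print(agglomerative(scores, n_clusters=3))  # [[0, 1], [2, 3], [4]]
```

In practice the number of clusters would be chosen from the dendrogram or a validity index rather than fixed in advance as it is here.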

