Performance of Forced-Alignment Algorithms on Children's Speech

Author(s):  
Tristan J. Mahr ◽  
Visar Berisha ◽  
Kan Kawabata ◽  
Julie Liss ◽  
Katherine C. Hustad

Purpose: Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers.
Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals.
Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy for fricatives increased with age across the aligners.
Conclusion: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, and fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors.
Supplemental Material: https://doi.org/10.23641/asha.14167058
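The two scoring criteria described above are simple arithmetic once the manual and automatic phone intervals have been paired up. A minimal sketch of that computation (the data layout and function names are illustrative assumptions, not the authors' code):

```python
# Each phone interval is an (onset, offset) pair in seconds; the
# manual and automatic lists are assumed to be matched phone-by-phone.

def midpoint_accuracy(manual, automatic):
    """Fraction of phones whose automatic interval covers the
    midpoint of the corresponding manual interval."""
    hits = 0
    for (m_on, m_off), (a_on, a_off) in zip(manual, automatic):
        midpoint = (m_on + m_off) / 2
        if a_on <= midpoint <= a_off:
            hits += 1
    return hits / len(manual)

def onset_differences(manual, automatic):
    """Absolute differences in phone-onset times (seconds)."""
    return [abs(a_on - m_on)
            for (m_on, _), (a_on, _) in zip(manual, automatic)]

# Two-phone example:
manual = [(0.10, 0.25), (0.25, 0.40)]
auto = [(0.12, 0.27), (0.28, 0.41)]
print(midpoint_accuracy(manual, auto))  # 1.0 (both midpoints covered)
print(onset_differences(manual, auto))  # ~[0.02, 0.03]
```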

2020 ◽  
Author(s):  
Tristan Mahr ◽  
Visar Berisha ◽  
Kan Kawabata ◽  
Julie Liss ◽  
Katherine Hustad

Aim: We compared the performance of five forced-alignment algorithms on a corpus of child speech.
Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals.
Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy for fricatives increased with age across the aligners.
Interpretation: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, and fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors.


2018 ◽  
Vol 61 (10) ◽  
pp. 2487-2501 ◽  
Author(s):  
Thea Knowles ◽  
Meghan Clayards ◽  
Morgan Sonderegger

Purpose: Heterogeneous child speech was force-aligned to investigate whether (a) manipulating specific parameters could improve alignment accuracy and (b) forced alignment could be used to replicate published results on acoustic characteristics of /s/ production by children.
Method: In Part 1, child speech from 2 corpora was force-aligned with a trainable aligner (Prosodylab-Aligner) under different conditions that systematically manipulated the input training data and the type of transcription used. Alignment accuracy was determined by comparing hand and automatic alignments in terms of how often they overlapped (%-Match) and absolute differences in duration and boundary placements. Using mixed-effects regression, accuracy was modeled as a function of alignment conditions, as well as segment and child age. In Part 2, forced alignments derived from a subset of the alignment conditions in Part 1 were used to extract the spectral center of gravity of /s/ productions from young children. These findings were compared to published results that used manual alignments of the same data.
Results: Overall, the results of Part 1 demonstrated that using training data more similar to the data to be aligned, as well as phonetic transcription, led to improvements in alignment accuracy. Speech from older children was aligned more accurately than speech from younger children. In Part 2, the /s/ center of gravity extracted from force-aligned segments was found to diverge in the speech of male and female children, replicating the pattern found in previous work using manually aligned segments. This was true even for the least accurate forced-alignment method.
Conclusions: Alignment accuracy for child speech can be improved by using more specific training data and transcriptions. However, poor alignment accuracy was not found to impede acoustic analysis of /s/ produced by even very young children. Thus, forced alignment presents a useful tool for the analysis of child speech.
Supplemental Material: https://doi.org/10.23641/asha.7070105
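Spectral center of gravity, the /s/ measure replicated in Part 2, is the energy-weighted mean frequency of a spectrum. A minimal NumPy sketch under stated assumptions (power-spectrum weighting and a Hann window; implementations vary, and this is not the study's extraction script):

```python
import numpy as np

def spectral_center_of_gravity(samples, sample_rate):
    """Power-weighted mean frequency (Hz) of a signal excerpt,
    e.g. the middle of a force-aligned /s/ interval."""
    windowed = samples * np.hanning(len(samples))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1 / sample_rate)
    return np.sum(freqs * power) / np.sum(power)

# Sanity check: a pure 7 kHz tone should give a COG near 7000 Hz.
sr = 22050
t = np.arange(0, 0.05, 1 / sr)
print(round(spectral_center_of_gravity(np.sin(2 * np.pi * 7000 * t), sr)))
```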


1977 ◽  
Vol 4 (1) ◽  
pp. 67-86 ◽  
Author(s):  
Ben G. Blount ◽  
Elise J. Padgug

Abstract: Parents employ a special register when speaking to young children, containing features that mark it as appropriate for children who are beginning to acquire their language. Parental speech in English to 5 children (ages 0;9–1;6) and in Spanish to 4 children (ages 0;8–1;1 and 1;6–1;10) was analysed for the presence and distribution of these features. Thirty-four paralinguistic, prosodic, and interactional features were identified, and rate measures and proportions indicated developmental patterns and differences across languages. Younger children received a higher rate of features that marked affect; older children were addressed with more features that marked semantically meaningful speech. English-speaking parents relied comparatively more on paralinguistic and affective features, whereas Spanish-speaking parents used comparatively more interactional features. Despite these differences, there was a high degree of similarity across parents and languages for the most frequently occurring features.


Author(s):  
Elena Babatsouli ◽  
David Ingram ◽  
Dimitrios A. Sotiropoulos

Abstract: Typical morpho-phonological measures of children’s speech realizations used in the literature depend linearly on their components. Examples are the proportion of consonants correct, the mean length of utterance and the phonological mean length of utterance. Because of their linear dependence on their components, these measures change in proportion to their component changes between speech realizations. However, there are instances in which variable speech realizations need to be differentiated better. Therefore, a measure which is more sensitive to its components than linear measures is needed. Here, entropy is proposed as such a measure. The sensitivity of entropy is compared analytically to that of linear measures, deriving ranges in component values inside which entropy is guaranteed to be more sensitive than the linear measures. The analysis is complemented by computing the entropy in two children’s English speech for different categories of word complexity and comparing its sensitivity to that of linear measures. One of the children is a bilingual typically developing child at age 3;0 and the other child is a monolingual child with speech sound disorders at age 5;11. The analysis and applications demonstrate the usefulness of the measure for evaluating speech realizations and its relative advantages over linear measures.
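To make the contrast with linear measures concrete: a proportion-style measure moves in equal steps for equal changes in its components, while Shannon entropy over the distribution of a child's realization variants responds non-linearly to the same component changes. A small illustrative sketch (the paper's exact component definitions are not reproduced here, so treat this as a generic example, not the authors' formulation):

```python
import math

def proportion_correct(correct, total):
    """Linear measure: changes in proportion to its component."""
    return correct / total

def shannon_entropy(counts):
    """Entropy (bits) of the distribution of realization variants."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

# Ten productions of one target, split across variants three ways:
print(shannon_entropy([10]))    # 0.0   -> fully consistent realization
print(shannon_entropy([9, 1]))  # ~0.47 -> one deviant token
print(shannon_entropy([5, 5]))  # 1.0   -> maximally variable
# The linear measure steps evenly across the same counts: 0.9 vs. 0.5
print(proportion_correct(9, 10), proportion_correct(5, 10))
```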


2015 ◽  
Vol 51 (2) ◽  
Author(s):  
María Luisa García Lecumberri ◽  
Martin Cooke ◽  
Christopher Bryant

Abstract: A key issue in judging foreign accent is to isolate the phonetic component from potentially confounding higher-level factors, such as grammatical or prosodic errors, which arise when using natural sentence-length speech material. The current study evaluated accent and intelligibility ratings of children’s speech for isolated words spliced out of extemporaneous material elicited via a picture description task. Experiment 1 demonstrated that word scores and accent ratings provided by native judges pattern as in earlier studies, validating the use of word-based material derived from natural speech. In a second experiment, listeners rated the degree of foreign accent and comprehensibility for unrelated sequences of 1 to 8 words from the same talker. The degree of foreign accent was judged to increase with sequence length, asymptoting by 2-word sequences, although listeners did not rate a sequence based on the most-accented word it contained. Comprehensibility was judged to be lower as sequence length increased, asymptoting at 4 words. These findings suggest that short sequences of randomly permuted words extracted from extemporaneous speech can be used for robust accent and comprehensibility judgements whose focus is on the phonetic basis for deviations from the native norm.


2021 ◽  
pp. 1-25
Author(s):  
Tania S. Zamuner ◽  
Theresa Rabideau ◽  
Margarethe McDonald ◽  
H. Henny Yeung

Abstract: This study investigates how children aged two to eight years (N = 129) and adults (N = 29) use auditory and visual speech for word recognition. The goal was to bridge the gap between the apparent success of visual speech processing by young children in visual-looking tasks and the apparent difficulty of speech processing shown by older children on explicit behavioural measures. Participants were presented with familiar words in audio-visual (AV), audio-only (A-only), or visual-only (V-only) speech modalities and then with target and distractor images, and looking to targets was measured. Adults showed high accuracy, with slightly less target-image looking in the V-only modality. Developmentally, looking was above chance for both the AV and A-only modalities, but not for the V-only modality until 6 years of age (earlier for /k/-initial words). Flexible use of visual cues for lexical access develops throughout childhood.


2021 ◽  
pp. 43-45
Author(s):  
Ankita Kumari ◽  
K. Srikumar

Speech-language pathologists and language experts need materials for collecting speech samples that they can evaluate and analyze for normalcy. For older children, a speech sample can be collected from spontaneous speech or from reading of a standardized text, but this cannot be done with younger children who cannot yet read sentences and words. For these children, a standardized word list is required so that their phonology can be checked for normalcy and intelligibility. Such a word list must not only be structured so that each sound occurs in all word positions, but the words must also be familiar to the younger age group (present in their vocabulary), since children need to identify a picture and name it. Such structured material is still limited in the Hindi language. The present study aimed to develop a word list in Hindi and to check its familiarity. The prepared word list was shown to 10 preschool teachers (Nursery to Upper Kindergarten). The words were rated on a three-point rating scale, and the results were analyzed using descriptive statistics. Words with more than 75% familiarity may be used with younger children for speech sample collection; words with familiarity between 50% and 75% can be used with younger children along with a few semantic and phonetic cues.
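The familiarity thresholds described above reduce to simple descriptive statistics over the teachers' ratings. A hypothetical sketch (the scale coding and the percentage scoring rule are assumptions; the abstract does not specify the study's actual computation):

```python
# Assumed 3-point coding: 2 = familiar, 1 = somewhat familiar, 0 = unfamiliar.

def familiarity_percent(ratings, max_score=2):
    """Summed ratings as a percentage of the maximum possible score."""
    return 100 * sum(ratings) / (max_score * len(ratings))

def classify(word, ratings):
    pct = familiarity_percent(ratings)
    if pct > 75:
        return f"{word}: {pct:.0f}% -> use for speech sample collection"
    if pct >= 50:
        return f"{word}: {pct:.0f}% -> use with semantic/phonetic cues"
    return f"{word}: {pct:.0f}% -> exclude"

# Ratings from 10 teachers for two hypothetical items:
print(classify("word_a", [2, 2, 2, 2, 2, 2, 2, 1, 2, 2]))  # 95% -> use
print(classify("word_b", [1, 1, 2, 1, 1, 1, 2, 1, 0, 2]))  # 60% -> cues
```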


1992 ◽  
Vol 13 (1) ◽  
pp. 77-91 ◽  
Author(s):  
Paul J. Yoder ◽  
Betty Davies

Abstract: The unintelligible speech of many developmentally delayed children poses problems for language intervention and language assessment efforts. Eighteen developmentally delayed children in Brown's (1973) stage I and their parents participated in two studies of the relationship between verbal routines and the intelligibility of developmentally delayed children's speech. The first study demonstrated that more intelligible child speech was found in routines than in nonroutines. To determine if routine utterances were articulated more accurately than nonroutine utterances, the second study extracted a representative sample of routine and nonroutine utterances from their visual and discourse contexts and asked two naive observers to transcribe them. To investigate the possible effect of contextual information, the naive observers transcribed the extracted utterances under context-information-present and context-information-absent conditions. The results indicated that extracted utterances were more intelligible under context-information-present conditions. The results were interpreted as indicating that child speech was more intelligible in routines than nonroutines because routines provide adults with more context information for interpreting ambiguous child utterances.


1988 ◽  
Vol 53 (1) ◽  
pp. 2-7 ◽  
Author(s):  
Barbara G. MacLachlan ◽  
Robin S. Chapman

The frequency and type of communication breakdowns occurring in the speech of 7 language learning-disabled (LLD) children, aged 9:10–11:1 (years:months), were examined in two conditions, conversation and narration, and compared to a group of 7 normal peers matched for chronological age and 7 peers matched for mean length of communication unit in conversation. The types of communication breakdowns examined included stalls, repairs, and abandoned utterances. Relative to the differences observed in the control groups, the LLD group showed a significantly greater rate of communication breakdowns per communication unit in narration than in conversation. Mean length of communication unit was also significantly greater in narration than in conversation for the LLD group compared to controls. For all groups, across both speech sample conditions, longer communication units contained more breakdowns than shorter ones. The groups did not differ in the types of breakdowns. Communication unit length and the nature of the narrative task may account for the increased dysfluencies in LLD children's speech.

