Performance of Forced-Alignment Algorithms on Children's Speech

Author(s):  
Tristan J. Mahr ◽  
Visar Berisha ◽  
Kan Kawabata ◽  
Julie Liss ◽  
Katherine C. Hustad

Purpose: Acoustic measurement of speech sounds requires first segmenting the speech signal into relevant units (words, phones, etc.). Manual segmentation is cumbersome and time consuming. Forced-alignment algorithms automate this process by aligning a transcript and a speech sample. We compared the phoneme-level alignment performance of five available forced-alignment algorithms on a corpus of child speech. Our goal was to document aligner performance for child speech researchers.
Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals.
Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy for fricatives increased with age across the aligners.
Conclusion: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, and fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors.
Supplemental Material: https://doi.org/10.23641/asha.14167058
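The two scoring criteria described above are simple arithmetic once the manual and automatic phone intervals have been paired up. A minimal sketch of that computation (the data layout and function names are illustrative assumptions, not the authors' code):

```python
# Each phone interval is an (onset, offset) pair in seconds; the
# manual and automatic lists are assumed to be matched phone-by-phone.

def midpoint_accuracy(manual, automatic):
    """Fraction of phones whose automatic interval covers the
    midpoint of the corresponding manual interval."""
    hits = 0
    for (m_on, m_off), (a_on, a_off) in zip(manual, automatic):
        midpoint = (m_on + m_off) / 2
        if a_on <= midpoint <= a_off:
            hits += 1
    return hits / len(manual)

def onset_differences(manual, automatic):
    """Absolute differences in phone-onset times (seconds)."""
    return [abs(a_on - m_on)
            for (m_on, _), (a_on, _) in zip(manual, automatic)]

# Two-phone example:
manual = [(0.10, 0.25), (0.25, 0.40)]
auto = [(0.12, 0.27), (0.28, 0.41)]
print(midpoint_accuracy(manual, auto))  # 1.0 (both midpoints covered)
print(onset_differences(manual, auto))  # ~[0.02, 0.03]
```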

2020 ◽  
Author(s):  
Tristan Mahr ◽  
Visar Berisha ◽  
Kan Kawabata ◽  
Julie Liss ◽  
Katherine Hustad

Aim: We compared the performance of five forced-alignment algorithms on a corpus of child speech.
Method: The child speech sample included 42 children between 3 and 6 years of age. The corpus was force-aligned using the Montreal Forced Aligner with and without speaker adaptive training, triphone alignment from the Kaldi speech recognition engine, the Prosodylab-Aligner, and the Penn Phonetics Lab Forced Aligner. The sample was also manually aligned to create gold-standard alignments. We evaluated alignment algorithms in terms of accuracy (whether the interval covers the midpoint of the manual alignment) and difference in phone-onset times between the automatic and manual intervals.
Results: The Montreal Forced Aligner with speaker adaptive training showed the highest accuracy and smallest timing differences. Vowels were consistently the most accurately aligned class of sounds across all the aligners, and alignment accuracy for fricatives increased with age across the aligners.
Interpretation: The best-performing aligner fell just short of human-level reliability for forced alignment. Researchers can use forced alignment with child speech for certain classes of sounds (vowels, and fricatives for older children), especially as part of a semi-automated workflow where alignments are later inspected for gross errors.


2018 ◽  
Vol 61 (10) ◽  
pp. 2487-2501 ◽  
Author(s):  
Thea Knowles ◽  
Meghan Clayards ◽  
Morgan Sonderegger

Purpose: Heterogeneous child speech was force-aligned to investigate whether (a) manipulating specific parameters could improve alignment accuracy and (b) forced alignment could be used to replicate published results on acoustic characteristics of /s/ production by children.
Method: In Part 1, child speech from 2 corpora was force-aligned with a trainable aligner (Prosodylab-Aligner) under different conditions that systematically manipulated the input training data and the type of transcription used. Alignment accuracy was determined by comparing hand and automatic alignments in terms of how often they overlapped (%-Match) and absolute differences in duration and boundary placements. Using mixed-effects regression, accuracy was modeled as a function of alignment conditions, as well as segment and child age. In Part 2, forced alignments derived from a subset of the alignment conditions in Part 1 were used to extract the spectral center of gravity of /s/ productions from young children. These findings were compared to published results that used manual alignments of the same data.
Results: Overall, the results of Part 1 demonstrated that using training data more similar to the data to be aligned, as well as phonetic transcription, led to improvements in alignment accuracy. Speech from older children was aligned more accurately than speech from younger children. In Part 2, the /s/ center of gravity extracted from force-aligned segments was found to diverge in the speech of male and female children, replicating the pattern found in previous work using manually aligned segments. This was true even for the least accurate forced-alignment method.
Conclusions: Alignment accuracy for child speech can be improved by using more specific training data and transcriptions. However, poor alignment accuracy was not found to impede acoustic analysis of /s/ produced by even very young children. Thus, forced alignment presents a useful tool for the analysis of child speech.
Supplemental Material: https://doi.org/10.23641/asha.7070105
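Spectral center of gravity, the /s/ measure replicated in Part 2, is the energy-weighted mean frequency of a spectrum. A minimal NumPy sketch under stated assumptions (power-spectrum weighting and a Hann window; implementations vary, and this is not the study's extraction script):

```python
import numpy as np

def spectral_center_of_gravity(samples, sample_rate):
    """Power-weighted mean frequency (Hz) of a signal excerpt,
    e.g. the middle of a force-aligned /s/ interval."""
    windowed = samples * np.hanning(len(samples))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1 / sample_rate)
    return np.sum(freqs * power) / np.sum(power)

# Sanity check: a pure 7 kHz tone should give a COG near 7000 Hz.
sr = 22050
t = np.arange(0, 0.05, 1 / sr)
print(round(spectral_center_of_gravity(np.sin(2 * np.pi * 7000 * t), sr)))
```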


1977 ◽  
Vol 4 (1) ◽  
pp. 67-86 ◽  
Author(s):  
Ben G. Blount ◽  
Elise J. Padgug

Abstract: Parents employ a special register when speaking to young children, containing features that mark it as appropriate for children who are beginning to acquire their language. Parental speech in English to 5 children (ages 0;9–1;6) and in Spanish to 4 children (ages 0;8–1;1 and 1;6–1;10) was analysed for the presence and distribution of these features. Thirty-four paralinguistic, prosodic, and interactional features were identified, and rate measures and proportions indicated developmental patterns and differences across languages. Younger children received a higher rate of features that marked affect; older children were addressed with more features that marked semantically meaningful speech. English-speaking parents relied comparatively more on paralinguistic and affective features, whereas Spanish-speaking parents used comparatively more interactional features. Despite these differences, there was a high degree of similarity across parents and languages for the most frequently occurring features.


Author(s):  
Elena Babatsouli ◽  
David Ingram ◽  
Dimitrios A. Sotiropoulos

Abstract: Typical morpho-phonological measures of children’s speech realizations used in the literature depend linearly on their components. Examples are the proportion of consonants correct, the mean length of utterance and the phonological mean length of utterance. Because of their linear dependence on their components, these measures change in proportion to their component changes between speech realizations. However, there are instances in which variable speech realizations need to be differentiated better. Therefore, a measure which is more sensitive to its components than linear measures is needed. Here, entropy is proposed as such a measure. The sensitivity of entropy is compared analytically to that of linear measures, deriving ranges in component values inside which entropy is guaranteed to be more sensitive than the linear measures. The analysis is complemented by computing the entropy in two children’s English speech for different categories of word complexity and comparing its sensitivity to that of linear measures. One of the children is a bilingual typically developing child at age 3;0 and the other child is a monolingual child with speech sound disorders at age 5;11. The analysis and applications demonstrate the usefulness of the measure for evaluating speech realizations and its relative advantages over linear measures.
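To make the contrast with linear measures concrete: a proportion-style measure moves in equal steps for equal changes in its components, while Shannon entropy over the distribution of a child's realization variants responds non-linearly to the same component changes. A small illustrative sketch (the paper's exact component definitions are not reproduced here, so treat this as a generic example, not the authors' formulation):

```python
import math

def proportion_correct(correct, total):
    """Linear measure: changes in proportion to its component."""
    return correct / total

def shannon_entropy(counts):
    """Entropy (bits) of the distribution of realization variants."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

# Ten productions of one target, split across variants three ways:
print(shannon_entropy([10]))    # 0.0   -> fully consistent realization
print(shannon_entropy([9, 1]))  # ~0.47 -> one deviant token
print(shannon_entropy([5, 5]))  # 1.0   -> maximally variable
# The linear measure steps evenly across the same counts: 0.9 vs. 0.5
print(proportion_correct(9, 10), proportion_correct(5, 10))
```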


2015 ◽  
Vol 51 (2) ◽  
Author(s):  
María Luisa García Lecumberri ◽  
Martin Cooke ◽  
Christopher Bryant

Abstract: A key issue in judging foreign accent is to isolate the phonetic component from potentially confounding higher-level factors, such as grammatical or prosodic errors, which arise when using natural sentence-length speech material. The current study evaluated accent and intelligibility ratings of children’s speech for isolated words spliced out of extemporaneous material elicited via a picture description task. Experiment 1 demonstrated that word scores and accent ratings provided by native judges pattern as in earlier studies, validating the use of word-based material derived from natural speech. In a second experiment, listeners rated the degree of foreign accent and comprehensibility for unrelated sequences of 1 to 8 words from the same talker. The degree of foreign accent was judged to increase with sequence length, asymptoting by 2-word sequences, although listeners did not rate a sequence based on the most-accented word it contained. Comprehensibility was judged to be lower as sequence length increased, asymptoting at 4 words. These findings suggest that short sequences of randomly permuted words extracted from extemporaneous speech can be used for robust accent and comprehensibility judgements whose focus is on the phonetic basis for deviations from the native norm.


2021 ◽  
pp. 1-25
Author(s):  
Tania S. Zamuner ◽  
Theresa Rabideau ◽  
Margarethe McDonald ◽  
H. Henny Yeung

Abstract: This study investigates how children aged two to eight years (N = 129) and adults (N = 29) use auditory and visual speech for word recognition. The goal was to bridge the gap between the apparent success of visual speech processing by young children in visual-looking tasks and the apparent difficulty of speech processing shown by older children on explicit behavioural measures. Participants were presented with familiar words in audio-visual (AV), audio-only (A-only), or visual-only (V-only) speech modalities and then with target and distractor images, and looking to targets was measured. Adults showed high accuracy, with slightly less target-image looking in the V-only modality. Developmentally, looking was above chance for both the AV and A-only modalities, but not for the V-only modality until 6 years of age (earlier for /k/-initial words). Flexible use of visual cues for lexical access develops throughout childhood.


2021 ◽  
pp. 43-45
Author(s):  
Ankita Kumari ◽  
K. Srikumar

Speech-language pathologists and language experts need materials for collecting speech samples that they can evaluate and analyze for normalcy. For older children, a speech sample can be collected from spontaneous speech or from reading of a standardized text, but this cannot be done with younger children who cannot yet read sentences and words. For these children, a standardized word list is required so that their phonology can be checked for normalcy and intelligibility. Such a word list must not only be structured so that each sound occurs in all word positions, but the words must also be familiar to the younger age group (present in their vocabulary), since children need to identify a picture and name it. Such structured material is still limited in the Hindi language. The present study aimed to develop a word list in Hindi and to check its familiarity. The prepared word list was shown to 10 preschool teachers (Nursery to Upper Kindergarten). The words were rated on a three-point rating scale, and the results were analyzed using descriptive statistics. Words with more than 75% familiarity may be used with younger children for speech sample collection; words with familiarity between 50% and 75% can be used with younger children along with a few semantic and phonetic cues.
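The familiarity thresholds described above reduce to simple descriptive statistics over the teachers' ratings. A hypothetical sketch (the scale coding and the percentage scoring rule are assumptions; the abstract does not specify the study's actual computation):

```python
# Assumed 3-point coding: 2 = familiar, 1 = somewhat familiar, 0 = unfamiliar.

def familiarity_percent(ratings, max_score=2):
    """Summed ratings as a percentage of the maximum possible score."""
    return 100 * sum(ratings) / (max_score * len(ratings))

def classify(word, ratings):
    pct = familiarity_percent(ratings)
    if pct > 75:
        return f"{word}: {pct:.0f}% -> use for speech sample collection"
    if pct >= 50:
        return f"{word}: {pct:.0f}% -> use with semantic/phonetic cues"
    return f"{word}: {pct:.0f}% -> exclude"

# Ratings from 10 teachers for two hypothetical items:
print(classify("word_a", [2, 2, 2, 2, 2, 2, 2, 1, 2, 2]))  # 95% -> use
print(classify("word_b", [1, 1, 2, 1, 1, 1, 2, 1, 0, 2]))  # 60% -> cues
```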


1992 ◽  
Vol 13 (1) ◽  
pp. 77-91 ◽  
Author(s):  
Paul J. Yoder ◽  
Betty Davies

Abstract: The unintelligible speech of many developmentally delayed children poses problems for language intervention and language assessment efforts. Eighteen developmentally delayed children in Brown's (1973) stage I and their parents participated in two studies of the relationship between verbal routines and the intelligibility of developmentally delayed children's speech. The first study demonstrated that more intelligible child speech was found in routines than in nonroutines. To determine if routine utterances were articulated more accurately than nonroutine utterances, the second study extracted a representative sample of routine and nonroutine utterances from their visual and discourse contexts and asked two naive observers to transcribe them. To investigate the possible effect of contextual information, the naive observers transcribed the extracted utterances under context-information-present and context-information-absent conditions. The results indicated that extracted utterances were more intelligible under context-information-present conditions. The results were interpreted as indicating that child speech was more intelligible in routines than nonroutines because routines provide adults with more context information for interpreting ambiguous child utterances.


1988 ◽  
Vol 53 (1) ◽  
pp. 2-7 ◽  
Author(s):  
Barbara G. MacLachlan ◽  
Robin S. Chapman

The frequency and type of communication breakdowns occurring in the speech of 7 language learning-disabled (LLD) children, aged 9:10–11:1 (years:months), were examined in two conditions, conversation and narration, and compared to a group of 7 normal peers matched for chronological age and 7 peers matched for mean length of communication unit in conversation. The types of communication breakdowns examined included stalls, repairs, and abandoned utterances. Relative to the differences observed in the control groups, the LLD group showed a significantly greater rate of communication breakdowns per communication unit in narration than in conversation. Mean length of communication unit was also significantly greater in narration than in conversation for the LLD group compared to controls. For all groups, across both speech sample conditions, longer communication units contained more breakdowns than shorter ones. The groups did not differ in the types of breakdowns. Communication unit length and the nature of the narrative task may account for the increased dysfluencies in LLD children's speech.

