Visual speech segmentation: Using facial cues to locate word boundaries in continuous speech

2010 · Author(s): Aaron D. Mitchel, Daniel J. Weiss



2001 · Vol 27 (3) · pp. 351-372 · Author(s): Anand Venkataraman

A statistical model for segmentation and word discovery in continuous speech is presented, along with an incremental unsupervised learning algorithm that infers word boundaries based on this model. Results of empirical tests are also presented, showing that the algorithm is competitive with other models that have been used for similar tasks.
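As a concrete illustration of this family of statistical segmenters, the sketch below posits word boundaries at local dips in transitional probability (TP) between syllables. It is a deliberately minimal relative of such models, not a reimplementation of the paper's algorithm; the toy lexicon and the local-minimum boundary rule are assumptions made for illustration only.

```python
# Minimal sketch: segment a syllable stream at local minima of transitional
# probability. NOT the paper's model; lexicon and rule are illustrative.
from collections import Counter
import random

def transitional_probs(syllables):
    """Estimate P(next | current) from bigram counts over the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    unigram_counts = Counter(syllables[:-1])
    return {(a, b): c / unigram_counts[a] for (a, b), c in pair_counts.items()}

def segment(syllables, tps):
    """Posit a boundary wherever TP dips below both neighboring transitions."""
    seq = [tps[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, current = [], [syllables[0]]
    for i in range(1, len(syllables)):
        left = seq[i - 1]
        prev_tp = seq[i - 2] if i >= 2 else 1.0
        next_tp = seq[i] if i < len(seq) else 1.0
        if left < prev_tp and left < next_tp:  # local TP minimum -> boundary
            words.append("".join(current))
            current = []
        current.append(syllables[i])
    words.append("".join(current))
    return words

# Toy stream: three nonce words in random order, so within-word TP is 1.0
# and between-word TP is roughly 1/3.
random.seed(0)
lexicon = [["ba", "do", "ku"], ["tu", "pi", "ro"], ["go", "la", "bu"]]
stream = [syl for _ in range(200) for syl in random.choice(lexicon)]
print(segment(stream, transitional_probs(stream))[:8])
```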



1997 · Vol 33 (2) · pp. 111-153 · Author(s): Paul Cairns, Richard Shillcock, Nick Chater, Joe Levy


2021 · Vol 12 · Author(s): Theresa Matzinger, Nikolaus Ritt, W. Tecumseh Fitch

A prerequisite for spoken language learning is segmenting continuous speech into words. Among the many possible cues to word boundaries, listeners can use both transitional probabilities between syllables and various prosodic cues. However, the relative importance of these cues remains unclear, and previous experiments have not directly compared the effects of contrasting multiple prosodic cues. We used artificial language learning experiments, in which native German-speaking participants extracted meaningless trisyllabic “words” from a continuous speech stream, to evaluate these factors. We compared a baseline condition (statistical cues only) to five test conditions in which word-final syllables were either (a) followed by a pause, (b) lengthened, (c) shortened, (d) changed to a lower pitch, or (e) changed to a higher pitch. To evaluate robustness and generality, we used three tasks varying in difficulty. Overall, pauses and final lengthening were perceived as converging with the statistical cues and facilitated speech segmentation, with pauses helping most. Final-syllable shortening hindered baseline speech segmentation, indicating that when cues conflict, prosodic cues can override statistical cues. Surprisingly, pitch cues had little effect, suggesting that in our study context duration may be more relevant for speech segmentation than pitch. We discuss our findings with regard to the respective contributions of language-universal boundary cues and language-specific stress patterns to speech segmentation.
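To make the manipulation concrete, here is a hedged sketch of how such a familiarization stream might be assembled: trisyllabic nonce words concatenated in random order, with word-final syllables lengthened, shortened, or followed by a pause depending on condition. The lexicon, the base duration, and the 1.5×/0.5× scaling factors are illustrative assumptions, not the study's actual stimulus parameters.

```python
# Hedged sketch of a familiarization stream with prosodic boundary cues.
# All durations and items are invented; real stimuli were acoustic.
import random

LEXICON = [("to", "ki", "bu"), ("gi", "la", "mo"), ("pa", "du", "se")]
BASE_MS = 250  # assumed per-syllable duration in milliseconds

def build_stream(n_words, condition="baseline", seed=1):
    rng = random.Random(seed)
    events = []  # (syllable, duration_ms) pairs; "" marks silence
    for _ in range(n_words):
        word = rng.choice(LEXICON)
        for i, syl in enumerate(word):
            dur = BASE_MS
            if i == len(word) - 1:              # word-final syllable
                if condition == "lengthened":
                    dur = int(BASE_MS * 1.5)    # converging duration cue
                elif condition == "shortened":
                    dur = int(BASE_MS * 0.5)    # conflicting duration cue
            events.append((syl, dur))
        if condition == "pause":
            events.append(("", 100))            # silent gap after each word
    return events

print(build_stream(2, condition="pause"))
```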



2012 · Vol 36 (12) · pp. 3740-3748 · Author(s): Antoine J. Shahin, Mark A. Pitt


RELC Journal · 2020 · Article 003368822096663 · Author(s): Debra M Hardison, Martha C Pennington

This article reviews research findings on visual input in speech processing, in the form of facial cues and co-speech gestures, for second-language (L2) learners, and provides pedagogical implications for the teaching of listening and speaking. It traces the foundations of auditory–visual speech research and explores the role of a speaker’s facial cues in L2 perception training and of gestural cues in listening comprehension. There is a strong role for pedagogy in maximizing the salience of multimodal cues for L2 learners. Visible articulatory gestures that precede the acoustic signal, as well as the preparation phase of a hand gesture that precedes the acoustic onset of a word, have a priming effect: they direct perceivers’ attention to upcoming information and facilitate processing. Visible gestures that co-occur with speech aid ongoing processing and comprehension. L2 learners benefit from an awareness of these visual cues and from exposure to such input.



2019 · Author(s): Rodrigo Dal Ben, Débora de Hollanda Souza, Jessica Hay

Statistical regularities in linguistic input shape early language development and second language acquisition. For example, both transitional probability and phonotactic probability play a role in speech segmentation; however, it remains unclear whether or how these statistics are combined when small differences in phonotactic probability are present. We conducted two experiments to investigate the effects of transitional and phonotactic probabilities on speech segmentation by Brazilian-Portuguese-speaking adults. Four pseudo-languages, with six words each, were created. The transitional probabilities between words’ biphones were high, whereas the probabilities between part-words’ biphones were lower. Although the within- and between-word phonotactic probabilities were always high, they varied slightly across the familiarization languages and the test words/part-words. Languages 1 and 2 had familiarization words with unbalanced phonotactics, but the target words and part-words used at test were phonotactically balanced. Languages 3 and 4 had familiarization words with balanced phonotactics, but phonotactics were unbalanced across test items: in Language 3, words had slightly lower phonotactic probabilities than part-words; the reverse was true for Language 4. Eighty-one Brazilian-Portuguese-speaking adults were divided into four groups. Each group was familiarized with one version of the language and then tested on two-alternative forced-choice trials. Participants familiarized with Languages 1, 2, and 4 preferred words to part-words at test. However, participants who heard Language 3 did not select words above chance. There was no significant difference in word selection between Language 4 and Languages 1 and 2, despite the fact that phonotactic probabilities were higher during both familiarization and test for words from the fourth language. These findings indicate that phonotactic and transitional information can be tracked and combined to facilitate or impair speech segmentation. Furthermore, they suggest that subtle differences in phonotactics are more informative about word boundaries than congruence between high phonotactic and transitional probability cues.
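The biphone statistic at issue can be sketched as follows: estimate the corpus probability of each adjacent phone pair and average over an item's pairs, so that a "word" and a "part-word" can be compared on phonotactic grounds. The toy corpus and test items below are invented; the study's phonotactic estimates came from Brazilian Portuguese, and a real implementation would use positional, corpus-derived probabilities rather than these raw counts.

```python
# Illustrative biphone phonotactic probability from a toy corpus.
# Corpus and items are invented; this is not the study's estimation method.
from collections import Counter

def biphone_probability(item, corpus):
    """Mean corpus probability of each adjacent phone pair in `item`."""
    biphones = Counter(pair for w in corpus for pair in zip(w, w[1:]))
    total = sum(biphones.values())
    pairs = list(zip(item, item[1:]))
    return sum(biphones[p] / total for p in pairs) / len(pairs)

toy_corpus = ["batelo", "dipaku", "golabe", "batedi", "lobaku"]
for item in ["batelo", "lodipa"]:  # a "word" vs. a "part-word"
    print(item, round(biphone_probability(item, toy_corpus), 4))
```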



2017 · Author(s): Toben Herbert Mintz, Rachel L. Walker, Celeste Kidd, Ashlee Welday

A critical part of infants’ ability to acquire any language involves segmenting continuous speech input into discrete word-forms. Certain properties of words could provide infants with reliable cues to word boundaries. Here we investigate the potential utility of vowel harmony (VH), a phonological property whereby vowels within a word systematically exhibit similarity (“harmony”) in some aspect of the way they are pronounced. We present evidence that infants with no experience of VH in their native language nevertheless actively use these patterns to generate hypotheses about where words begin and end in the speech stream. In two experiments, we exposed infants learning English, a language without VH, to a continuous speech stream in which the only systematic patterns available as cues to word boundaries came from syllable sequences that showed VH or vowel disharmony (dissimilarity). After hearing less than one minute of the streams, infants showed evidence of sensitivity to VH cues. These results suggest that infants have an experience-independent sensitivity to VH and are predisposed to segment speech according to harmony patterns. We also found that when the VH patterns were more subtle (Experiment 2), infants required more exposure to the speech stream before they segmented based on VH, consistent with previous work on infants’ preferences relating to processing load. Our findings provide evidence for a previously unknown mechanism by which infants could discover the words of their language, and they shed light on the perceptual mechanisms that might be responsible for the emergence of vowel harmony as an organizing principle for the sound structure of words in many languages.
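One way to picture VH as a segmentation cue is the sketch below, which splits a syllable stream wherever adjacent syllables disagree in vowel backness. The front/back vowel classes, the neutral treatment of other vowels, and the symbolic stream are illustrative assumptions; the experiments presented controlled acoustic stimuli to infants, not symbolic input.

```python
# Illustrative sketch of vowel harmony as a boundary cue: split the stream
# where adjacent syllables disagree in vowel backness. Classes are invented.
FRONT, BACK = set("ie"), set("ou")

def vowel_class(syllable):
    """Return 'front' or 'back' for the first classifiable vowel, else None."""
    for ch in syllable:
        if ch in FRONT:
            return "front"
        if ch in BACK:
            return "back"
    return None  # other vowels treated as neutral in this sketch

def segment_by_harmony(syllables):
    words, current = [], [syllables[0]]
    for prev, syl in zip(syllables, syllables[1:]):
        a, b = vowel_class(prev), vowel_class(syl)
        if a and b and a != b:          # disharmony -> hypothesized boundary
            words.append("".join(current))
            current = []
        current.append(syl)
    words.append("".join(current))
    return words

# "gidemi" is front-harmonic, "dokugo" is back-harmonic.
stream = ["gi", "de", "mi", "do", "ku", "go", "gi", "de", "mi"]
print(segment_by_harmony(stream))  # ['gidemi', 'dokugo', 'gidemi']
```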


