The Influence of Selective Attention to Auditory and Visual Speech on the Integration of Audiovisual Speech Information

Perception ◽  
10.1068/p6939 ◽  
2011 ◽  
Vol 40 (10) ◽  
pp. 1164-1182 ◽  
Author(s):  
Julie N Buchan ◽  
Kevin G Munhall


Author(s):  
Lawrence D. Rosenblum

Research on visual and audiovisual speech information has profoundly influenced the fields of psycholinguistics, perception psychology, and cognitive neuroscience. Visual speech findings have provided some of the most important human demonstrations of our new conception of the perceptual brain as supremely multimodal. This “multisensory revolution” has seen tremendous growth in research on how the senses integrate, cross-facilitate, and share their experience with one another. The ubiquity and apparent automaticity of multisensory speech have led many theorists to propose that the speech brain is agnostic with regard to sense modality: it might not know or care from which modality speech information comes. Instead, the speech function may act to extract supramodal informational patterns that are common in form across energy streams. Alternatively, other theorists have argued that any common information existent across the modalities is minimal and rudimentary, so that multisensory perception largely depends on the observer’s associative experience between the streams. From this perspective, the auditory stream is typically considered primary for the speech brain, with visual speech simply appended to its processing. If the utility of multisensory speech is a consequence of a supramodal informational coherence, then cross-sensory “integration” may be primarily a consequence of the informational input itself. If true, then one would expect to see evidence for integration occurring early in the perceptual process, as well as in a largely complete and automatic/impenetrable manner. Alternatively, if multisensory speech perception is based on associative experience between the modal streams, then no constraints are dictated on how completely or automatically the senses integrate. There is behavioral and neurophysiological research supporting both perspectives. Much of this research is based on testing the well-known McGurk effect, in which audiovisual speech information is thought to integrate to the extent that visual information can affect what listeners report hearing. However, there is now good reason to believe that the McGurk effect is not a valid test of multisensory integration. For example, there are clear cases in which responses indicate that the effect fails, while other measures suggest that integration is actually occurring. By mistakenly conflating the McGurk effect with speech integration itself, interpretations of the completeness and automaticity of multisensory integration may be incorrect. Future research should use more sensitive behavioral and neurophysiological measures of cross-modal influence to examine these issues.


2008 ◽  
Vol 17 (6) ◽  
pp. 405-409 ◽  
Author(s):  
Lawrence D. Rosenblum

Speech perception is inherently multimodal. Visual speech (lip-reading) information is used by all perceivers and readily integrates with auditory speech. Imaging research suggests that the brain treats auditory and visual speech similarly. These findings have led some researchers to consider that speech perception works by extracting amodal information that takes the same form across modalities. From this perspective, speech integration is a property of the input information itself. Amodal speech information could explain the reported automaticity, immediacy, and completeness of audiovisual speech integration. However, recent findings suggest that speech integration can be influenced by higher cognitive properties such as lexical status and semantic context. Proponents of amodal accounts will need to explain these results.


2012 ◽  
Vol 25 (0) ◽  
pp. 186
Author(s):  
Daniel Drebing ◽  
Jared Medina ◽  
H. Branch Coslett ◽  
Jeffrey T. Shenton ◽  
Roy H. Hamilton

Integrating sensory information across modalities is necessary for a cohesive experience of the world; disrupting the ability to bind the multisensory stimuli arising from an event leads to a disjointed and confusing percept. We previously reported (Hamilton et al., 2006) a patient, AWF, who suffered an acute neural incident after which he displayed a distinct inability to integrate auditory and visual speech information. While our prior experiments involving AWF suggested that he had a deficit of audiovisual speech processing, they did not explore the hypothesis that his deficits in audiovisual integration are restricted to speech. In order to test this notion, we conducted a series of experiments aimed at testing AWF’s ability to integrate cross-modal information from both speech and non-speech events. AWF was tasked with making temporal order judgments (TOJs) for videos of object noises (such as hands clapping) or speech, wherein the onsets of auditory and visual information were manipulated. Results from the experiments show that while AWF performed worse than controls in his ability to accurately judge even the most salient onset differences for speech videos, he did not differ significantly from controls in his ability to make TOJs for the object videos. These results illustrate the possibility of disruption of intermodal binding for audiovisual speech events with spared binding for real-world, non-speech events.
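
As a rough illustration of how temporal order judgment data of this kind are commonly analyzed (a generic sketch, not the authors' analysis), the proportion of "visual first" responses can be fit with a cumulative Gaussian over audiovisual onset asynchrony to estimate a point of subjective simultaneity and a threshold-like spread parameter. The SOA values and response proportions below are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Generic temporal-order-judgment (TOJ) analysis sketch: fit a cumulative
# Gaussian to the proportion of "visual first" responses as a function of
# stimulus onset asynchrony (SOA), then read off the point of subjective
# simultaneity (PSS) and the spread parameter (sigma).

def cum_gauss(soa, pss, sigma):
    return norm.cdf(soa, loc=pss, scale=sigma)

# Hypothetical data: SOA in ms (positive = visual onset leads the audio)
# and the observed proportion of "visual first" responses at each SOA.
soas = np.array([-300, -200, -100, 0, 100, 200, 300], dtype=float)
p_visual_first = np.array([0.05, 0.15, 0.30, 0.50, 0.70, 0.85, 0.95])

params, _ = curve_fit(cum_gauss, soas, p_visual_first, p0=[0.0, 100.0])
pss, sigma = params
print(f"PSS = {pss:.1f} ms, sigma = {sigma:.1f} ms")
```

A much flatter fitted curve (larger sigma) for speech videos than for object videos would correspond to the kind of selective impairment described above.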


Perception ◽  
10.1068/p5852 ◽  
2007 ◽  
Vol 36 (10) ◽  
pp. 1535-1545 ◽  
Author(s):  
Ian T Everdell ◽  
Heidi Marsh ◽  
Micheal D Yurick ◽  
Kevin G Munhall ◽  
Martin Paré

Speech perception under natural conditions entails integration of auditory and visual information. Understanding how visual and auditory speech information are integrated requires detailed descriptions of the nature and processing of visual speech information. To understand better the process of gathering visual information, we studied the distribution of face-directed fixations of humans performing an audiovisual speech perception task to characterise the degree of asymmetrical viewing and its relationship to speech intelligibility. Participants showed stronger gaze fixation asymmetries while viewing dynamic faces, compared to static faces or face-like objects, especially when gaze was directed to the talkers' eyes. Although speech perception accuracy was significantly enhanced by the viewing of congruent, dynamic faces, we found no correlation between task performance and gaze fixation asymmetry. Most participants preferentially fixated the right side of the faces and their preferences persisted while viewing horizontally mirrored stimuli, different talkers, or static faces. These results suggest that the asymmetrical distributions of gaze fixations reflect the participants' viewing preferences, rather than being a product of asymmetrical faces, but that this behavioural bias does not predict correct audiovisual speech perception.
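
A gaze fixation asymmetry of the kind characterized above is often quantified with a simple normalized index over fixation time on each half of the face; the sketch below is a generic illustration with invented numbers, not the authors' exact measure.

```python
import numpy as np

# Hypothetical total fixation durations (ms) on the left vs right half of
# the talker's face for four participants (numbers invented for illustration).
left_ms  = np.array([1200.0,  900.0, 1500.0, 1100.0])
right_ms = np.array([1800.0, 2100.0, 1600.0, 1900.0])

# Normalized asymmetry index in [-1, 1]; positive = rightward viewing bias.
asymmetry = (right_ms - left_ms) / (right_ms + left_ms)
print(asymmetry)          # per-participant bias
print(asymmetry.mean())   # group-level bias
```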


1997 ◽  
Vol 40 (2) ◽  
pp. 432-443 ◽  
Author(s):  
Karen S. Helfer

Research has shown that speaking in a deliberately clear manner can improve the accuracy of auditory speech recognition. Allowing listeners access to visual speech cues also enhances speech understanding. Whether the nature of information provided by speaking clearly and by using visual speech cues is redundant has not been determined. This study examined how speaking mode (clear vs. conversational) and presentation mode (auditory vs. auditory-visual) influenced the perception of words within nonsense sentences. In Experiment 1, 30 young listeners with normal hearing responded to videotaped stimuli presented audiovisually in the presence of background noise at one of three signal-to-noise ratios. In Experiment 2, 9 participants returned for an additional assessment using auditory-only presentation. Results of these experiments showed significant effects of speaking mode (clear speech was easier to understand than was conversational speech) and presentation mode (auditory-visual presentation led to better performance than did auditory-only presentation). The benefit of clear speech was greater for words occurring in the middle of sentences than for words at either the beginning or end of sentences for both auditory-only and auditory-visual presentation, whereas the greatest benefit from supplying visual cues was for words at the end of sentences spoken both clearly and conversationally. The total benefit from speaking clearly and supplying visual cues was equal to the sum of each of these effects. Overall, the results suggest that speaking clearly and providing visual speech information provide complementary (rather than redundant) information.
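
The additivity claim in the last sentence can be made concrete with a small worked example; the percent-correct values below are invented for illustration and are not the study's data.

```python
# Invented percent-correct scores illustrating an additive pattern of
# clear-speech and visual-cue benefits (not the study's data).
conversational_auditory = 40      # baseline: conversational speech, audio only
clear_auditory          = 55      # clear-speech benefit: +15 points
conversational_av       = 60      # visual-cue benefit:   +20 points
clear_av                = 75      # observed combined condition

clear_benefit  = clear_auditory - conversational_auditory      # 15
visual_benefit = conversational_av - conversational_auditory   # 20
predicted = conversational_auditory + clear_benefit + visual_benefit
print(predicted, clear_av, predicted == clear_av)  # 75 75 True -> additive
```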


2005 ◽  
Vol 40 ◽  
pp. 19-32
Author(s):  
Sascha Fagel

The author presents MASSY, the MODULAR AUDIOVISUAL SPEECH SYNTHESIZER. The system combines two approaches to visual speech synthesis. Two control models are implemented: a (data-based) di-viseme model and a (rule-based) dominance model, both of which produce control commands in a parameterized articulation space. Analogously, two visualization methods are implemented: an image-based (video-realistic) face model and a 3D synthetic head. Both face models can be driven by either the data-based or the rule-based articulation model. The high-level visual speech synthesis generates a sequence of control commands for the visible articulation. For every virtual articulator (articulation parameter), the 3D synthetic face model defines a set of displacement vectors for the vertices of the 3D objects of the head. The vertices of the 3D synthetic head are then moved by linear combinations of these displacement vectors to visualize articulation movements. For the image-based video synthesis, a single reference image is deformed to fit the facial properties derived from the control commands. Facial feature points and facial displacements have to be defined for the reference image. The algorithm can also use an image database with appropriately annotated facial properties. An example database was built automatically from video recordings. Both the 3D synthetic face and the image-based face generate visual speech that is capable of increasing the intelligibility of audible speech. Other well-known image-based audiovisual speech synthesis systems, such as MIKETALK and VIDEO REWRITE, concatenate pre-recorded single images or video sequences, respectively. Parametric talking heads like BALDI control a parametric face with a parametric articulation model. The presented system demonstrates the compatibility of parametric and data-based visual speech synthesis approaches.
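
The displacement-vector scheme described for the 3D synthetic head amounts to a linear (blendshape-like) deformation of the mesh. The sketch below is a minimal NumPy illustration of that idea under assumed array shapes; the function and variable names are invented and are not part of MASSY.

```python
import numpy as np

# Minimal sketch of a blendshape-like deformation: each articulation
# parameter has a per-vertex displacement field, and the deformed mesh is
# the neutral mesh plus a linear combination of those fields.

def deform_mesh(neutral_vertices, displacement_fields, articulation_params):
    """
    neutral_vertices    : (V, 3) resting vertex positions
    displacement_fields : (P, V, 3) one displacement field per parameter
    articulation_params : (P,) current parameter values (e.g. jaw opening)
    Returns the deformed (V, 3) vertex positions.
    """
    offset = np.tensordot(articulation_params, displacement_fields, axes=1)
    return neutral_vertices + offset

# Toy example: 4 vertices, 2 articulation parameters (values are placeholders).
neutral = np.zeros((4, 3))
fields = 0.01 * np.random.randn(2, 4, 3)
params = np.array([0.7, 0.2])      # e.g. 70% jaw opening, 20% lip rounding
print(deform_mesh(neutral, fields, params))
```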


2020 ◽  
Author(s):  
Jonathan E Peelle ◽  
Brent Spehar ◽  
Michael S Jones ◽  
Sarah McConkey ◽  
Joel Myerson ◽  
...  

In everyday conversation, we usually process the talker's face as well as the sound of their voice. Access to visual speech information is particularly useful when the auditory signal is degraded. Here we used fMRI to monitor brain activity while adults (n = 60) were presented with visual-only, auditory-only, and audiovisual words. As expected, audiovisual speech perception recruited both auditory and visual cortex, with a trend towards increased recruitment of premotor cortex in more difficult conditions (for example, in substantial background noise). We then investigated neural connectivity using psychophysiological interaction (PPI) analysis with seed regions in both primary auditory cortex and primary visual cortex. Connectivity between auditory and visual cortices was stronger in audiovisual conditions than in unimodal conditions, and extended to a wide network of regions in posterior temporal cortex and prefrontal cortex. Taken together, our results suggest a prominent role for cross-region synchronization in understanding both visual-only and audiovisual speech.
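
For readers unfamiliar with PPI, the core of the analysis is a regressor formed from the product of the seed region's timecourse and a task regressor, whose weight in a per-voxel GLM indexes task-dependent connectivity. The sketch below is a deliberately simplified NumPy illustration with invented data; real fMRI PPI pipelines additionally deconvolve the seed signal to the neural level and convolve with a haemodynamic response function, and this is not the authors' code.

```python
import numpy as np

# Simplified PPI sketch: the interaction regressor is the product of the
# mean-centered seed timecourse and the mean-centered task regressor; its
# GLM weight for a target timecourse indexes task-dependent connectivity.

def ppi_beta(seed_ts, task_reg, target_ts):
    seed = seed_ts - seed_ts.mean()
    task = task_reg - task_reg.mean()
    ppi = seed * task                                   # interaction term
    X = np.column_stack([np.ones_like(seed), seed, task, ppi])
    betas, *_ = np.linalg.lstsq(X, target_ts, rcond=None)
    return betas[3]                                     # weight on PPI term

# Toy data: coupling with the seed is stronger in "audiovisual" blocks.
rng = np.random.default_rng(0)
n = 200
seed_ts = rng.standard_normal(n)
task_reg = np.tile([0.0, 1.0], n // 2)    # 0 = unimodal, 1 = audiovisual block
target_ts = 0.5 * seed_ts * (task_reg - 0.5) + rng.standard_normal(n)
print(ppi_beta(seed_ts, task_reg, target_ts))           # ~0.5
```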


Author(s):  
Doğu Erdener

Speech perception has long been taken for granted as an auditory-only process. However, it is now firmly established that speech perception is an auditory-visual process in which visual speech information, in the form of lip and mouth movements, is taken into account. Traditionally, foreign language (L2) instructional methods and materials are auditory-based. This chapter presents a general framework of evidence that visual speech information will facilitate L2 instruction. The author claims that this knowledge will help bridge the gap between psycholinguistics and L2 instruction as an applied field. The chapter also describes how orthography can be used in L2 instruction. While learners from a transparent L1 orthographic background can decipher the phonology of orthographically transparent L2s, overriding the visual speech information, that is not the case for those from orthographically opaque L1s.


2016 ◽  
Vol 44 (1) ◽  
pp. 185-215 ◽  
Author(s):  
Susan Jerger ◽  
Markus F. Damian ◽  
Nancy Tye-Murray ◽  
Hervé Abdi

Adults use vision to perceive low-fidelity speech; yet how children acquire this ability is not well understood. The literature indicates that children show reduced sensitivity to visual speech from kindergarten to adolescence. We hypothesized that this pattern reflects the effects of complex tasks and a growth period with harder-to-utilize cognitive resources, not a lack of sensitivity. We investigated sensitivity to visual speech in children via the phonological priming produced by low-fidelity (non-intact onset) auditory speech presented audiovisually (see a dynamic face articulate the consonant/rhyme b/ag; hear the non-intact onset/rhyme: –b/ag) vs. auditorily (see a still face; hear exactly the same auditory input). Audiovisual speech produced greater priming from four to fourteen years, indicating that visual speech filled in the non-intact auditory onsets. The influence of visual speech depended uniquely on phonology and speechreading. Children, like adults, perceive speech onsets multimodally. Findings are critical for incorporating visual speech into developmental theories of speech perception.

