STRAIGHT-TEMPO: a universal tool to manipulate linguistic and para-linguistic speech information

Author(s):  
H. Kawahara


1997 ◽
Vol 40 (2) ◽  
pp. 432-443 ◽  
Author(s):  
Karen S. Helfer

Research has shown that speaking in a deliberately clear manner can improve the accuracy of auditory speech recognition. Allowing listeners access to visual speech cues also enhances speech understanding. Whether the nature of the information provided by speaking clearly and by using visual speech cues is redundant has not been determined. This study examined how speaking mode (clear vs. conversational) and presentation mode (auditory vs. auditory-visual) influenced the perception of words within nonsense sentences. In Experiment 1, 30 young listeners with normal hearing responded to videotaped stimuli presented audiovisually in the presence of background noise at one of three signal-to-noise ratios. In Experiment 2, 9 participants returned for an additional assessment using auditory-only presentation. Results of these experiments showed significant effects of speaking mode (clear speech was easier to understand than conversational speech) and presentation mode (auditory-visual presentation led to better performance than auditory-only presentation). The benefit of clear speech was greater for words occurring in the middle of sentences than for words at either the beginning or end of sentences for both auditory-only and auditory-visual presentation, whereas the greatest benefit from supplying visual cues was for words at the end of sentences spoken both clearly and conversationally. The total benefit from speaking clearly and supplying visual cues was equal to the sum of the two individual effects. Overall, the results suggest that speaking clearly and providing visual speech information provide complementary (rather than redundant) information.


2021 ◽  
Vol 64 (10) ◽  
pp. 4014-4029
Author(s):  
Kathy R. Vander Werff ◽  
Christopher E. Niemczak ◽  
Kenneth Morse

Purpose: Background noise has been categorized as energetic masking, due to spectrotemporal overlap of the target and masker at the auditory periphery, or informational masking, due to cognitive-level interference from relevant content such as speech. The effects of masking on cortical and sensory auditory processing can be studied objectively with the cortical auditory evoked potential (CAEP). However, whether effects on neural response morphology are due to energetic spectrotemporal differences or to informational content is not fully understood. The current multi-experiment series was designed to assess the effects of speech versus nonspeech maskers on the neural encoding of speech information in the central auditory system, specifically in terms of the effects of speech babble maskers varying in talker number. Method: CAEPs were recorded from normal-hearing young adults in response to speech syllables in the presence of energetic maskers (white or speech-shaped noise) and varying amounts of informational masking (speech babble maskers). The primary manipulation of informational masking was the number of talkers in the speech babble, and CAEP results were compared with those for nonspeech maskers with different temporal and spectral characteristics. Results: Even when the nonspeech noise maskers were spectrally shaped and temporally modulated to match the speech babble maskers, notable changes in the typical morphology of the CAEP in response to speech stimuli were identified in the presence of both the primarily energetic maskers and the speech babble maskers with varying numbers of talkers. Conclusions: While differences in CAEP outcomes did not reach significance by number of talkers, neural components were significantly affected by speech babble maskers compared with nonspeech maskers. These results suggest an informational masking influence on the neural encoding of speech information at the sensory cortical level of auditory processing, even without active participation on the part of the listener.
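The nonspeech comparison maskers described above (noise spectrally shaped and temporally modulated to match speech babble) follow a standard stimulus-construction idea, sketched minimally below. This is not the authors' stimulus code; the 16 Hz envelope cutoff, filter order, and RMS matching are illustrative assumptions.

```python
# Minimal sketch: build a nonspeech masker that is spectrally shaped and
# temporally modulated to match a speech-babble reference signal.
# Parameters are illustrative assumptions, not the authors' values.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def matched_nonspeech_masker(babble: np.ndarray, fs: int) -> np.ndarray:
    """White noise given the babble's long-term spectrum and slow envelope."""
    n = len(babble)
    # 1) Spectral shaping: keep the babble's magnitude spectrum, randomize phase.
    noise_phase = np.angle(np.fft.rfft(np.random.randn(n)))
    shaped = np.fft.irfft(np.abs(np.fft.rfft(babble)) * np.exp(1j * noise_phase), n)
    # 2) Temporal modulation: impose the babble's low-pass amplitude envelope.
    env = np.abs(hilbert(babble))
    b, a = butter(2, 16 / (fs / 2))   # ~16 Hz modulation cutoff (assumed)
    env = filtfilt(b, a, env)
    shaped *= env / (np.max(np.abs(env)) + 1e-12)
    # 3) Match overall level (RMS) to the babble.
    shaped *= np.sqrt(np.mean(babble ** 2) / (np.mean(shaped ** 2) + 1e-12))
    return shaped
```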


2021 ◽  
Vol Publish Ahead of Print ◽  
Author(s):  
Sigrid Polspoel ◽  
Sophia E. Kramer ◽  
Bas van Dijk ◽  
Cas Smits

Author(s):  
Weigao Su ◽  
Daibo Liu ◽  
Taiyuan Zhang ◽  
Hongbo Jiang

Motion sensors in modern smartphones have been exploited for audio eavesdropping in loudspeaker mode because of their sensitivity to vibrations. In this paper, we go one step further and explore the feasibility of using the built-in accelerometer to eavesdrop on the telephone conversation of a caller/callee who holds the phone against the cheek and ear, and we design our attack, Vibphone. The inspiration behind Vibphone is that speech-induced vibrations (SIV) are transmitted through the physical contact between phone and cheek to the accelerometer, carrying traces of the voice content. To this end, Vibphone faces three main challenges: i) accurately detecting SIV signals amid miscellaneous disturbances; ii) combating the impact of device diversity so that the attack works across a variety of scenarios; and iii) enhancing the feature-agnostic recognition model to generalize to newly issued devices and reduce training overhead. To address these challenges, we first conduct an in-depth investigation of SIV features to determine the root cause of device-diversity effects and identify a set of critical features that are highly relevant to the voice content retained in SIV signals and independent of specific devices. On top of these observations, we propose a combined method that integrates the extracted critical features with a deep neural network to recognize speech information from the spectrogram representation of acceleration signals. We implement the attack using commodity smartphones, and the results show it is highly effective. Our work brings to light a fundamental design vulnerability in the vast majority of currently deployed smartphones, which may put people's speech privacy at risk during phone calls. We also propose a practical and effective defense solution: we validate that audio eavesdropping can be prevented by randomly varying the sampling rate.
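The core attack pipeline summarized above (accelerometer trace, then spectrogram, then neural-network recognizer) can be illustrated with a minimal sketch. This is not the Vibphone implementation; the 500 Hz sampling rate, the spectrogram window sizes, and the toy CNN architecture are illustrative assumptions.

```python
# Minimal sketch of the general pipeline: accelerometer samples ->
# log-magnitude spectrogram -> small CNN classifier over word classes.
# All parameters here are illustrative assumptions, not Vibphone's values.
import numpy as np
from scipy.signal import spectrogram
import torch
import torch.nn as nn

FS = 500  # assumed accelerometer sampling rate in Hz

def accel_to_spectrogram(z_axis: np.ndarray) -> torch.Tensor:
    """Convert a 1-D acceleration trace into a log-magnitude spectrogram."""
    _, _, sxx = spectrogram(z_axis, fs=FS, nperseg=128, noverlap=96)
    log_sxx = np.log1p(sxx)  # compress dynamic range
    return torch.tensor(log_sxx, dtype=torch.float32).unsqueeze(0)  # (1, F, T)

class SpeechFromVibration(nn.Module):
    """Toy CNN mapping a spectrogram to word-class logits."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# Example: classify one second of (here, synthetic) accelerometer data.
trace = np.random.randn(FS)                      # placeholder for a real SIV trace
spec = accel_to_spectrogram(trace).unsqueeze(0)  # add batch dimension
logits = SpeechFromVibration()(spec)
```

The adaptive pooling keeps the toy classifier tolerant of variable-length traces, and the log-magnitude spectrogram compresses the wide dynamic range of speech-induced vibrations; a defense that randomly varies the sampling rate, as proposed in the abstract, would presumably perturb exactly this kind of fixed-rate spectrogram analysis.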


Author(s):  
Doğu Erdener

Speech perception has long been taken for granted as an auditory-only process. However, it is now firmly established that speech perception is an auditory-visual process in which visual speech information, in the form of lip and mouth movements, is taken into account. Traditionally, foreign language (L2) instructional methods and materials have been auditory-based. This chapter presents a general framework of evidence that visual speech information can facilitate L2 instruction. The author argues that this knowledge will help bridge the gap between psycholinguistics and L2 instruction as an applied field. The chapter also describes how orthography can be used in L2 instruction. While learners from a transparent L1 orthographic background can decipher the phonology of orthographically transparent L2s, overriding the visual speech information, that is not the case for learners from orthographically opaque L1 backgrounds.

