Method and apparatus for audio-visual speech detection and recognition

2005 ◽  
Vol 118 (2) ◽  
pp. 597
Author(s):  
Sankar Basu
2017 ◽  
Vol 60 (1) ◽  
pp. 136-143 ◽  
Author(s):  
Robert S. Schlauch ◽  
Heekyung J. Han ◽  
Tzu-Ling J. Yu ◽  
Edward Carney

Purpose The purpose of this article is to examine explanations for pure-tone average–spondee threshold differences in functional hearing loss. Method Loudness magnitude estimation functions were obtained from 24 participants for pure tones (0.5 and 1.0 kHz), vowels, spondees, and speech-shaped noise as a function of level (20–90 dB SPL). Participants listened monaurally through earphones. Loudness predictions were obtained for the same stimuli by using a computational, dynamic loudness model. Results When evaluated at the same SPL, speech-shaped noise was judged louder than vowels/spondees, which were judged louder than tones. Equal-loudness levels were inferred from fitted loudness functions for the group. For the clinical application, the 2.1-dB difference between spondees and tones at equal loudness became a 12.1-dB difference when the stimuli were converted from SPL to HL. Conclusions Nearly all of the pure-tone average–spondee threshold differences in functional hearing loss are attributable to the calibration references for 0 dB HL for tones and speech, which are based on detection and recognition, respectively. The recognition threshold for spondees is roughly 9 dB higher than the speech detection threshold; persons feigning a loss, who base loss magnitude on loudness, do not consider this difference. Furthermore, the dynamic loudness model was more accurate than the static model.
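A minimal worked example of the SPL-to-HL arithmetic behind that 2.1-dB-to-12.1-dB shift is sketched below. The RETSPL values are illustrative placeholders chosen to reproduce the reported gap, not the standardized ANSI S3.6 calibration figures; only the 2.1-dB equal-loudness difference comes from the abstract.

```python
# Sketch of the SPL-to-HL conversion described above. The RETSPL values are
# illustrative placeholders, not the standardized calibration references.

def spl_to_hl(level_db_spl: float, retspl_db: float) -> float:
    """Convert a level in dB SPL to dB HL, given the 0-dB-HL reference (RETSPL)."""
    return level_db_spl - retspl_db

RETSPL_TONE_DB = 7.0     # assumed detection-based reference for the pure tone
RETSPL_SPEECH_DB = 17.0  # assumed recognition-based reference for speech (higher)

# Equal-loudness levels in SPL: the spondee sits 2.1 dB below the tone,
# since speech is judged louder than a tone at the same SPL (per the abstract).
tone_db_spl = 60.0
spondee_db_spl = tone_db_spl - 2.1

tone_db_hl = spl_to_hl(tone_db_spl, RETSPL_TONE_DB)          # 53.0 dB HL
spondee_db_hl = spl_to_hl(spondee_db_spl, RETSPL_SPEECH_DB)  # 40.9 dB HL
print(f"Equal-loudness gap in HL: {tone_db_hl - spondee_db_hl:.1f} dB")  # 12.1
```

The point of the sketch is that the same physical difference widens in HL because the 0-dB-HL reference for speech (recognition-based) sits higher in SPL than the reference for tones (detection-based).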


2016 ◽  
Vol 6 (1) ◽  
Author(s):  
Takefumi Ohki ◽  
Atsuko Gunji ◽  
Yuichi Takei ◽  
Hidetoshi Takahashi ◽  
Yuu Kaneko ◽  
...  

2021 ◽  
Author(s):  
Shashidhar R ◽  
Sudarshan Patil Kulkarni

Abstract Audio-visual speech recognition is currently an emerging field of research, but there is still a shortage of appropriate visual features for recognizing visual speech. Human lip-readers are increasingly presented as useful for gathering forensic evidence but, like all humans, are unreliable when analyzing lip movement. Here we use a custom dataset and design the system so that it predicts the spoken word from lip reading. Speaker-independent lip reading is a demanding problem because of unpredictable variation between people, yet with recent advances in signal processing and computer vision, automating lip reading has become a field of great interest; this is why AVSR attracts attention as a reliable solution to the speech detection problem. We use MFCC features for audio processing and an LSTM for visual speech recognition, and finally integrate the audio and video streams using a feed-forward neural network (FFNN). The final model was capable of making more appropriate decisions when predicting the spoken word, achieving an accuracy of about 92.38%.
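A minimal sketch of the kind of pipeline the abstract describes (MFCC audio features, an LSTM over lip-region frames, and feed-forward fusion) is shown below. The library choices (librosa, Keras), input shapes, layer sizes, and vocabulary size are assumptions for illustration, not the authors' configuration.

```python
# Minimal AVSR sketch: MFCC audio branch + LSTM visual branch + FFNN fusion.
# Shapes, layer sizes, and the vocabulary size are illustrative assumptions.

import librosa
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_WORDS = 10               # assumed vocabulary size
FRAMES, H, W = 25, 50, 100   # assumed lip-ROI clip: 25 frames of 50x100 pixels

def audio_mfcc(wav_path, n_mfcc=13):
    """Load an audio clip and return a (time, n_mfcc) MFCC matrix for the audio branch."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Audio branch: summarize the MFCC sequence with a bidirectional LSTM.
audio_in = layers.Input(shape=(None, 13))
audio_feat = layers.Bidirectional(layers.LSTM(64))(audio_in)

# Visual branch: flatten each lip frame, then run an LSTM over time.
video_in = layers.Input(shape=(FRAMES, H, W))
frames = layers.Reshape((FRAMES, H * W))(video_in)
video_feat = layers.LSTM(128)(frames)

# Fusion: concatenate both streams and classify with a feed-forward network.
fused = layers.Concatenate()([audio_feat, video_feat])
x = layers.Dense(128, activation="relu")(fused)
x = layers.Dropout(0.3)(x)
out = layers.Dense(NUM_WORDS, activation="softmax")(x)

model = Model([audio_in, video_in], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```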


2004 ◽  
Vol 44 (1-4) ◽  
pp. 19-30 ◽  
Author(s):  
Jeesun Kim ◽  
Chris Davis

2019 ◽  
Vol 62 (10) ◽  
pp. 3860-3875 ◽  
Author(s):  
Kaylah Lalonde ◽  
Lynne A. Werner

Purpose This study assessed the extent to which 6- to 8.5-month-old infants and 18- to 30-year-old adults detect and discriminate auditory syllables in noise better in the presence of visual speech than in auditory-only conditions. In addition, we examined whether visual cues to the onset and offset of the auditory signal account for this benefit. Method Sixty infants and 24 adults were randomly assigned to speech detection or discrimination tasks and were tested using a modified observer-based psychoacoustic procedure. Each participant completed 1–3 conditions: auditory-only, with visual speech, and with a visual signal that only cued the onset and offset of the auditory syllable. Results Mixed linear modeling indicated that infants and adults benefited from visual speech on both tasks. Adults relied on the onset–offset cue for detection, but the same cue did not improve their discrimination. The onset–offset cue benefited infants for both detection and discrimination. Whereas the onset–offset cue improved detection similarly for infants and adults, the full visual speech signal benefited infants to a lesser extent than adults on the discrimination task. Conclusions These results suggest that infants' use of visual onset–offset cues is mature, but their ability to use more complex visual speech cues is still developing. Additional research is needed to explore differences in audiovisual enhancement (a) of speech discrimination across speech targets and (b) with increasingly complex tasks and stimuli.
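As a rough illustration of the mixed linear modeling mentioned in the abstract, the sketch below fits a random-intercept model to synthetic detection thresholds; the column names, factor levels, and generated values are hypothetical stand-ins, not the study's data or exact model.

```python
# Sketch of a mixed linear model of audiovisual benefit on synthetic data.
# All values and factor names are hypothetical placeholders.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for subj in range(30):
    group = "infant" if subj < 20 else "adult"
    for cond in ("auditory_only", "visual_speech", "onset_offset_cue"):
        base = 6.0 if group == "infant" else 2.0          # assumed baseline threshold (dB SNR)
        benefit = {"auditory_only": 0.0,
                   "visual_speech": -3.0,
                   "onset_offset_cue": -1.5}[cond]         # assumed visual benefit
        rows.append({"subject": subj, "group": group, "condition": cond,
                     "threshold": base + benefit + rng.normal(0, 1.5)})
df = pd.DataFrame(rows)

# Random intercept per participant; fixed effects of age group and visual condition.
model = smf.mixedlm("threshold ~ group * condition", df, groups=df["subject"])
result = model.fit()
print(result.summary())
```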


1978 ◽  
Vol 85 (3) ◽  
pp. 192-206 ◽  
Author(s):  
David M. Green ◽  
Theodore G. Birdsall
