Speechreading using Modified Visual Feature Vectors

Author(s):  
Preety Singh ◽  
Vijay Laxmi ◽  
M. S. Gaur

Audio-Visual Speech Recognition (AVSR) is an emerging technology that improves machine perception of speech by taking into account the bimodality of human speech. Automated speech recognition is inspired by the fact that human beings subconsciously use visual cues to interpret speech. This chapter surveys techniques for audio-visual speech recognition. Through this survey, the authors discuss the steps involved in a robust mechanism for perception of speech for human-computer interaction. The main emphasis is on visual speech recognition, which takes only the visual cues into account. Previous research has shown that visual-only speech recognition systems pose many challenges. The authors present a speech recognition system in which only the visual modality is used to recognize the spoken word. Significant features are extracted from lip images and used to build n-gram feature vectors. Classifying speech with these modified feature vectors improves recognition accuracy of the spoken word.
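
As a minimal sketch of the n-gram idea described above, the snippet below concatenates the lip features of n consecutive frames into a single vector and feeds the result to a classifier. The feature names, the choice of k-nearest-neighbour classification, and the toy data are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch: n-gram feature vectors from per-frame lip features,
# followed by word classification. Not the published implementation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def build_ngram_vectors(frame_features, n=3):
    """Concatenate the features of n consecutive frames into one n-gram vector."""
    frame_features = np.asarray(frame_features)
    return np.array([
        frame_features[i:i + n].ravel()
        for i in range(len(frame_features) - n + 1)
    ])

# Toy example: two utterances, 10 frames each, 4 geometric lip features
# per frame (e.g. mouth width, height, area, perimeter -- assumed features).
rng = np.random.default_rng(0)
utterance_a = rng.normal(0.0, 1.0, size=(10, 4))
utterance_b = rng.normal(1.0, 1.0, size=(10, 4))

X = np.vstack([build_ngram_vectors(utterance_a), build_ngram_vectors(utterance_b)])
y = [0] * 8 + [1] * 8            # one word label per n-gram vector
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(build_ngram_vectors(utterance_b)))   # mostly class 1
```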

2020 ◽  
Vol 10 (20) ◽  
pp. 7263
Author(s):  
Yong-Hyeok Lee ◽  
Dong-Won Jang ◽  
Jae-Bin Kim ◽  
Rae-Hong Park ◽  
Hyung-Min Park

Since the attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM), or has replaced the LSTM in the transformer model, to overcome sequence-to-sequence (seq2seq) problems with the LSTM. In contrast to neural machine translation, audio-visual speech recognition (AVSR) may improve performance by learning the correlation between the audio and visual modalities. Because the audio carries richer information than the video of the lips, it is hard to train attention in AVSR with balanced modalities. To raise the role of the visual modality to the level of the audio modality by fully exploiting the input information when learning attention, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector computed with a video query and a video context vector computed with an audio query. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to enforce the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved a relative improvement of at least 7.3% on average in word error rate (WER) over competing methods based on the transformer model.
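
The following sketch illustrates the dual cross-modality attention described above: video queries attend over audio keys/values to form an audio context, and audio queries attend over video keys/values to form a video context. The dimensions, the truncation-based fusion, and the module name are assumptions made for illustration; they are not the published architecture or its hyperparameters.

```python
# Minimal sketch of dual cross-modality (DCM) attention (assumed details).
import torch
import torch.nn as nn

class DCMAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # video queries attend over audio keys/values -> audio context vector
        self.audio_ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # audio queries attend over video keys/values -> video context vector
        self.video_ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T_audio, d_model), video_feats: (B, T_video, d_model)
        a_ctx, _ = self.audio_ctx(video_feats, audio_feats, audio_feats)
        v_ctx, _ = self.video_ctx(audio_feats, video_feats, video_feats)
        # crude fusion for the sketch: truncate to a shared length, then project
        t = min(a_ctx.size(1), v_ctx.size(1))
        return self.fuse(torch.cat([a_ctx[:, :t], v_ctx[:, :t]], dim=-1))

audio = torch.randn(2, 120, 256)   # e.g. 120 acoustic frames
video = torch.randn(2, 30, 256)    # e.g. 30 lip-region frames
print(DCMAttention()(audio, video).shape)   # torch.Size([2, 30, 256])
```

In the actual hybrid CTC/attention setup, the fused representation would feed a decoder trained jointly with a CTC branch; that training loop is omitted here.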


2021 ◽  
Vol 15 ◽  
Author(s):  
Luuk P. H. van de Rijt ◽  
A. John van Opstal ◽  
Marc M. van Wanrooij

The cochlear implant (CI) allows profoundly deaf individuals to partially recover hearing. Still, due to the coarse acoustic information provided by the implant, CI users have considerable difficulties in recognizing speech, especially in noisy environments. CI users therefore rely heavily on visual cues to augment speech recognition, more so than normal-hearing individuals. However, it is unknown how attention to one (focused) or both (divided) modalities plays a role in multisensory speech recognition. Here we show that unisensory speech listening and speechreading were negatively impacted in divided-attention tasks for CI users, but not for normal-hearing individuals. Our psychophysical experiments revealed that, as expected, listening thresholds were consistently better for the normal-hearing group, while lipreading thresholds were largely similar for the two groups. Moreover, audiovisual speech recognition for normal-hearing individuals could be described well by probabilistic summation of auditory and visual speech recognition, while CI users were better integrators than expected from statistical facilitation alone. Our results suggest that this benefit in integration comes at a cost: unisensory speech recognition is degraded for CI users when attention needs to be divided across modalities. We conjecture that CI users exhibit an integration-attention trade-off. They focus solely on a single modality during focused-attention tasks, but need to divide their limited attentional resources in situations with uncertainty about the upcoming stimulus modality. We argue that in order to determine the benefit of a CI for speech recognition, situational factors need to be discounted by presenting speech in realistic or complex audiovisual environments.
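
The probabilistic-summation benchmark mentioned above assumes the auditory and visual channels contribute independently, so the predicted audiovisual score is p_av = p_a + p_v - p_a * p_v. The short calculation below illustrates this; the example probabilities are made up and are not data from the study.

```python
# Illustrative calculation of the probabilistic-summation (statistical
# facilitation) prediction; input scores are hypothetical.
def probabilistic_summation(p_audio, p_visual):
    """Predicted audiovisual recognition if the two channels act independently."""
    return p_audio + p_visual - p_audio * p_visual

p_a, p_v = 0.40, 0.30                           # hypothetical unisensory scores
print(f"predicted AV score: {probabilistic_summation(p_a, p_v):.2f}")   # 0.58
# An observed audiovisual score well above this prediction would indicate
# integration beyond statistical facilitation, as reported here for CI users.
```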

