Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Applied Sciences ◽

10.3390/app10207263 ◽

2020 ◽

Vol 10 (20) ◽

pp. 7263

Author(s):

Yong-Hyeok Lee ◽

Dong-Won Jang ◽

Jae-Bin Kim ◽

Rae-Hong Park ◽

Hyung-Min Park

Keyword(s):

Speech Recognition ◽

Machine Translation ◽

Short Term Memory ◽

Visual Speech ◽

Visual Modality ◽

Neural Machine Translation ◽

Visual Speech Recognition ◽

Context Vector ◽

Improved Performance ◽

Transformer Model

Since attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM) or replaced the LSTM in a transformer model to overcome the sequence-to-sequence (seq2seq) problems with the LSTM. In contrast to the neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between audio and visual modalities. As a result that the audio has richer information than the video related to lips, AVSR is hard to train attentions with balanced modalities. In order to increase the role of visual modality to a level of audio modality by fully exploiting input information in learning attentions, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector using video query and a video context vector using audio query. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force monotonic alignments required in AVSR. Recognition experiments on LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved at least a relative improvement of 7.3% on average in the word error rate (WER) compared to competing methods based on the transformer model.

Download Full-text

Speechreading using Modified Visual Feature Vectors

Emerging Applications of Natural Language Processing ◽

10.4018/978-1-4666-2169-5.ch012 ◽

2013 ◽

pp. 292-315

Author(s):

Preety Singh ◽

Vijay Laxmi ◽

M. S. Gaur

Keyword(s):

Speech Recognition ◽

Visual Cues ◽

Recognition System ◽

Spoken Word ◽

Visual Speech ◽

Visual Modality ◽

Feature Vectors ◽

Machine Perception ◽

Visual Speech Recognition ◽

Audio-Visual Speech Recognition (AVSR) is an emerging technology that helps in improved machine perception of speech by taking into account the bimodality of human speech. Automated speech is inspired from the fact that human beings subconsciously use visual cues to interpret speech. This chapter surveys the techniques for audio-visual speech recognition. Through this survey, the authors discuss the steps involved in a robust mechanism for perception of speech for human-computer interaction. The main emphasis is on visual speech recognition taking only the visual cues into account. Previous research has shown that visual-only speech recognition systems pose many challenges. The authors present a speech recognition system where only the visual modality is used for recognition of the spoken word. Significant features are extracted from lip images. These features are used to build n-gram feature vectors. Classification of speech using these modified feature vectors results in improved accuracy of the spoken word.

Download Full-text

Asynchrony modeling for audio-visual speech recognition

10.3115/1289189.1289244 ◽

2002 ◽

Author(s):

Guillaume Gravier ◽

Gerasimos Potamianos ◽

Chalapathy Neti

Keyword(s):

Speech Recognition ◽

Visual Speech ◽

Visual Speech Recognition

Download Full-text

Measuring the effect of high-speed video data on the audio-visual speech recognition accuracy

Information and Control Systems ◽

10.31799/1684-8853-2019-2-26-34 ◽

2019 ◽

pp. 26-34

Author(s):

D. V. Ivanko ◽

D. A. Ryumin ◽

A. A. Karpov ◽

M. Zelezny

Keyword(s):

Speech Recognition ◽

Recognition Accuracy ◽

Visual Speech ◽

High Speed Video ◽

Visual Speech Recognition

Download Full-text

Visual speech recognition for small scale dataset using VGG16 convolution neural network

Multimedia Tools and Applications ◽

10.1007/s11042-021-11119-0 ◽

2021 ◽

Author(s):

Shashidhar R ◽

Sudarshan Patilkulkarni

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Convolution Neural Network ◽

Visual Speech ◽

Small Scale ◽

Visual Speech Recognition

Download Full-text

Speaker-Independent Visual Speech Recognition with the Inception V3 Model

2021 IEEE Spoken Language Technology Workshop (SLT) ◽

10.1109/slt48900.2021.9383540 ◽

2021 ◽

Author(s):

Timothy Israel Santos ◽

Andrew Abel ◽

Nick Wilson ◽

Yan Xu

Keyword(s):

Speech Recognition ◽

Visual Speech ◽

Visual Speech Recognition ◽

Speaker Independent

Download Full-text

Part-Based Lipreading for Audio-Visual Speech Recognition

2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC) ◽

10.1109/smc42975.2020.9283044 ◽

2020 ◽

Author(s):

Ziling Miao ◽

Hong Liu ◽

Bing Yang

Keyword(s):

Speech Recognition ◽

Visual Speech ◽

Visual Speech Recognition

Download Full-text

Comparison between different feature extraction techniques for audio-visual speech recognition

Journal on Multimodal User Interfaces ◽

10.1007/bf02884428 ◽

2007 ◽

Vol 1 (1) ◽

pp. 7-20 ◽

Author(s):

Alin G. Chiţu ◽

Leon J. M. Rothkrantz ◽

Pascal Wiggers ◽

Jacek C. Wojdel

Keyword(s):

Feature Extraction ◽

Speech Recognition ◽

Visual Speech ◽

Extraction Techniques ◽

Visual Speech Recognition

Download Full-text

European and American Audio-Visual Speech Recognition, Using SVM in Portuguese Language

Data Compression Conference (dcc 2008) ◽

10.1109/dcc.2008.32 ◽

2008 ◽

Author(s):

Adriano de Andrade Bresolin ◽

Diamantino Rui da Silva da Silva Freitas ◽

Adriao Duarte Doria Neto ◽

Pablo Javier Alsina

Keyword(s):

Speech Recognition ◽

Visual Speech ◽

Visual Speech Recognition

Download Full-text

Large-vocabulary audio-visual speech recognition: a summary of the Johns Hopkins Summer 2000 Workshop

2001 IEEE Fourth Workshop on Multimedia Signal Processing (Cat. No.01TH8564) ◽

10.1109/mmsp.2001.962801 ◽

2002 ◽

Author(s):

C. Neti ◽

G. Potamianos ◽

J. Luettin ◽

I. Matthews ◽

H. Glotin ◽

...

Keyword(s):

Speech Recognition ◽

Visual Speech ◽

Large Vocabulary ◽

Visual Speech Recognition

Download Full-text

Lips detection for audio-visual speech recognition system

2008 International Symposium on Intelligent Signal Processing and Communications Systems ◽

10.1109/ispacs.2009.4806689 ◽

2009 ◽

Author(s):

Siew Wen Chin ◽

Li-Minn Ang ◽

Kah Phooi Seng

Keyword(s):

Speech Recognition ◽

Recognition System ◽

Visual Speech ◽

Speech Recognition System ◽

Visual Speech Recognition

Download Full-text