Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
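
The character and word error rates mentioned above are standard edit-distance metrics in speech recognition. The following is a minimal sketch of how they are commonly computed, not the authors' evaluation code; the function names `edit_distance`, `cer`, and `wer` and the example sentence are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # Before row i is processed, prev[j] is the distance between the
    # first i-1 items of ref and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: the short word "a" is dropped by the recognizer.
reference = "bin blue at a one now"
hypothesis = "bin blue at one now"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this example a single missing short word costs one word edit out of six but only two character edits out of 21. This is one reason the two metrics are usually reported together.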

Feature extraction using multimodal convolutional neural networks for visual speech recognition

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7952701
Year: 2017
Cited By ~ 11

Author(s): Eric Tatulli, Thomas Hueber

Keyword(s): Neural Networks, Feature Extraction, Speech Recognition, Convolutional Neural Networks, Visual Speech, Visual Speech Recognition

Audio-Visual Speech Recognition using 3D Convolutional Neural Networks

DOI: 10.1109/asyu52992.2021.9599016
Year: 2021

Author(s): Ceren Belhan, Damla Fikirdanis, Ovgu Cimen, Pelin Pasinli, Zeynep Akgun, ...

Keyword(s): Neural Networks, Speech Recognition, Convolutional Neural Networks, Visual Speech, Visual Speech Recognition

Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique

Engineering, Technology & Applied Science Research
DOI: 10.48084/etasr.4102
Year: 2021
Vol 11 (2), pp. 6986-6992

Author(s): L. Poomhiran, P. Meesad, S. Nuanmeesri

Keyword(s): Neural Networks, Speech Recognition, Convolutional Neural Networks, Recognition Performance, Visual Speech, Image Frame, Visual Speech Recognition, Lip Reading, Image Technique, Accuracy Validation

This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), which consists of (a) the Start-Lip Image (SLI), (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), which marks the end of the pronunciation of the syllable. The lip area was reduced to 32×32 pixels per image frame, and three keyframes concatenated together, with a combined dimension of 96×32 pixels, were used to represent one syllable for visual speech recognition. The three concatenated keyframes representing a syllable are selected based on the relative maximum and relative minimum of the open lip's width and height. The evaluation of the model's effectiveness showed accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04%, respectively, for the THDigits dataset. The C3-SKI technique was also applied to the AVDigits dataset, showing 85.62% accuracy. In conclusion, the C3-SKI technique can be applied to lip reading recognition.
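
The keyframe concatenation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the three 32×32 frames are joined side by side (96 pixels wide, 32 high), and that "relative maximum and relative minimum" means strict local extrema of a per-frame lip-opening measurement. The names `concat_keyframes` and `relative_extrema` and the sample data are hypothetical.

```python
def concat_keyframes(start, middle, end):
    """Join three equally sized frames side by side, row by row."""
    frames = (start, middle, end)
    height = len(start)
    if any(len(f) != height for f in frames):
        raise ValueError("all keyframes must have the same height")
    return [start[r] + middle[r] + end[r] for r in range(height)]


def relative_extrema(signal):
    """Return indices of strict local maxima and minima of a 1-D sequence."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] > signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] < signal[i + 1]:
            minima.append(i)
    return maxima, minima


# Three dummy 32x32 grayscale frames (lists of rows) with constant values,
# standing in for the start, middle, and end lip images.
sli = [[0] * 32 for _ in range(32)]
mli = [[128] * 32 for _ in range(32)]
eli = [[255] * 32 for _ in range(32)]

c3ski = concat_keyframes(sli, mli, eli)
print(len(c3ski[0]), len(c3ski))  # width and height: 96 32

# Hypothetical lip-opening heights (in pixels) over consecutive frames.
lip_height = [2, 5, 3, 1, 4, 4, 6, 2]
print(relative_extrema(lip_height))  # ([1, 6], [3])
```

Plain nested lists keep the sketch dependency-free; in practice an array library would typically handle the image data.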

Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953172
Year: 2017
Cited By ~ 10

Author(s): Hendrik Meutzner, Ning Ma, Robert Nickel, Christopher Schymura, Dorothea Kolossa

Keyword(s): Neural Networks, Speech Recognition, Deep Neural Networks, Visual Speech, Visual Speech Recognition, Reliability Estimates

A NOVEL TASK-ORIENTED APPROACH TOWARD AUTOMATED LIP-READING SYSTEM IMPLEMENTATION

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
DOI: 10.5194/isprs-archives-xliv-2-w1-2021-85-2021
Year: 2021
Vol XLIV-2/W1-2021, pp. 85-89

Author(s): D. Ivanko, D. Ryumin

Keyword(s): Speech Recognition, Visual Information, Visual Speech, System Implementation, Visual Speech Recognition, Rough Approximation, Lip Reading, Reading System, Task Oriented, Oriented Approach

Abstract. Visual information plays a key role in automatic speech recognition (ASR) when audio is corrupted by background noise or is entirely inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face with limited or no access to the sound of the voice. Based on our experimental evaluations and an analysis of the research field, we propose a novel task-oriented approach toward practical lip-reading system implementation. Its main purpose is to serve as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. As a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem: first, recognizing isolated words, numbers, or short phrases (e.g., telephone numbers with a strict grammar, or keywords); and second, recognizing continuous speech (phrases or sentences). All these stages are disclosed in detail in this paper. Based on the proposed approach, we implemented from scratch automatic visual speech recognition systems of three different architectures: GMM-CHMM, DNN-HMM, and purely end-to-end. The methodology, tools, step-by-step development, and all necessary parameters are described in detail in the paper. It is worth noting that such systems were created for Russian speech recognition for the first time.

Enhancing Robustness in Speech Recognition using Visual Information

Speech, Image, and Language Processing for Human Computer Interaction
DOI: 10.4018/978-1-4666-0954-9.ch008
Year: 2012
pp. 149-171

Author(s): Omar Farooq, Sekharjit Datta

Keyword(s): Speech Recognition, Visual Information, Recognition Performance, Recognition Task, Visual Speech, Future Research, Visual Speech Recognition, Important Challenge, Future Research Directions, Main Components

The area of speech recognition has been thoroughly researched during the past fifty years; however, robustness is still an important challenge to overcome. It has been established that there is a correlation between the speech produced and lip motion, which helps to improve recognition performance in adverse background conditions. This chapter presents the main components used in audio-visual speech recognition systems. Results of a prototype experiment on a simple phoneme recognition task, conducted on audio-visual corpora for Hindi speech, are reported. The chapter also addresses some of the issues related to visual feature extraction and audio-visual integration, and finally presents future research directions.
