Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

Author(s):
Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, ...

2013, Vol 1 (1), pp. 14-25

Author(s):
Tsuyoshi Miyazaki, Toyoshiro Nakashima, Naohiro Ishii

The authors describe an improved method for detecting distinctive mouth shapes in image sequences of Japanese utterances. Their previous method uses template matching. When a Japanese phone is pronounced, two types of mouth shapes are formed: one at the beginning of the utterance (the beginning mouth shape, BeMS) and one at the end (the ending mouth shape, EMS). The previous method could detect mouth shapes, but it misdetected some because the time period in which the BeMS is formed is short. The authors therefore predicted that a high-speed camera would capture the BeMS more reliably. Experiments showed that the BeMS could indeed be captured, but a new problem appeared: deformed mouth shapes occurring in the transition from one shape to another were detected as the BeMS. This study describes the use of optical flow to prevent the detection of such shapes. The time period during which the mouth shape is deforming is detected with optical flow, and any mouth shape found in that period is ignored. The authors thus propose an improved method for detecting the BeMS and EMS in Japanese utterance image sequences using template matching and optical flow.
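The gating idea in the abstract above (skip frames while the mouth is deforming, keep frames where the shape is stable) can be sketched with a numpy-only simplification of dense optical flow: the normal-flow magnitude |I_t| / |∇I| from the brightness-constancy equation I_x·u + I_y·v + I_t = 0. The function names and the threshold below are illustrative assumptions, not the authors' implementation, which uses full optical flow alongside template matching.

```python
import numpy as np

def normal_flow_magnitude(prev, curr, eps=1e-6):
    """Per-pixel magnitude of the normal optical-flow component,
    derived from the brightness-constancy equation."""
    it = curr - prev                    # temporal derivative I_t
    iy, ix = np.gradient(curr)          # spatial derivatives I_y, I_x
    grad = np.sqrt(ix ** 2 + iy ** 2)   # |gradient I|
    return np.abs(it) / (grad + eps)

def detect_stable_frames(frames, motion_thresh=0.5):
    """Indices of frames whose mean motion magnitude is below the
    threshold; frames in a shape-to-shape transition are skipped."""
    stable = [0]  # first frame has no predecessor; treat as stable
    for i in range(1, len(frames)):
        mag = normal_flow_magnitude(frames[i - 1].astype(float),
                                    frames[i].astype(float))
        if mag.mean() < motion_thresh:
            stable.append(i)
    return stable
```

Only the frames returned by `detect_stable_frames` would then be passed to the template-matching stage, so a deformed transitional shape is never matched against the BeMS templates.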


2021, Vol 7 (5), pp. 91

Author(s):
Dimitrios Tsourounis, Dimitris Kastaniotis, Spiros Fotopoulos

Lip reading (LR) is the task of predicting speech using only the visual information of the speaker. In this work, for the first time, the benefits of alternating between spatiotemporal and spatial convolutions for learning effective features from LR sequences are studied. In this context, a new learnable module named ALSOS (Alternating Spatiotemporal and Spatial Convolutions) is introduced into the proposed LR system. The ALSOS module consists of spatiotemporal (3D) and spatial (2D) convolutions along with two conversion components (3D-to-2D and 2D-to-3D), providing a sequence-to-sequence mapping. The designed LR system uses the ALSOS module between ResNet blocks, with Temporal Convolutional Networks (TCNs) in the backend for classification. The whole framework is composed of feedforward convolutional and residual layers and can be trained end-to-end directly from image sequences for the word-level LR problem. The ALSOS module captures spatiotemporal dynamics and is advantageous for LR when combined with the ResNet topology. Experiments with different combinations of ALSOS and ResNet are performed on a Greek-language dataset simulating a medical support application scenario and on the popular large-scale LRW-500 dataset of English words. Results indicate that the proposed ALSOS module can improve the performance of an LR system. Overall, inserting the ALSOS module into the ResNet architecture yields higher classification accuracy, since it incorporates temporal information captured at the different spatial scales of the framework.
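The 3D-to-2D and 2D-to-3D conversion components described above are, at their core, reshapes between a spatiotemporal tensor and a per-frame batch: folding the time axis into the batch axis lets a spatial (2D) convolution run on every frame, and unfolding restores the sequence for the next spatiotemporal (3D) convolution. A minimal numpy sketch of that plumbing follows; the function names and the (batch, channels, time, height, width) layout are assumptions for illustration, not the paper's code.

```python
import numpy as np

def to_2d(x):
    """3D-to-2D conversion: fold the time axis into the batch axis.
    x: (batch, channels, time, height, width) ->
       (batch * time, channels, height, width)."""
    b, c, t, h, w = x.shape
    return x.transpose(0, 2, 1, 3, 4).reshape(b * t, c, h, w)

def to_3d(x, batch):
    """2D-to-3D conversion: restore the time axis so a spatiotemporal
    convolution can see the whole sequence again.
    x: (batch * time, channels, height, width) ->
       (batch, channels, time, height, width)."""
    bt, c, h, w = x.shape
    t = bt // batch
    return x.reshape(batch, t, c, h, w).transpose(0, 2, 1, 3, 4)
```

In the full ALSOS module, a 3D convolution runs before `to_2d`, a 2D convolution runs between the two conversions, and the pattern repeats, so each spatial scale of the ResNet also contributes temporal information.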


2012

Author(s):
Leonie M. Miller, Steven Roodenrys, Benjamin Arcioni
