Generating Talking Face Landmarks from Speech

Author(s):  
Sefik Emre Eskimez ◽  
Ross K. Maddox ◽  
Chenliang Xu ◽  
Zhiyao Duan

2021 ◽
Vol 11 (15) ◽  
pp. 6975
Author(s):  
Tao Zhang ◽  
Lun He ◽  
Xudong Li ◽  
Guoqing Feng

Lipreading aims to recognize the sentences spoken by a talking face. In recent years, lipreading methods have achieved high accuracy on large datasets and made breakthrough progress. However, lipreading is still far from solved: existing methods tend to have high error rates on in-the-wild data and suffer from vanishing gradients during training and slow convergence. To overcome these problems, we propose an efficient end-to-end sentence-level lipreading model that uses an encoder built from a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder. More importantly, the proposed architecture incorporates the TCN as a feature learner to decode features. This partly eliminates the vanishing-gradient and limited-performance defects of RNNs (LSTM, GRU), yielding a notable performance improvement as well as faster convergence. Experiments show that training and convergence are 50% faster than the state-of-the-art method and that accuracy improves by 2.4% on the GRID dataset.
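The pipeline described above can be sketched in PyTorch. The block below is a minimal illustration, not the authors' implementation: the layer sizes, the character vocabulary (26 letters plus space and a CTC blank), and the design of the TemporalBlock are assumptions made for the example.

```python
# Minimal sketch of a 3D-conv + ResNet-50 + TCN encoder with a CTC decoder.
# Hyperparameters and block design are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalBlock(nn.Module):
    """One dilated 1D-conv block: conv -> ReLU -> conv -> ReLU, plus residual."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # keeps sequence length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection eases gradient flow


class LipreadingModel(nn.Module):
    def __init__(self, vocab_size=28, hidden=512):  # 26 letters + space + CTC blank
        super().__init__()
        # 3D-conv frontend over (T, H, W) captures short-range lip motion.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame ResNet-50 trunk; first conv replaced to accept 64 channels.
        resnet = models.resnet50(weights=None)
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Linear(resnet.fc.in_features, hidden)
        self.trunk = resnet
        # TCN with exponentially growing dilation replaces the usual RNN decoder.
        self.tcn = nn.Sequential(*[TemporalBlock(hidden, dilation=2 ** i) for i in range(4)])
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, x):                        # x: (B, 1, T, H, W) grayscale clips
        f = self.frontend(x)                     # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)      # (B, T, hidden) frame embeddings
        f = self.tcn(f.transpose(1, 2)).transpose(1, 2)
        return self.classifier(f).log_softmax(-1)  # (B, T, vocab) for CTC


model = LipreadingModel()
clips = torch.randn(2, 1, 30, 96, 96)            # 2 dummy clips of 30 frames
log_probs = model(clips).transpose(0, 1)         # CTCLoss expects (T, B, vocab)
targets = torch.randint(1, 28, (2, 12))          # dummy label sequences
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 30), torch.full((2,), 12))
```

The residual connections and purely convolutional temporal modelling are what let gradients bypass the long recurrent chains that cause vanishing gradients in LSTM/GRU decoders.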


2015 ◽  
Vol 26 (4) ◽  
pp. 490-498 ◽  
Author(s):  
Ferran Pons ◽  
Laura Bosch ◽  
David J. Lewkowicz

2020 ◽  
Vol 73 (6) ◽  
pp. 957-967 ◽  
Author(s):  
Merel A Burgering ◽  
Thijs van Laarhoven ◽  
Martijn Baart ◽  
Jean Vroomen

Humans quickly adapt to variations in the speech signal. Adaptation may surface as recalibration, a learning effect driven by error minimisation between a visual face and an ambiguous auditory speech signal, or as selective adaptation, a contrastive aftereffect driven by the acoustic clarity of the sound. Here, we examined whether these aftereffects occur for vowel identity and voice gender. Participants were exposed to male, female, or androgynous tokens of speakers pronouncing /e/ or /ø/ (embedded in words with a consonant-vowel-consonant structure), or to an ambiguous vowel halfway between /e/ and /ø/ dubbed onto the video of a male or female speaker pronouncing /e/ or /ø/. For both voice gender and vowel identity, we found assimilative aftereffects after exposure to ambiguous auditory adapter sounds, and contrastive aftereffects after exposure to clear auditory adapter sounds. This demonstrates that similar principles of adaptation are at play in both dimensions.
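The two aftereffects can be illustrated with a toy boundary-shift simulation. This is a sketch for intuition only: the 0-to-1 continuum, update rules, and learning rates are assumptions, not the authors' model.

```python
# Toy model of a 1-D /e/-/ø/ continuum (0.0 = clear /e/, 1.0 = clear /ø/).
# A token is labelled /e/ when it falls below the listener's category boundary.

def classify(token, boundary):
    return "/e/" if token < boundary else "/ø/"

def recalibrate(boundary, audio, visual, lr=0.3):
    """Ambiguous audio + clear visual: shift the boundary so the ambiguous
    token is absorbed into the visually specified category (assimilative)."""
    target = audio + 0.2 if visual == "/e/" else audio - 0.2
    return boundary + lr * (target - boundary)

def selectively_adapt(boundary, clear_audio, lr=0.3):
    """Clear audio: the adapted category becomes less responsive, pulling the
    boundary toward the adapter (contrastive)."""
    return boundary + lr * (clear_audio - boundary)

ambiguous = 0.5                                  # token halfway between /e/ and /ø/
b = 0.5                                          # unbiased starting boundary

b_recal = recalibrate(b, audio=ambiguous, visual="/e/")
print(classify(ambiguous, b_recal))              # "/e/": heard like the visual adapter

b_adapt = selectively_adapt(b, clear_audio=0.0)  # clear /e/ adapter
print(classify(ambiguous, b_adapt))              # "/ø/": contrastive aftereffect
```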


2020 ◽  
Vol 148 (4) ◽  
pp. 2765-2765
Author(s):  
Xizi Deng ◽  
Henny Yeung ◽  
Yue Wang

2016 ◽  
Vol 59 (1) ◽  
pp. 1-14 ◽  
Author(s):  
Victoria C. P. Knowland ◽  
Sam Evans ◽  
Caroline Snell ◽  
Stuart Rosen

Purpose: The purpose of the study was to assess the ability of children with developmental language learning impairments (LLIs) to use visual speech cues from the talking face. Method: In this cross-sectional study, 41 typically developing children (mean age: 8 years 0 months; range: 4 years 5 months to 11 years 10 months) and 27 children with diagnosed LLI (mean age: 8 years 10 months; range: 5 years 2 months to 11 years 6 months) completed a silent speechreading task and a speech-in-noise task with and without visual support from the talking face. The speech-in-noise task involved identifying a target word in a carrier sentence with a single competing speaker as a masker. Results: Children in the LLI group showed a deficit in speechreading compared with their typically developing peers; beyond the single-word level, this deficit became more apparent in older children. On the speech-in-noise task, a substantial benefit of visual cues was found regardless of age or group membership, although the LLI group showed an overall developmental delay in speech perception. Conclusion: Although children with LLI were less accurate than their peers on the speechreading and speech-in-noise tasks, both groups made equivalent use of visual cues to boost performance accuracy when listening in noise.
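For readers unfamiliar with the paradigm, a speech-in-noise trial of this kind is typically constructed by mixing a target sentence with a competing-talker masker at a chosen signal-to-noise ratio. The sketch below shows only that mixing step; the sine-wave signals, sample rate, and SNR value are placeholders, since the abstract does not specify the study's actual stimuli.

```python
# Mix a target signal with a single-talker masker at a fixed SNR (in dB).
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Scale the masker so the target-to-masker power ratio equals snr_db, then sum."""
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2)
    gain = np.sqrt(p_target / (p_masker * 10 ** (snr_db / 10)))
    return target + gain * masker

fs = 16000                                     # assumed sample rate
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 220 * t)           # stand-ins for recorded sentences
masker = np.sin(2 * np.pi * 330 * t)
trial = mix_at_snr(target, masker, snr_db=-3)  # masker 3 dB above the target
```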

