Generating Talking Face Landmarks from Speech

Author(s):  
Sefik Emre Eskimez ◽  
Ross K. Maddox ◽  
Chenliang Xu ◽  
Zhiyao Duan

2021 ◽
Vol 11 (15) ◽  
pp. 6975
Author(s):  
Tao Zhang ◽  
Lun He ◽  
Xudong Li ◽  
Guoqing Feng

Lipreading aims to recognize the sentences spoken by a talking face. In recent years, lipreading methods have achieved high accuracy on large datasets and made breakthrough progress. However, lipreading is still far from solved: existing methods tend to have high error rates on in-the-wild data and suffer from vanishing gradients during training and slow convergence. To overcome these problems, we propose an efficient end-to-end sentence-level lipreading model that uses an encoder built from a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder. More importantly, the proposed architecture incorporates the TCN as a feature learner to decode features. This partly eliminates the vanishing-gradient and limited-performance defects of RNNs (LSTM, GRU), yielding a notable performance improvement as well as faster convergence. Experiments show that training and convergence are 50% faster than the state-of-the-art method and that accuracy improves by 2.4% on the GRID dataset.
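The pipeline described above can be sketched in PyTorch. The block below is a minimal illustration, not the authors' implementation: the layer sizes, the character vocabulary (26 letters plus space and a CTC blank), and the design of the TemporalBlock are assumptions made for the example.

```python
# Minimal sketch of a 3D-conv + ResNet-50 + TCN encoder with a CTC decoder.
# Hyperparameters and block design are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalBlock(nn.Module):
    """One dilated 1D-conv block: conv -> ReLU -> conv -> ReLU, plus residual."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # keeps sequence length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection eases gradient flow


class LipreadingModel(nn.Module):
    def __init__(self, vocab_size=28, hidden=512):  # 26 letters + space + CTC blank
        super().__init__()
        # 3D-conv frontend over (T, H, W) captures short-range lip motion.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame ResNet-50 trunk; first conv replaced to accept 64 channels.
        resnet = models.resnet50(weights=None)
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Linear(resnet.fc.in_features, hidden)
        self.trunk = resnet
        # TCN with exponentially growing dilation replaces the usual RNN decoder.
        self.tcn = nn.Sequential(*[TemporalBlock(hidden, dilation=2 ** i) for i in range(4)])
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, x):                        # x: (B, 1, T, H, W) grayscale clips
        f = self.frontend(x)                     # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)      # (B, T, hidden) frame embeddings
        f = self.tcn(f.transpose(1, 2)).transpose(1, 2)
        return self.classifier(f).log_softmax(-1)  # (B, T, vocab) for CTC


model = LipreadingModel()
clips = torch.randn(2, 1, 30, 96, 96)            # 2 dummy clips of 30 frames
log_probs = model(clips).transpose(0, 1)         # CTCLoss expects (T, B, vocab)
targets = torch.randint(1, 28, (2, 12))          # dummy label sequences
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 30), torch.full((2,), 12))
```

The residual connections and purely convolutional temporal modelling are what let gradients bypass the long recurrent chains that cause vanishing gradients in LSTM/GRU decoders.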


2015 ◽  
Vol 26 (4) ◽  
pp. 490-498 ◽  
Author(s):  
Ferran Pons ◽  
Laura Bosch ◽  
David J. Lewkowicz

2020 ◽  
Vol 73 (6) ◽  
pp. 957-967 ◽  
Author(s):  
Merel A Burgering ◽  
Thijs van Laarhoven ◽  
Martijn Baart ◽  
Jean Vroomen

Humans quickly adapt to variations in the speech signal. Adaptation may surface as recalibration, a learning effect driven by error minimisation between a visual face and an ambiguous auditory speech signal, or as selective adaptation, a contrastive aftereffect driven by the acoustic clarity of the sound. Here, we examined whether these aftereffects occur for vowel identity and voice gender. Participants were exposed to male, female, or androgynous tokens of speakers pronouncing /e/ or /ø/ (embedded in words with a consonant-vowel-consonant structure), or to an ambiguous vowel halfway between /e/ and /ø/ dubbed onto the video of a male or female speaker pronouncing /e/ or /ø/. For both voice gender and vowel identity, we found assimilative aftereffects after exposure to ambiguous auditory adapter sounds, and contrastive aftereffects after exposure to clear auditory adapter sounds. This demonstrates that similar principles of adaptation are at play in both dimensions.
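The two aftereffects can be illustrated with a toy boundary-shift simulation. This is a sketch for intuition only: the 0-to-1 continuum, update rules, and learning rates are assumptions, not the authors' model.

```python
# Toy model of a 1-D /e/-/ø/ continuum (0.0 = clear /e/, 1.0 = clear /ø/).
# A token is labelled /e/ when it falls below the listener's category boundary.

def classify(token, boundary):
    return "/e/" if token < boundary else "/ø/"

def recalibrate(boundary, audio, visual, lr=0.3):
    """Ambiguous audio + clear visual: shift the boundary so the ambiguous
    token is absorbed into the visually specified category (assimilative)."""
    target = audio + 0.2 if visual == "/e/" else audio - 0.2
    return boundary + lr * (target - boundary)

def selectively_adapt(boundary, clear_audio, lr=0.3):
    """Clear audio: the adapted category becomes less responsive, pulling the
    boundary toward the adapter (contrastive)."""
    return boundary + lr * (clear_audio - boundary)

ambiguous = 0.5                                  # token halfway between /e/ and /ø/
b = 0.5                                          # unbiased starting boundary

b_recal = recalibrate(b, audio=ambiguous, visual="/e/")
print(classify(ambiguous, b_recal))              # "/e/": heard like the visual adapter

b_adapt = selectively_adapt(b, clear_audio=0.0)  # clear /e/ adapter
print(classify(ambiguous, b_adapt))              # "/ø/": contrastive aftereffect
```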


2020 ◽  
Vol 148 (4) ◽  
pp. 2765-2765
Author(s):  
Xizi Deng ◽  
Henny Yeung ◽  
Yue Wang

2016 ◽  
Vol 59 (1) ◽  
pp. 1-14 ◽  
Author(s):  
Victoria C. P. Knowland ◽  
Sam Evans ◽  
Caroline Snell ◽  
Stuart Rosen

Purpose: The purpose of the study was to assess the ability of children with developmental language learning impairments (LLIs) to use visual speech cues from the talking face. Method: In this cross-sectional study, 41 typically developing children (mean age: 8 years 0 months; range: 4 years 5 months to 11 years 10 months) and 27 children with diagnosed LLI (mean age: 8 years 10 months; range: 5 years 2 months to 11 years 6 months) completed a silent speechreading task and a speech-in-noise task with and without visual support from the talking face. The speech-in-noise task involved identifying a target word in a carrier sentence with a single competing speaker as a masker. Results: Children in the LLI group showed a deficit in speechreading compared with their typically developing peers; beyond the single-word level, this deficit became more apparent in older children. On the speech-in-noise task, a substantial benefit of visual cues was found regardless of age or group membership, although the LLI group showed an overall developmental delay in speech perception. Conclusion: Although children with LLI were less accurate than their peers on the speechreading and speech-in-noise tasks, both groups made equivalent use of visual cues to boost performance accuracy when listening in noise.
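For readers unfamiliar with the paradigm, a speech-in-noise trial of this kind is typically constructed by mixing a target sentence with a competing-talker masker at a chosen signal-to-noise ratio. The sketch below shows only that mixing step; the sine-wave signals, sample rate, and SNR value are placeholders, since the abstract does not specify the study's actual stimuli.

```python
# Mix a target signal with a single-talker masker at a fixed SNR (in dB).
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Scale the masker so the target-to-masker power ratio equals snr_db, then sum."""
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2)
    gain = np.sqrt(p_target / (p_masker * 10 ** (snr_db / 10)))
    return target + gain * masker

fs = 16000                                     # assumed sample rate
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 220 * t)           # stand-ins for recorded sentences
masker = np.sin(2 * np.pi * 330 * t)
trial = mix_at_snr(target, masker, snr_db=-3)  # masker 3 dB above the target
```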

