Extraction of Lip features for the Identification of Vowels Utterances using MFCC and Geometrical Aspects

2020 ◽  
Vol 8 (5) ◽  
pp. 3978-3983

Identification of a person's speech from lip movement is a challenging task. Even though many software tools are available for converting speech to text and vice versa, some uttered words may not be recognized exactly as spoken and may vary from person to person because of pronunciation. In addition, in a noisy environment the uttered speech may not be perceived effectively, and hence the lip movement for a given speech segment varies. Lip reading therefore offers added advantages when it augments speech recognition, increasing the perceived information. In this paper, the video file of an individual person is converted into frames, and only the lip contour for vowels is extracted by calculating its area and other geometrical properties. As part of testing, these measurements are compared with the lip contours of three to four people uttering vowels over the first 20 frames. Parameters such as the mean and centroid remain approximately the same for all people irrespective of their lip movement, but the major and minor axes change, and hence the area changes considerably. In the audio domain, vowel detection is carried out by extracting unique features of English vowel utterances using Mel-Frequency Cepstrum Coefficients (MFCC); the feature vectors are orthonormalized, the normalized vectors are compared with a standard database, and results are obtained with approximation.
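As an illustration of the two feature-extraction steps described above, the following Python sketch computes the geometric descriptors from a pre-segmented binary lip mask and orthonormalizes MFCC feature vectors. The OpenCV and librosa calls are standard, but the function names, the QR-based orthonormalization, and all parameter choices are assumptions for illustration, not the paper's exact pipeline.

    import cv2
    import librosa
    import numpy as np

    def lip_geometry(mask):
        # `mask` is a binary uint8 image with the segmented lip region non-zero.
        # Returns the descriptors the abstract compares across speakers:
        # area, centroid, and the major/minor axes of the best-fit ellipse.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        lip = max(contours, key=cv2.contourArea)      # largest blob = lip contour
        area = cv2.contourArea(lip)
        m = cv2.moments(lip)
        centroid = (m["m10"] / m["m00"], m["m01"] / m["m00"])
        (_, _), (ax1, ax2), _ = cv2.fitEllipse(lip)   # needs >= 5 contour points
        return {"area": area, "centroid": centroid,
                "major_axis": max(ax1, ax2), "minor_axis": min(ax1, ax2)}

    def mfcc_features(wav_path, n_mfcc=13):
        # Extract MFCCs and orthonormalize the frame-wise feature vectors
        # (here via QR decomposition, one plausible reading of
        # "orthonormalized") so they can be compared against a database.
        y, sr = librosa.load(wav_path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        q, _ = np.linalg.qr(mfcc.T)                   # orthonormal columns
        return q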

2020 ◽  
Vol 34 (04) ◽  
pp. 6917-6924 ◽  
Author(s):  
Ya Zhao ◽  
Rui Xu ◽  
Xinchao Wang ◽  
Peng Hou ◽  
Haihong Tang ◽  
...  

Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, because the ambiguous nature of lip actuations makes it challenging to extract discriminant features from lip-movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted by speech recognizers may provide complementary and discriminant clues, which are difficult to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. Specifically, this is achieved by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we use an effective alignment scheme to handle the inconsistent lengths of the audio and video, as well as an innovative filtering strategy to refine the speech recognizer's predictions. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by margins of 7.66% and 2.75% in character error rate, respectively.
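The core idea, matching the lip reader's features against those of a frozen speech recognizer despite mismatched sequence lengths, can be sketched as follows. This is a simplified stand-in rather than the authors' LIBS implementation: linear interpolation replaces the paper's alignment scheme, the plain L2 term omits the filtering strategy, and all names are illustrative.

    import torch
    import torch.nn.functional as F

    def distillation_loss(video_feats, audio_feats):
        # video_feats: (batch, T_v, dim) from the lip-reader student
        # audio_feats: (batch, T_a, dim) from a frozen speech-recognizer teacher
        # Assumes student and teacher feature dims already match
        # (add a linear projection otherwise).
        t_a = audio_feats.size(1)
        # Resample the student sequence over time: (B, dim, T_v) -> (B, dim, T_a)
        v = F.interpolate(video_feats.transpose(1, 2), size=t_a,
                          mode="linear", align_corners=False)
        v = v.transpose(1, 2)
        return F.mse_loss(v, audio_feats.detach())    # teacher is not updated

    # Training would combine this with the usual lip-reading objective, e.g.:
    # loss = lip_reading_loss + 0.5 * distillation_loss(student_feats, teacher_feats)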


2020 ◽  
Author(s):  
Chaofeng Lan ◽  
Yuanyuan Zhang ◽  
Hongyun Zhao

Abstract: This paper draws on the training method of the Recurrent Neural Network (RNN). By increasing the number of hidden layers of the RNN, changing the activation function of the input layer from the traditional Sigmoid to Leaky ReLU, and zero-padding the first and last groups of data to enhance the effective utilization of the data, an improved denoising model, the Denoise Recurrent Neural Network (DRNN), with high calculation speed and good convergence is constructed to address the problem of low speaker recognition rates in noisy environments. With this model, random semantic speech signals from a speech library, sampled at 16 kHz with a duration of 5 seconds, are studied. The experimental signal-to-noise ratios are set to -10 dB, -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, and 25 dB. In the noisy environment, the improved model is used to denoise the Mel-Frequency Cepstral Coefficients (MFCC) and the Gammatone Frequency Cepstral Coefficients (GFCC), and the impact of the traditional and improved models on the speech recognition rate is analyzed. The research shows that the improved model can effectively remove noise from the feature parameters and improve the speech recognition rate, and the gain is most evident at low signal-to-noise ratios. When the signal-to-noise ratio is 0 dB, the speaker recognition rate is increased by 40%, an improvement of 85% compared with the traditional speech model. As the signal-to-noise ratio increases, the recognition rate gradually increases; when the signal-to-noise ratio is 15 dB, the speaker recognition rate reaches 93%.
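A minimal sketch of such a denoiser, under stated assumptions: a Leaky ReLU input layer feeds stacked recurrent layers that map noisy MFCC or GFCC frames to clean ones, with one zero-padded frame group at each end of the sequence. The layer sizes, padding width, and MSE training objective are illustrative choices that the abstract does not specify.

    import torch
    import torch.nn as nn

    class DRNNDenoiser(nn.Module):
        # Maps noisy cepstral features (MFCC or GFCC) to clean ones.
        def __init__(self, n_feats=13, hidden=128, layers=3):
            super().__init__()
            self.inp = nn.Sequential(nn.Linear(n_feats, hidden),
                                     nn.LeakyReLU(0.01))   # Leaky ReLU input layer
            self.rnn = nn.RNN(hidden, hidden, num_layers=layers,
                              batch_first=True)            # stacked hidden layers
            self.out = nn.Linear(hidden, n_feats)

        def forward(self, noisy):          # noisy: (batch, frames, n_feats)
            # Zero-pad one frame group at each end, as the abstract describes,
            # so the edge frames also see recurrent context.
            pad = torch.zeros(noisy.size(0), 1, noisy.size(2),
                              device=noisy.device)
            x = torch.cat([pad, noisy, pad], dim=1)
            h, _ = self.rnn(self.inp(x))
            return self.out(h)[:, 1:-1]    # drop the padded frames

    # Training sketch: minimize MSE between denoised and clean features, e.g.:
    # loss = nn.functional.mse_loss(model(noisy_mfcc), clean_mfcc)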

