Measuring the effect of high-speed video data on the audio-visual speech recognition accuracy

Author(s):  
D. V. Ivanko ◽  
D. A. Ryumin ◽  
A. A. Karpov ◽  
M. Zelezny
Author(s):  
PREETY SINGH ◽  
VIJAY LAXMI ◽  
MANOJ SINGH GAUR

To improve the accuracy of visual speech recognition systems, the selection of visual features is of fundamental importance. Prominent features, which are of maximum relevance for speech classification, need to be selected from a large set of extracted visual attributes. Existing methods apply feature reduction and selection techniques to the image pixels constituting the region of interest (ROI) to reduce data dimensionality. We propose applying feature selection methods to geometrical features in order to select the most dominant physical features. Two techniques, Minimum Redundancy Maximum Relevance (mRMR) and Correlation-based Feature Selection (CFS), have been applied to the extracted visual features. Experimental results show that recognition accuracy is not compromised when a few selected features from the complete visual feature set are used for classification, thereby reducing processing time and storage overheads considerably. The results are compared with the performance of principal components obtained by applying Principal Component Analysis (PCA) to our dataset; our set of selected features outperforms the PCA-transformed data. The results show that the center and corner segments of the mouth are major contributors to visual speech recognition, and that teeth pixels are a prominent visual cue. Lip width is also seen to contribute more to visual speech recognition accuracy than lip height.
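The abstract does not include an implementation, but the greedy mRMR criterion it relies on is straightforward to sketch. The code below scores each candidate feature as relevance (mutual information with the class labels) minus mean redundancy (mutual information with already-selected features), using scikit-learn's estimators. The feature matrix, quartile discretization, and class counts are illustrative assumptions, not details from the paper.

```python
# Minimal greedy mRMR sketch over hypothetical geometric lip features
# (lip width, lip height, corner coordinates, ...). Assumptions are
# marked in comments; nothing here is taken from the paper's code.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, k):
    """Greedy mRMR: each step adds the feature maximizing
    relevance (MI with labels) minus mean redundancy
    (MI with already-selected features)."""
    relevance = mutual_info_classif(X, y, random_state=0)
    # Discretize each column into quartile bins so pairwise MI is
    # well defined (an assumed, simple discretization scheme).
    Xd = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        Xd[:, j] = np.digitize(X[:, j], np.quantile(X[:, j], [0.25, 0.5, 0.75]))
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        scores = []
        for j in remaining:
            redundancy = (np.mean([mutual_info_score(Xd[:, j], Xd[:, s])
                                   for s in selected]) if selected else 0.0)
            scores.append(relevance[j] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic stand-in data: 200 frames x 12 geometric features, 5 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = rng.integers(0, 5, size=200)
print(mrmr_select(X, y, k=4))
```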


2020 ◽  
Vol 5 ◽  
pp. 87-93
Author(s):  
A.A. Axyonov ◽  
D.V. Ivanko ◽  
I.B. Lashkov ◽  
D.A. Ryumin ◽  
...  

This paper introduces a new methodology of multimodal corpus creation for audio-visual speech recognition in driver monitoring systems. Multimodal speech recognition makes it possible to use audio data when video data are useless (e.g., at nighttime), as well as to use video data in acoustically noisy conditions (e.g., on highways). The article discusses several basic scenarios in which speech recognition in the vehicle environment is required for interaction with the driver monitoring system. The methodology defines the main stages of, and requirements for, building the multimodal corpus. The paper also describes the metaparameters that the multimodal corpus must conform to. In addition, a software package for recording an audio-visual speech corpus is described.
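The recording software itself is not detailed in the abstract; the following is only a minimal sketch of what synchronized audio-video capture with per-utterance scenario metadata could look like, assuming OpenCV and the sounddevice library. File names, metadata fields, durations, and the frame loop are all illustrative, and real corpus-recording software would need hardware-timestamped synchronization rather than back-to-back starts.

```python
# Hypothetical single-utterance recorder: video via OpenCV, audio via
# sounddevice, plus a JSON sidecar tagging the driving scenario.
import json, time
import cv2
import sounddevice as sd
from scipy.io import wavfile

def record_session(session_id, scenario, duration_s=5, fps=30, sr=16000):
    cap = cv2.VideoCapture(0)  # default in-cabin camera (assumption)
    writer = cv2.VideoWriter(f"{session_id}.avi",
                             cv2.VideoWriter_fourcc(*"XVID"), fps, (640, 480))
    # Start audio capture (non-blocking), then grab frames until time is up.
    audio = sd.rec(int(duration_s * sr), samplerate=sr, channels=1)
    t_end = time.time() + duration_s
    while time.time() < t_end:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, (640, 480)))
    sd.wait()  # block until the audio buffer is fully recorded
    cap.release(); writer.release()
    wavfile.write(f"{session_id}.wav", sr, audio)
    with open(f"{session_id}.json", "w") as f:
        json.dump({"id": session_id, "scenario": scenario,
                   "fps": fps, "sample_rate": sr}, f)

# Example scenario tags ("night_cabin", "highway_noise") are invented here.
record_session("drv001_utt01", scenario="night_cabin")
```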


Author(s):  
Eslam E. El Maghraby ◽  
Amr M. Gody ◽  
M. Hesham Farouk

Background: Multimodal speech recognition has proved to be one of the most promising solutions for robust speech recognition, especially when the audio signal is corrupted by noise. Since the visual speech signal is not affected by acoustic noise, it can provide additional information to enhance recognition accuracy in noisy conditions. A critical stage in designing a robust speech recognition system is choosing a reliable classification method from the large variety of available techniques. Deep learning is well known for its ability to classify nonlinear problems while taking into account the sequential nature of the speech signal. Numerous studies have applied deep learning to Audio-Visual Speech Recognition (AVSR) owing to its achievements in both speech and image recognition. Although optimistic results have been obtained, research on improving accuracy in noisy conditions and on selecting the best classification technique continues to attract attention.

Objective: This paper aims to build an AVSR system that combines acoustic and visual speech information and uses a deep-learning-based classification technique to improve recognition performance in clean and noisy environments.

Method: Mel-frequency cepstral coefficients (MFCC) and the Discrete Cosine Transform (DCT) are used to extract effective features from the audio and visual speech signals, respectively. Since the audio feature rate is higher than the visual feature rate, linear interpolation is applied to obtain feature vectors of equal size, which are then early-integrated into a combined feature vector. Bidirectional Long Short-Term Memory (BiLSTM), a deep learning technique, is used for classification, and the results are compared to other classification techniques such as Convolutional Neural Networks (CNN) and traditional Hidden Markov Models (HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AVSR datasets, AVLetters and GRID.

Results: The proposed model gives promising results. On GRID, integrated audio-visual features achieved the highest recognition accuracies of 99.07% and 98.47%, improvements of up to 9.28% and 12.05% over audio-only for clean and noisy data, respectively. On AVLetters, the highest recognition accuracy is 93.33%, an improvement of up to 8.33% over audio-only.

Conclusion: Based on the obtained results, increasing the audio feature vector size from 13 to 39 does not yield an effective improvement in recognition accuracy in a clean environment, but it gives better performance in a noisy one. BiLSTM proves to be the best classifier for a robust speech recognition system compared to CNN and traditional HMM, because it takes into account the sequential nature of the speech signal (audio and visual). The proposed model gives a substantial improvement in recognition accuracy and a decrease in loss value in both clean and noisy environments compared with audio-only features. Compared with previously published results on the same datasets, our model achieves higher recognition accuracy, confirming its robustness.
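The Method section describes a concrete pipeline: MFCC audio features, DCT visual features, linear interpolation of the slower visual stream to the audio frame rate, frame-wise concatenation (early integration), and a BiLSTM classifier. The sketch below reproduces that flow under stated assumptions; the ROI size, the number of retained DCT coefficients, the layer width, and the synthetic inputs are placeholders, not the paper's actual settings.

```python
# Early-integration AVSR feature pipeline with a BiLSTM classifier
# (a sketch of the described method, not the authors' implementation).
import numpy as np
import librosa
from scipy.fft import dct
from scipy.interpolate import interp1d
import tensorflow as tf

def audio_features(wav, sr=16000, n_mfcc=13):
    # MFCCs from the audio track, shape (T_audio, n_mfcc).
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T

def visual_features(mouth_rois, keep=8):
    # 2-D DCT of each grayscale mouth ROI; keep the low-frequency
    # keep x keep block as the frame's feature vector.
    out = []
    for roi in mouth_rois:
        c = dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")
        out.append(c[:keep, :keep].ravel())
    return np.asarray(out)

def early_fusion(a, v):
    # Upsample the visual stream to the audio frame rate by linear
    # interpolation, then concatenate frame-wise (early integration).
    v_up = interp1d(np.linspace(0.0, 1.0, len(v)), v,
                    axis=0)(np.linspace(0.0, 1.0, len(a)))
    return np.concatenate([a, v_up], axis=1)

# Synthetic one-second utterance: 16 kHz audio, 25 mouth ROIs of 32x32.
wav = np.random.randn(16000).astype(np.float32)
rois = np.random.rand(25, 32, 32)
fused = early_fusion(audio_features(wav), visual_features(rois))

# BiLSTM over the fused frame sequence (26 classes, e.g. AVLetters).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, fused.shape[1])),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(26, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```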


2018 ◽  
Vol 12 (4) ◽  
pp. 319-328 ◽  
Author(s):  
Denis Ivanko ◽  
Alexey Karpov ◽  
Dmitrii Fedotov ◽  
Irina Kipyatkova ◽  
Dmitry Ryumin ◽  
...  

2019 ◽  
Vol 85 (6) ◽  
pp. 53-63 ◽  
Author(s):  
I. E. Vasil’ev ◽  
Yu. G. Matvienko ◽  
A. V. Pankov ◽  
A. G. Kalinin

The results of applying an early damage diagnostics technique (developed at the Mechanical Engineering Research Institute of the Russian Academy of Sciences, IMASH RAN) to detect latent damage in an aviation panel made of composite material during bench tensile tests are presented. We assessed the capabilities of the developed technique and software for damage detection at the early stage of panel loading, under elastic strain of the material, using a brittle strain-sensitive coating and simultaneous detection of cracks in the coating with the high-speed video camera "Video-print" and the acoustic emission system "A-Line 32D". In revealing a subsurface defect (a notch of the middle stringer) of the aviation panel, the experiment also tested the general concept of damage detection at the early stage of loading under elastic behavior of the material, the software specially developed for cluster analysis and classification of detected location pulses, and the equipment and software for simultaneous recording of video data streams and arrays of acoustic emission (AE) data. Synchronous recording of video images and AE pulses ensured precise control of the cracking process in the brittle strain-sensitive coating (tensocoating) at all stages of the experiment, whereas the structural-phenomenological approach made it possible to track the main trends in damage accumulation at different structural levels and to identify the sources of their origin when classifying the recorded AE data arrays. The combined use of oxide tensocoatings and high-speed video recording synchronized with the AE control system made it possible to reliably detect the subsurface defect, reveal the maximum principal strains in the area of crack formation, quantify them, and identify the main sources of AE signals while monitoring the state of the aviation panel under a load of P = 90 kN, which is about 12% of the critical load.
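The purpose-built cluster analysis software is not described in detail in the abstract. As a generic illustration of clustering located AE pulses by waveform parameters, the sketch below standardizes a synthetic pulse table and applies k-means; the parameter set, the two-cluster split, and any mapping of clusters to damage mechanisms at different structural levels are assumptions for demonstration only.

```python
# Generic AE pulse clustering sketch (not IMASH RAN's actual software).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic located AE pulses described by common waveform parameters:
# amplitude [dB], energy [a.u.], duration [us], rise time [us], counts.
rng = np.random.default_rng(1)
low_energy = rng.normal([45, 2e3, 200, 15, 20], [3, 400, 40, 3, 5], size=(120, 5))
high_energy = rng.normal([70, 8e4, 800, 35, 110], [4, 1e4, 120, 5, 15], size=(30, 5))
pulses = np.vstack([low_energy, high_energy])

# Standardize first: the parameters live on very different scales.
X = StandardScaler().fit_transform(pulses)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for c in range(2):
    print(f"cluster {c}: {np.sum(labels == c)} pulses, "
          f"mean amplitude {pulses[labels == c, 0].mean():.1f} dB")
```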


Author(s):  
Guillaume Gravier ◽  
Gerasimos Potamianos ◽  
Chalapathy Neti
