Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

2020 ◽  
Vol 34 (04) ◽  
pp. 6917-6924 ◽  
Author(s):  
Ya Zhao ◽  
Rui Xu ◽  
Xinchao Wang ◽  
Peng Hou ◽  
Haihong Tang ◽  
...  

Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, because the ambiguous nature of lip actuations makes it challenging to extract discriminant features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted from speech recognizers may provide complementary and discriminant clues, which are difficult to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audio and video sequences, as well as an innovative filtering strategy to refine the speech recognizer's predictions. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by margins of 7.66% and 2.75% in character error rate, respectively.
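As a rough illustration of the distillation idea described in the abstract above, the sketch below computes a generic soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions. It is not the LIBS objective, which distills multi-granularity knowledge and relies on an alignment scheme and a filtering strategy that the abstract does not detail. The function names, the default temperature, and the toy logits are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: a generic soft-target loss, not the paper's objective.
    """
    p = softmax(teacher_logits, temperature)  # teacher (e.g. speech recognizer)
    q = softmax(student_logits, temperature)  # student (e.g. lip reader)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # toy logits, assumed for illustration
    student = [0.1, 1.0, 2.0]
    print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice such a term is usually combined with the ordinary supervised loss on ground-truth transcripts.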

Author(s):  
Yuheng Hu ◽  
Yili Hong

Residents often rely on newspapers and television to gather hyperlocal news for community awareness and engagement. More recently, social media have emerged as an increasingly important source of hyperlocal news. Thus far, the literature on using social media to create desirable societal benefits, such as civic awareness and engagement, is still in its infancy. One key challenge in this research stream is to timely and accurately distill information from noisy social media data streams to community members. In this work, we develop SHEDR (social media–based hyperlocal event detection and recommendation), an end-to-end neural event detection and recommendation framework with a particular use case for Twitter to facilitate residents’ information seeking of hyperlocal events. The key model innovation in SHEDR lies in the design of the hyperlocal event detector and the event recommender. First, we harness the power of two popular deep neural network models, the convolutional neural network (CNN) and long short-term memory (LSTM), in a novel joint CNN-LSTM model to characterize spatiotemporal dependencies for capturing unusualness in a region of interest, which is classified as a hyperlocal event. Next, we develop a neural pairwise ranking algorithm for recommending detected hyperlocal events to residents based on their interests. To alleviate the sparsity issue and improve personalization, our algorithm incorporates several types of contextual information covering topic, social, and geographical proximities. We perform comprehensive evaluations based on two large-scale data sets comprising geotagged tweets covering Seattle and Chicago. We demonstrate the effectiveness of our framework in comparison with several state-of-the-art approaches. We show that our hyperlocal event detection and recommendation models consistently and significantly outperform other approaches in terms of precision, recall, and F-1 scores. 
Summary of Contribution: In this paper, we focus on a novel and important, yet largely underexplored, application of computing: how to improve civic engagement in local neighborhoods via local news sharing and consumption based on social media feeds. To address this question, we propose two new computational and data-driven methods: (1) a deep learning–based hyperlocal event detection algorithm that scans spatially and temporally to detect hyperlocal events from geotagged Twitter feeds; and (2) a personalized deep learning–based hyperlocal event recommender system that systematically integrates several contextual cues such as topical, geographical, and social proximity to recommend the detected hyperlocal events to potential users. We conduct a series of experiments to examine our proposed models. The outcomes demonstrate that our algorithms are significantly better than the state-of-the-art models and can provide users with more relevant information about the local neighborhoods that they live in, which in turn may boost their community engagement.


2020 ◽  
Vol 8 (5) ◽  
pp. 3978-3983

Identifying a person’s speech from lip movement is a challenging task. Even though many software tools are available for converting speech to text and vice versa, some uttered words may not be recognized as spoken, and recognition may vary from person to person because of pronunciation. In addition, in a noisy environment the uttered speech may not be perceived effectively, and the lip movement for a given utterance also varies. Lip reading has added advantages when it is augmented with speech recognition, thus increasing the perceived information. In this paper, the video file of an individual is converted to frames, and only the lip contour for vowels is extracted by calculating its area and other geometrical aspects. As part of testing, it is then compared with the lip contours of three to four people for vowels over the first 20 frames. Parameters such as the mean and centroid remain approximately the same for all people irrespective of their lip movement, but the major and minor axes change, and hence the area changes considerably. In the audio domain, vowel detection is carried out by extracting unique features of English vowel utterances using Mel Frequency Cepstrum Coefficients (MFCC); the feature vectors are orthonormalized, the normalized vectors are compared with a standard database, and approximate results are obtained.
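The geometric measurements mentioned above (contour area and centroid) can be sketched as follows, assuming the lip contour has already been extracted as an ordered list of (x, y) points. The shoelace formula used here is a standard way to obtain both quantities for a simple polygon; it is not necessarily the procedure the authors used, and the function name and sample points are hypothetical.

```python
def polygon_area_and_centroid(points):
    """Return (area, (cx, cy)) for a simple polygon given as ordered (x, y) vertices.

    Uses the shoelace formula; the contour points are assumed to be available
    already (contour extraction from video frames is not shown).
    """
    twice_signed_area = 0.0
    cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_signed_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_signed_area == 0:
        raise ValueError("degenerate contour with zero area")
    signed_area = twice_signed_area / 2.0
    centroid = (cx / (6.0 * signed_area), cy / (6.0 * signed_area))
    return abs(signed_area), centroid


if __name__ == "__main__":
    # Toy rectangular "contour", assumed for illustration only.
    contour = [(0, 0), (4, 0), (4, 2), (0, 2)]
    print(polygon_area_and_centroid(contour))
```

Computing these values per frame would give a short time series of area and centroid for each vowel, which is the kind of per-speaker comparison the abstract describes.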


2019 ◽  
Vol 11 (9) ◽  
pp. 190 ◽  
Author(s):  
Jamal ◽  
Xianqiao ◽  
Aldabbas

Emotion detection in social media is very effective for measuring people's mood about a specific topic, news item, or product. It has a wide range of applications, including identifying psychological conditions such as anxiety or depression in users. However, it is challenging to distinguish useful emotion features from a large corpus of text because emotions are subjective, with limited, fuzzy boundaries that may be expressed in different terminologies and perceptions. To tackle this issue, this paper presents a hybrid deep learning approach based on TensorFlow with Keras for emotion detection on large-scale imbalanced tweet data. First, preprocessing steps are used to extract useful features from raw tweets and remove noisy data. Second, the entropy weighting method is used to compute the importance of each feature. Third, a class balancer is applied to balance each class. Fourth, Principal Component Analysis (PCA) is applied to transform highly correlated features into normalized forms. Finally, a TensorFlow-based deep learning algorithm with Keras is proposed to predict high-quality features for emotion classification. The proposed methodology is evaluated on a dataset of 1,600,000 tweets collected from the website ‘kaggle’. The proposed approach is compared with other state-of-the-art techniques at different training ratios. The results show that the proposed approach outperformed the other techniques.
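The second step of the pipeline above, entropy weighting, can be illustrated with one common formulation, the entropy weight method, in which features whose values are spread unevenly across samples receive higher weights. The abstract does not state which formulation the authors used, so this is an assumption, as are the function name and the toy matrix.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method over the columns of a non-negative matrix.

    `matrix` is a list of rows (samples) with one value per feature.
    Returns one weight per feature; the weights sum to 1.
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two samples to normalize the entropy")
    m = len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0:
            divergences.append(0.0)  # an all-zero feature carries no information
            continue
        entropy = 0.0
        for value in column:
            if value > 0:
                p = value / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # normalize entropy to the range [0, 1]
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    if norm == 0:
        return [1.0 / m] * m  # every feature is uniform: fall back to equal weights
    return [d / norm for d in divergences]


if __name__ == "__main__":
    # Two samples, two features: the first is uniform, the second is not.
    print(entropy_weights([[1, 1], [1, 3]]))
```

A feature that takes the same value in every sample has maximal entropy and therefore a weight near zero, which matches the intuition that it cannot help separate emotion classes.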


Author(s):  
Olov Engwall ◽  
José Lopes ◽  
Ronald Cumbal

The large majority of previous work on human-robot conversations in a second language has been performed with a human wizard-of-Oz. The reasons are that automatic speech recognition of non-native conversational speech is considered to be unreliable and that the dialogue management task of selecting robot utterances that are adequate at a given turn is complex in social conversations. This study therefore investigates if robot-led conversation practice in a second language with pairs of adult learners could potentially be managed by an autonomous robot. We first investigate how correct and understandable transcriptions of second language learner utterances are when made by a state-of-the-art speech recogniser. We find both a relatively high word error rate (41%) and that a substantial share (42%) of the utterances are judged to be incomprehensible or only partially understandable by a human reader. We then evaluate how adequate the robot utterance selection is, when performed manually based on the speech recognition transcriptions or autonomously using (a) predefined sequences of robot utterances, (b) a general state-of-the-art language model that selects utterances based on learner input or the preceding robot utterance, or (c) a custom-made statistical method that is trained on observations of the wizard’s choices in previous conversations. It is shown that adequate or at least acceptable robot utterances are selected by the human wizard in most cases (96%), even though the ASR transcriptions have a high word error rate. Further, the custom-made statistical method performs as well as manual selection of robot utterances based on ASR transcriptions.
It was also found that the interaction strategy that the robot employed, which differed regarding how much the robot maintained the initiative in the conversation and if the focus of the conversation was on the robot or the learners, had marginal effects on the word error rate and understandability of the transcriptions but larger effects on the adequateness of the utterance selection. Autonomous robot-led conversations may hence work better with some robot interaction strategies.
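For readers unfamiliar with the word error rate reported above, the sketch below shows the usual definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recogniser output, divided by the number of reference words. The function name and example sentences are assumptions; the study's own scoring tool and text normalization are not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy transcripts, assumed for illustration.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.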


Author(s):  
Robinson Jiménez-Moreno ◽  
Javier Orlando Pinzón-Arenas ◽  
César Giovany Pachón-Suescún

This article presents work oriented toward assistive robotics, in which a scenario is established for a robot to deliver a tool to the hand of a user who has verbally requested it by name. For this, three convolutional neural networks are trained: one for recognition of a group of tools, which obtained an accuracy of 98% in identifying the tools established for the application, namely scalpel, screwdriver, and scissors; one for speech recognition, trained with the names of the tools in Spanish, whose validation accuracy reached 97.5% in the recognition of the words; and another for recognition of the user's hand, considering the classification of two gestures, open and closed hand, where 96.25% accuracy was achieved. With these networks, real-time tests are performed, yielding 100% accuracy in the delivery of each tool, i.e., the robot was able to correctly identify what the user requested, correctly recognize each tool, and deliver the one needed when the user opened their hand, taking an average of 45 seconds to execute the application.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Hao Hu ◽  
Mengya Gao ◽  
Mingsheng Wu

In real-world scenarios, data often have a long-tailed distribution, and training deep neural networks on such imbalanced datasets has become a great challenge. The main problem caused by a long-tailed data distribution is that common classes dominate the training results, so the network achieves very low accuracy on the rare classes. Recent work focuses on improving the network's representation ability to overcome the long-tailed problem, but it ignores adapting the network classifier to the long-tailed case, which causes an “incompatibility” problem between network representation and network classifier. In this paper, we use knowledge distillation to solve the long-tailed data distribution problem and fully optimize the network representation and classifier simultaneously. We propose multiexperts knowledge distillation with class-balanced sampling to jointly learn a high-quality network representation and classifier. A channel activation-based knowledge distillation method is also proposed to improve performance further. State-of-the-art performance on several large-scale long-tailed classification datasets shows the superior generalization of our method.
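Class-balanced sampling, mentioned above, is commonly implemented by first drawing a class uniformly at random and then drawing an example uniformly from that class, so rare classes are seen as often as common ones. The sketch below shows only this sampling step under that assumption; the multiexperts distillation itself is not reproduced, and the function name, seed, and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def class_balanced_sample(labels, num_samples, seed=0):
    """Return example indices drawn so every class is equally likely per draw."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)  # fixed order keeps the draw reproducible
    picks = []
    for _ in range(num_samples):
        chosen_class = rng.choice(classes)                # uniform over classes
        picks.append(rng.choice(by_class[chosen_class]))  # uniform within the class
    return picks


if __name__ == "__main__":
    # Toy long-tailed label set: 990 "head" examples and 10 "tail" examples.
    labels = ["head"] * 990 + ["tail"] * 10
    picks = class_balanced_sample(labels, 2000)
    tail_share = sum(labels[i] == "tail" for i in picks) / len(picks)
    print(tail_share)
```

With instance-level sampling the tail class would appear in about 1% of draws here; with class-balanced sampling it appears in roughly half, at the cost of repeating the few tail examples many times.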


2021 ◽  
Author(s):  
Manuel Fritz ◽  
Michael Behringer ◽  
Dennis Tschechlov ◽  
Holger Schwarz

Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results in exploratory clustering analyses, parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit large runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which repeatedly need to be executed within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time regarding the defined search space, i.e., provably requiring fewer executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work which simultaneously addresses the above-mentioned challenges. In our comprehensive evaluation, we unveil that our proposed methods significantly outperform state-of-the-art methods, thus especially supporting novice analysts for exploratory clustering analyses in large-scale exploration processes.


2021 ◽  
Vol 13 (22) ◽  
pp. 4599
Author(s):  
Félix Quinton ◽  
Loic Landrieu

While annual crop rotations play a crucial role for agricultural optimization, they have been largely ignored for automated crop type mapping. In this paper, we take advantage of the increasing quantity of annotated satellite data to propose to model simultaneously the inter- and intra-annual agricultural dynamics of yearly parcel classification with a deep learning approach. Along with simple training adjustments, our model provides an improvement of over 6.3% mIoU over the current state-of-the-art of crop classification, and a reduction of over 21% of the error rate. Furthermore, we release the first large-scale multi-year agricultural dataset with over 300,000 annotated parcels.
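The mIoU metric quoted above is the mean, over classes, of the intersection-over-union between predicted and true labels. The sketch below computes it for flat label lists; skipping classes that appear in neither the ground truth nor the predictions is one common convention and an assumption here, as are the function name and the toy labels. The paper's exact evaluation protocol is not given in the abstract.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes with integer ids 0..num_classes-1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # skip classes absent from both lists (assumed convention)
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in either label list")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Toy parcel labels, assumed for illustration.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Because every class contributes equally to the mean, mIoU penalizes poor performance on rare crop types more than overall accuracy does.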


2018 ◽  
Vol 9 (24) ◽  
pp. 5441-5451 ◽  
Author(s):  
Andreas Mayr ◽  
Günter Klambauer ◽  
Thomas Unterthiner ◽  
Marvin Steijaert ◽  
Jörg K. Wegner ◽  
...  

The largest comparative study to date of nine state-of-the-art drug target prediction methods finds that deep learning outperforms all other competitors. The results are based on a benchmark of 1300 assays and half a million compounds.


2020 ◽  
Vol 12 (17) ◽  
pp. 2770 ◽  
Author(s):  
Yajie Chai ◽  
Kun Fu ◽  
Xian Sun ◽  
Wenhui Diao ◽  
Zhiyuan Yan ◽  
...  

The deep convolutional neural network has made significant progress in cloud detection. However, the compromise between having a compact model and high accuracy has always been a challenging task in cloud detection for large-scale remote sensing imagery. A promising method to tackle this problem is knowledge distillation, which usually lets the compact model mimic the cumbersome model’s output to get better generalization. However, vanilla knowledge distillation methods cannot properly distill the characteristics of clouds in remote sensing images. In this paper, we propose a novel self-attention knowledge distillation approach for compact and accurate cloud detection, named Bidirectional Self-Attention Distillation (Bi-SAD). Bi-SAD lets a model learn from itself without adding additional parameters or supervision. With bidirectional layer-wise features learning, the model can get a better representation of the cloud’s textural information and semantic information, so that the cloud’s boundaries become more detailed and the predictions become more reliable. Experiments on a dataset acquired by GaoFen-1 satellite show that our Bi-SAD has a great balance between compactness and accuracy, and outperforms vanilla distillation methods. Compared with state-of-the-art cloud detection models, the parameter size and FLOPs are reduced by 100 times and 400 times, respectively, with a small drop in accuracy.

