A Chinese Lip-Reading System Based on Convolutional Block Attention Module

2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Yuanyao Lu ◽  
Qi Xiao ◽  
Haiyang Jiang

In recent years, deep learning has been applied to English lip-reading. Chinese lip-reading research, however, started later, lacks relevant datasets, and has not yet reached ideal recognition accuracy. This paper therefore proposes a new hybrid neural network model to build a Chinese lip-reading system. We integrate the attention mechanism into both the CNN and the RNN. Specifically, we add the convolutional block attention module (CBAM) to the ResNet50 network, which enhances its ability to capture the small differences among the mouth patterns of similarly pronounced Chinese words and improves feature extraction during convolution. We also add a temporal attention mechanism to the GRU network, which helps extract features from consecutive lip-motion images. Because the moments before and after affect the current moment in lip reading, we assign larger weights to the key frames, making the features more representative. We validate our model through experiments on a self-built dataset. The experiments show that a Chinese lip-reading model using the convolutional block attention module (CBAM) can accurately recognize the Chinese numbers 0–9 and some frequently used Chinese words. Compared with other lip-reading systems, ours achieves better performance and higher recognition accuracy.
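The CBAM described above gates a feature map with channel attention followed by spatial attention. A minimal NumPy sketch of that two-step structure (the shared MLP and 7×7 convolution of the published module are omitted, so this illustrates the mechanism, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W); squeeze spatial dims by average- and max-pooling,
    # then gate each channel (the shared MLP of the full module is omitted)
    weights = sigmoid(feat.mean(axis=(1, 2)) + feat.max(axis=(1, 2)))  # (C,)
    return feat * weights[:, None, None]

def spatial_attention(feat):
    # pool across channels, then gate each spatial location
    weights = sigmoid(feat.mean(axis=0) + feat.max(axis=0))  # (H, W)
    return feat * weights[None, :, :]

def cbam(feat):
    # CBAM order: channel attention first, then spatial attention
    return spatial_attention(channel_attention(feat))

x = np.random.randn(8, 4, 4)   # a toy 8-channel feature map
y = cbam(x)
```

Because both gates are sigmoids in (0, 1), the module never amplifies a feature's magnitude; it only re-weights it.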

2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Boting Geng

Relation extraction from patent documents, a high-priority topic in natural language processing in recent years, is of great significance to a series of downstream patent applications, such as patent content mining, patent retrieval, and patent knowledge base construction. Because of the lengthy sentences, cross-domain technical terms, and complex structure of patent claims, it is extremely difficult to extract open triples with traditional Natural Language Processing (NLP) parsers. In this paper, we propose an Open Relation Extraction (ORE) approach that transforms relation extraction in patent claims into a sequence labeling problem, extracting non-predefined relation triples from patent claims with a hybrid neural network architecture based on a multihead attention mechanism. The hybrid framework combines Bi-LSTM and CNN to extract argument-phrase and relation-phrase features simultaneously: the Bi-LSTM network captures long-distance dependency features, the CNN captures local content features, and the multihead attention mechanism is then applied to capture potential dependency relations across the time series of the RNN model. Applied to our constructed open patent relation dataset, the proposed network outperforms both traditional machine learning classification algorithms and state-of-the-art neural network classification models in Precision, Recall, and F1.
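The multihead attention step applied to the recurrent features can be sketched in NumPy as follows; the learned Q/K/V projection matrices are replaced by identity slices for brevity, so this shows the mechanism, not the trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(X, num_heads):
    # X: (seq_len, d_model) token features from the Bi-LSTM/CNN encoder;
    # each head attends over its own slice of the feature dimension
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    out = np.empty_like(X)
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = X[:, sl], X[:, sl], X[:, sl]   # identity projections for brevity
        scores = Q @ K.T / np.sqrt(d_head)       # (seq_len, seq_len)
        out[:, sl] = softmax(scores, axis=-1) @ V
    return out

X = np.random.randn(6, 8)          # 6 tokens, 8-dim features
H = multihead_attention(X, num_heads=2)
```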


2019 ◽  
Vol 9 (24) ◽  
pp. 5432
Author(s):  
Jing Wen ◽  
Yuanyao Lu

Virtual Reality (VR) is an interactive experience technology. Human vision, hearing, expression, voice, and even touch can be added to the interaction between humans and machines. Lip-reading recognition is a new technology in the field of human-computer interaction with broad development prospects. It is particularly important in noisy environments and for the hearing-impaired population: it uses visual information from a video to make up for deficient voice information, forming a visual language that benefits from Augmented Reality (AR). The purpose is to establish an efficient and convenient mode of communication. However, the traditional lip-reading recognition system places high demands on the running speed and performance of the equipment because of its long recognition process and large number of parameters, so it is difficult to meet the requirements of practical application. In this paper, a mobile lip-reading recognition system based on the Raspberry Pi is implemented for the first time, and the application reflects the latest stage of our research. The system can be divided into three stages. First, we extract key frames from our own independent database and then use a multi-task cascaded convolutional network (MTCNN) to align the face, improving the accuracy of lip extraction. Second, we use MobileNets to extract lip image features and long short-term memory (LSTM) to extract sequence information between key frames. Finally, we compare three lip-reading models: (1) a fusion model of Bi-LSTM and AlexNet, (2) a fusion model with an attention mechanism, and (3) our proposed hybrid network model of LSTM and MobileNets. The results show that our model has fewer parameters and lower complexity, with an accuracy of 86.5% on the test dataset. Our mobile lip-reading system is therefore simpler and smaller than PC-platform alternatives and saves computing resources and memory.
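Much of the parameter saving attributed to MobileNets comes from depthwise separable convolutions, which split a standard convolution into a per-channel filter plus a 1×1 projection. A quick back-of-the-envelope count (layer sizes here are illustrative, not taken from the paper):

```python
def conv_params(k, c_in, c_out):
    # parameter count of a standard k x k convolution
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise k x k per-channel filter plus 1 x 1 pointwise projection
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)                 # 73728 parameters
dws = depthwise_separable_params(3, 64, 128)  # 576 + 8192 = 8768 parameters
ratio = std / dws                             # roughly 8.4x fewer parameters
```

For a 3×3 layer, the saving approaches a factor of 9 as the output-channel count grows, which is why such models fit on a Raspberry Pi-class device.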


Author(s):  
Nikita Markovnikov ◽  
Irina Kipyatkova

Problem: Classical automatic speech recognition systems are traditionally built from an acoustic model based on hidden Markov models and a statistical language model. Such systems demonstrate high recognition accuracy but consist of several independent, complex parts, which can cause problems when building the models. Recently, an end-to-end recognition method using deep artificial neural networks has spread. This approach makes it easy to implement models with just one neural network, and end-to-end models often demonstrate better speed and accuracy of speech recognition. Purpose: Implementation of end-to-end models for the recognition of continuous Russian speech, their tuning, and their comparison with hybrid baseline models in terms of recognition accuracy and computational characteristics such as the speed of learning and decoding. Methods: Creating an encoder-decoder speech recognition model with an attention mechanism; applying neural network stabilization and regularization techniques; augmenting the training data; and using parts of words as the neural network output. Results: We obtained an encoder-decoder model with an attention mechanism for recognizing continuous Russian speech without extracting features or using a language model. As elements of the output sequence, we used parts of words from the training set. The resulting model could not surpass the baseline hybrid models but surpassed the other baseline end-to-end models in both recognition accuracy and decoding/learning speed. The word recognition error was 24.17% and the decoding speed was 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models can work without language models for Russian while demonstrating a higher decoding speed than hybrid models. The resulting model was trained on raw data without extracting any features. We found that for Russian a hybrid attention mechanism gives the best result compared with location-based or context-based attention mechanisms. Practical relevance: The resulting models require less memory and less speech decoding time than traditional hybrid models, which can allow them to be used locally on mobile devices without calculations on remote servers.
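A hybrid attention mechanism of the kind found best here combines a content score (similarity between the decoder query and each encoder state) with a location score derived from the previous step's attention weights. A minimal NumPy sketch, with dot-product content scoring and a fixed smoothing kernel standing in for the learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention(query, keys, prev_weights, loc_kernel):
    # content score: similarity of the decoder query to each encoder key
    content = keys @ query                                         # (T,)
    # location score: convolve the previous attention weights (1-D, 'same')
    location = np.convolve(prev_weights, loc_kernel, mode="same")  # (T,)
    weights = softmax(content + location)                          # (T,)
    context = weights @ keys                                       # (d,)
    return weights, context

T, d = 5, 4
keys = np.random.randn(T, d)        # encoder states for 5 frames
query = np.random.randn(d)          # current decoder state
prev = np.full(T, 1.0 / T)          # previous step's attention weights
w, ctx = hybrid_attention(query, keys, prev,
                          loc_kernel=np.array([0.25, 0.5, 0.25]))
```

The location term biases attention toward frames near where it focused last step, which enforces the roughly monotonic alignment of speech.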


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Feng-Ping An ◽  
Jun-e Liu ◽  
Lei Bai

Pedestrian reidentification is a key technology in large-scale distributed camera systems: it can quickly and efficiently detect and track target people across large-scale distributed surveillance networks. Existing traditional pedestrian reidentification methods suffer from low recognition accuracy, low computational efficiency, and weak adaptive ability. Pedestrian reidentification algorithms based on deep learning have been widely used in this field because of their strong adaptive ability and high recognition accuracy. However, deep-learning-based pedestrian recognition has the following problems. First, during model training the initial values of the convolution kernels are usually assigned randomly, so the learning process easily falls into a local optimum. Second, parameter learning based on gradient descent exhibits gradient dispersion. Third, the information transfer between the images of a pedestrian reidentification sequence is not considered. In view of these issues, this paper first obtains the feature map matrix from the original image through a deconvolutional neural network, uses it as a convolution kernel, and then performs layer-by-layer convolution and pooling operations. Next, the second-derivative information of the error function is obtained directly, without computing the Hessian matrix, and a momentum coefficient is used to improve the convergence of backpropagation, thereby suppressing gradient dispersion. At the same time, to address information transfer across sequence images, this paper proposes a memory network model based on a multilayer attention mechanism, which uses the network to effectively store image visual information and pedestrian behavior information. Based on these ideas, this paper proposes a pedestrian reidentification algorithm combining deconvolution-network feature extraction with a multilayer-attention convolutional neural network. Experiments comparing this algorithm with other mainstream person reidentification algorithms on related datasets show that the proposed method not only has strong adaptive ability but also significantly improves average recognition accuracy and the rank-1 matching rate over other mainstream methods.
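The momentum coefficient mentioned above damps oscillation in gradient descent by accumulating an exponentially decaying sum of past gradients. A minimal sketch on a one-dimensional quadratic (the learning rate and momentum values are illustrative, not the paper's):

```python
def momentum_step(w, velocity, grad, lr=0.1, mu=0.9):
    # velocity accumulates a decaying sum of past gradients, smoothing
    # the update direction and helping it move through flat regions
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# minimize f(w) = w**2 (gradient 2*w) starting from w = 5.0
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2.0 * w)
# w has converged close to the minimum at 0
```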


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Qiang Duan ◽  
Jianhua Fan ◽  
Xianglin Wei ◽  
Chao Wang ◽  
Xiang Jiao ◽  
...  

Recognizing signals is critical for understanding the increasingly crowded wireless spectrum in noncooperative communications. Traditional threshold- or pattern-recognition-based solutions are labor-intensive and error-prone, so practitioners have started to apply deep learning to automatic modulation classification (AMC). However, the recognition accuracy and robustness of recently presented neural-network-based proposals are still unsatisfactory, especially when the signal-to-noise ratio (SNR) is low. Against this backdrop, this paper presents a hybrid neural network model, called MCBL, which combines a convolutional neural network, bidirectional long short-term memory, and an attention mechanism to exploit their respective capabilities to extract the spatial, temporal, and salient features embedded in the signal samples. After formulating the AMC problem, the three modules of our hybrid neural network are detailed. To evaluate the performance of our proposal, 10 state-of-the-art neural networks (including two recent models) are chosen as benchmarks for comparison experiments on an open radio frequency (RF) dataset. Results show that the recognition accuracy of MCBL reaches 93%, the highest among the tested DNN models, while its computational efficiency and robustness are better than those of existing proposals.
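In a CNN-BiLSTM-attention pipeline of this kind, the attention module typically scores each timestep of the recurrent output and pools them into a single vector for the classifier. A minimal NumPy sketch, with a random query vector standing in for the learned one:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, v):
    # H: (T, d) per-timestep features from the Bi-LSTM; v: (d,) learned query.
    # Scores each timestep, then returns the weighted sum fed to the classifier.
    weights = softmax(H @ v)        # (T,) importance of each timestep
    return weights @ H, weights     # pooled (d,) vector and the weights

T, d = 128, 16                      # e.g. 128 samples per signal frame
H = np.random.randn(T, d)
v = np.random.randn(d)
pooled, w = attention_pool(H, v)
```

At low SNR this lets the classifier lean on the few informative timesteps instead of averaging noise uniformly.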


2021 ◽  
Vol 70 (1) ◽  
pp. 010501-010501
Author(s):  
Huang Wei-Jian ◽  
Li Yong-Tao ◽  
Huang Yuan

2021 ◽  
Vol 13 (23) ◽  
pp. 4820
Author(s):  
Xiaoxu Liu ◽  
Weihua Bai ◽  
Junming Xia ◽  
Feixiong Huang ◽  
Cong Yin ◽  
...  

Based on deep learning, this paper proposes a new hybrid neural network model, a recurrent deep neural network using a feature attention mechanism (FA-RDN), for GNSS-R global sea surface wind speed retrieval. FA-RDN can process data from the Cyclone Global Navigation Satellite System (CYGNSS) satellite mission, including signal, spatio-temporal, geometry, and instrument characteristics. FA-RDN accepts data extended along the temporal dimension and mines the temporal correlation of features through a long short-term memory (LSTM) neural network layer; a feature attention mechanism is also added to improve the model's computational efficiency. To evaluate the model's performance, we designed comparison and validation experiments for the retrieval accuracy, enhancement effect, and stability of FA-RDN by comparing the evaluation criteria results. The results show that the wind speed retrieval root mean square error (RMSE) of the FA-RDN model reaches 1.45 m/s, an improvement of 10.38%, 6.58%, 13.28%, 17.89%, 20.26%, and 23.14% over Backpropagation Neural Network (BPNN), Recurrent Neural Network (RNN), Artificial Neural Network (ANN), Random Forests (RF), eXtreme Gradient Boosting (XGBoost), and Support Vector Regression (SVR), respectively, confirming the feasibility and effectiveness of the designed method. The designed model also has better stability and applicability, offering a new approach to data mining and feature selection as well as a reference model for GNSS-R-based sea surface wind speed retrieval.
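A feature attention mechanism of this kind re-weights the input features (signal, geometry, instrument, and so on) before they enter the LSTM. A minimal NumPy sketch, with a random scoring matrix in place of the learned weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def feature_attention(x, W):
    # x: (F,) one timestep's input features; W: (F, F) learned scoring weights.
    # alpha assigns each feature a normalized importance before the LSTM.
    alpha = softmax(W @ x)          # (F,), sums to 1
    return alpha * x, alpha         # re-weighted features and the weights

F = 6                               # toy feature count
x = np.random.randn(F)
W = np.random.randn(F, F)
x_att, alpha = feature_attention(x, W)
```

Beyond accuracy, the learned alpha doubles as a feature-selection signal: features that consistently receive near-zero weight are candidates for removal.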


2019 ◽  
Vol 9 (8) ◽  
pp. 1599 ◽  
Author(s):  
Yuanyao Lu ◽  
Hongbo Li

With the improvement of computer performance, virtual reality (VR), as a new mode of visual operation and interaction, gives automatic lip-reading technology based on visual features broad development prospects. In an immersive VR environment, the user's state can be captured through lip movements, thereby analyzing the user's thinking in real time. Because of complex image processing, hard-to-train classifiers, and long recognition processes, the traditional lip-reading recognition system has difficulty meeting the requirements of practical applications. In this paper, a convolutional neural network (CNN) used for image feature extraction is combined with a recurrent neural network (RNN) based on an attention mechanism for automatic lip-reading recognition. Our proposed method can be divided into three steps. First, we extract keyframes from our own independently established database (English pronunciation of the numbers zero to nine by three males and three females). Then, we use the Visual Geometry Group (VGG) network to extract lip image features; the extraction results are found to be fault-tolerant and effective. Finally, we compare two lip-reading models: (1) a fusion model with an attention mechanism and (2) a fusion model of two networks. The results show that the accuracy of the proposed model is 88.2% on the test dataset, versus 84.9% for the contrastive model. Our proposed method is therefore superior to traditional lip-reading recognition methods and general neural networks.
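The keyframe extraction step can be as simple as uniform temporal sampling. A sketch under that assumption (the paper does not specify its selection rule; real systems may instead pick frames by lip-motion energy):

```python
def extract_keyframes(num_frames, num_keyframes):
    # uniformly sample keyframe indices across the clip,
    # always keeping the first and the last frame
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]

# e.g. 10 keyframes from a 75-frame utterance clip
idx = extract_keyframes(75, 10)
```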


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Ruochen Lu ◽  
Muchao Lu

With the return of deep learning methods to the public eye, more and more scholars and industry researchers have begun exploring the possibility of using neural networks to solve the stock prediction problem, and some progress has been made. However, although neural networks have powerful function-fitting ability, they are often criticized for their lack of explanatory power: because of the large number of parameters and complex structure of neural network models, researchers are unable to explain the predictive logic of most neural networks, test the significance of model parameters, or summarize laws that humans can understand and use. Inspired by technical analysis theory in the field of stock investment, this paper selects neural network models with different characteristics and extracts effective feature combinations from short-term stock price fluctuation data. In addition, while ensuring that the model's prediction performance is not lower than that of mainstream models, this paper uses the attention mechanism to further mine predictive K-line patterns, which both summarizes usable judgment experience for human researchers and explains the prediction logic of the hybrid neural network. Experiments show that the model's classification performance is better and investor sentiment is obtained more accurately, with an accuracy rate of 85%, laying the foundation for the full stock trend prediction model. In terms of explaining the model's prediction logic, the experiments demonstrate that the K-line patterns mined with the attention mechanism have significantly stronger predictive power than general K-line patterns, and this result explains the prediction basis of the hybrid neural network.
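Reading the attention weights back out is what turns the mined K-line patterns into something a human researcher can use: softmax-normalized scores rank candidate patterns by their contribution to the prediction. A toy sketch with hypothetical pattern names and scores (not values from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# hypothetical attention scores a trained model might assign to patterns
patterns = ["hammer", "doji", "engulfing", "three_white_soldiers"]
scores = np.array([1.2, -0.3, 2.1, 0.4])
weights = softmax(scores)

# ranking patterns by attention weight yields a human-readable explanation
ranking = [patterns[i] for i in np.argsort(weights)[::-1]]
```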

