Temporal Attention Mechanism with Conditional Inference for Large-Scale Multi-label Video Classification

Author(s):  
Eun-Sol Kim ◽  
Kyoung-Woon On ◽  
Jongseok Kim ◽  
Yu-Jung Heo ◽  
Seong-Ho Choi ◽  
...  
2021 ◽  
Vol 12 (4) ◽  
pp. 79-97
Author(s):  
Zengkai Wang

Video classification has been an active research field in computer vision over the last few years. Its main purpose is to produce a label that is relevant to a video given its frames. Unlike image classification, which takes still pictures as input, video classification takes a sequence of images as input. The complex spatial and temporal structure of video sequences makes them difficult to understand and expensive to process, and it should be modeled to improve classification performance. This work focuses on sports video classification but can be extended to other applications. In this paper, the authors propose a novel sports video classification method that processes the video data using a convolutional neural network (CNN) with a spatial attention mechanism and a deep bidirectional long short-term memory (BiLSTM) network with a temporal attention mechanism. The method first extracts 28 frames from each input video and uses a classical pre-trained CNN to extract deep features; the spatial attention mechanism is then applied to the CNN features to decide 'where' to look. Next, the BiLSTM models the long-term temporal dependence across the video frame sequence, and the temporal attention mechanism is employed to decide 'when' to look. Finally, the label of the input video is given by the classification network. To evaluate the feasibility and effectiveness of the proposed method, an extensive experimental investigation was conducted on the challenging open sports video datasets Sports8 and Olympic16; the results show that the proposed CNN-BiLSTM network with spatial and temporal attention can effectively model the spatial-temporal characteristics of video sequences. The average classification accuracy on Sports8 is 98.8%, which is 6.8% higher than the existing method. An average classification accuracy of 90.46% is achieved on Olympic16, which is about 18% higher than the existing methods.
The proposed approach outperforms the state-of-the-art methods, and the experimental results demonstrate its effectiveness.
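
The 'when to look' step described above can be illustrated with a minimal sketch of temporal attention pooling. This is not the authors' implementation: it assumes each frame is already encoded as a feature vector (for example a BiLSTM output), and the `query` vector and the function names are hypothetical stand-ins for learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, query):
    """Weight each frame by its relevance to `query` and sum the frames.

    frame_features: list of T per-frame feature vectors (assumed inputs).
    query: hypothetical learned vector with the same dimension as a frame.
    Returns the pooled video vector and the per-frame attention weights.
    """
    # Relevance score of each frame: dot product with the query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Attention-weighted sum over time, one output value per feature dimension.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = temporal_attention_pool(frames, query=[1.0, 1.0])
```

In this toy input the third frame has the largest score, so it receives the largest weight; the pooled vector would then be passed to a classification layer.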


2021 ◽  
Vol 13 (5) ◽  
pp. 905
Author(s):  
Chuyi Wu ◽  
Feng Zhang ◽  
Junshi Xia ◽  
Yichen Xu ◽  
Guoqing Li ◽  
...  

Building damage status is vital for planning rescue and reconstruction after a disaster, yet damage is hard to detect and its level is hard to judge. Most existing studies focus on binary classification, and the attention of the model is distracted. In this study, we propose a Siamese neural network that can localize and classify damaged buildings at the same time. The main parts of this network are a variety of attention U-Nets using different backbones. The attention mechanism enables the network to pay more attention to the effective features and channels, so as to reduce the impact of useless features. We train them on the xBD dataset, a large-scale dataset for the advancement of building damage assessment, and compare their balanced F (F1) scores. The scores show that SEresNeXt with an attention mechanism performs best, with an F1 score of 0.787. To improve accuracy, we fused the results and obtained the best overall F1 score of 0.792. To verify the transferability and robustness of the model, we selected data from the Maxar Open Data Program for two recent disasters to investigate the performance. By visual comparison, the results show that our model is robust and transferable.
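
For readers unfamiliar with the balanced F (F1) score used above, here is a minimal sketch of the standard binary definition. The abstract does not say how scores are averaged across damage classes or tasks, so this shows only the per-class formula; the function name and the counts are illustrative.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, written in terms of raw counts.

    F1 = 2 * TP / (2 * TP + FP + FN). Returns 0.0 when there are no
    positives at all, which is a convention chosen for this sketch.
    """
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    return 2 * true_positives / denominator


# Illustrative counts: precision 8/10 and recall 8/10.
score = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```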


Author(s):  
Huiqun Huang ◽  
Xi Yang ◽  
Suining He

Forecasting urban anomaly events in a timely manner is of great importance to city management and planning. However, anomaly event prediction is highly challenging due to the sparseness of data, geographic heterogeneity (e.g., complex spatial correlation, skewed spatial distribution of anomaly events and crowd flows), and dynamic temporal dependencies. In this study, we propose M-STAP, a novel Multi-head Spatio-Temporal Attention Prediction approach to address the problem of multi-region urban anomaly event prediction. Specifically, M-STAP considers the problem from three main aspects: (1) extracting the spatial characteristics of the anomaly events in different regions, and the spatial correlations between anomaly events and crowd flows; (2) modeling the impact of the crowd flow dynamics of the most relevant regions at each time step on the anomaly events; and (3) employing an attention mechanism to analyze the varying impacts of historical anomaly events on the predicted data. We have conducted extensive experimental studies on crowd flow and anomaly event data from New York City, Melbourne and Chicago. Our proposed model shows higher accuracy (41.91% improvement on average) in predicting multi-region anomaly events compared with state-of-the-art methods.


2021 ◽  
pp. 016173462110425
Author(s):  
Jianing Xi ◽  
Jiangang Chen ◽  
Zhao Wang ◽  
Dean Ta ◽  
Bing Lu ◽  
...  

Large-scale early scanning of fetuses via ultrasound imaging is widely used to reduce the morbidity and mortality caused by congenital anomalies in fetal hearts and lungs. To reduce the intensive cost of manually recognizing organ regions, many automatic segmentation methods have been proposed. However, existing methods still face a multi-scale problem caused by the wide range of receptive fields of organs in images, a resolution problem in the segmentation mask, and interference from task-irrelevant features, all of which hinder accurate segmentation. To achieve semantic segmentation that (1) extracts multi-scale features from images, (2) compensates for lost high-resolution information, and (3) eliminates task-irrelevant features, we propose a multi-scale model that integrates a skip connection framework and an attention mechanism. The multi-scale feature extraction modules are combined with additive attention gate units for irrelevant feature elimination, within a U-Net framework whose skip connections provide information compensation. The performance on fetal heart and lung segmentation indicates the superiority of our method over existing deep learning based approaches. Our method also shows competitive performance stability in semantic segmentation, which suggests a promising contribution to ultrasound-based prognosis of congenital anomalies for early intervention and to alleviating their negative effects.
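
The additive attention gate mentioned above can be sketched as follows. This is a simplified illustration rather than the authors' code: it works on one feature vector per spatial position, omits bias terms and resampling, and the weight matrices `w_x`, `w_g` and the vector `psi` are hypothetical stand-ins for learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def additive_attention_gate(skip, gating, w_x, w_g, psi):
    """Scale each skip-connection vector by a coefficient in (0, 1).

    skip:   list of feature vectors from the encoder skip connection.
    gating: list of gating vectors from the coarser decoder level,
            assumed here to be aligned position by position with `skip`.
    """
    gated, alphas = [], []
    for x, g in zip(skip, gating):
        # Additive combination of the two projections, followed by ReLU.
        hidden = [max(0.0, a + b) for a, b in zip(matvec(w_x, x), matvec(w_g, g))]
        # Project to a scalar and squash to get the attention coefficient.
        alpha = sigmoid(sum(p * h for p, h in zip(psi, hidden)))
        alphas.append(alpha)
        gated.append([alpha * v for v in x])
    return gated, alphas


skip = [[1.0, 2.0], [0.0, 0.0]]
gating = [[1.0], [-1.0]]
w_x = [[1.0, 0.0], [0.0, 1.0]]
w_g = [[1.0], [1.0]]
psi = [1.0, 1.0]
gated, alphas = additive_attention_gate(skip, gating, w_x, w_g, psi)
```

Positions whose combined response is weak get a coefficient near 0.5 or below and are attenuated, while strongly supported positions pass through almost unchanged.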


2020 ◽  
Vol 14 (3) ◽  
pp. 320-328
Author(s):  
Long Guo ◽  
Lifeng Hua ◽  
Rongfei Jia ◽  
Fei Fang ◽  
Binqiang Zhao ◽  
...  

With the rapid growth of e-commerce in recent years, e-commerce platforms are becoming a primary place for people to find, compare and ultimately purchase products. To improve the online shopping experience for consumers and increase sales for sellers, it is important to understand user intent accurately and to be notified promptly when it changes. In this way, the right information can be offered to the right person at the right time. To achieve this goal, we propose a unified deep intent prediction network, named EdgeDIPN, which is deployed at the edge, i.e., on the mobile device, and is able to monitor multiple user intents at different granularities simultaneously in real time. We propose to train EdgeDIPN with multi-task learning, by which EdgeDIPN can share representations between different tasks for better performance while saving edge resources. In particular, we propose a novel task-specific attention mechanism which enables different tasks to pick out the most relevant features from different data sources. To extract the shared representations more effectively, we utilize two kinds of attention mechanisms: the multi-level attention mechanism tries to identify the important actions within each data source, and the inter-view attention mechanism learns the interactions between different data sources. In experiments conducted on a large-scale industrial dataset, EdgeDIPN significantly outperforms the baseline solutions. Moreover, EdgeDIPN has been deployed in the operational system of Alibaba. Online A/B testing results in several business scenarios reveal the potential of monitoring user intent in real time. To the best of our knowledge, EdgeDIPN is the first full-fledged real-time user intent understanding center deployed at the edge and serving hundreds of millions of users on a large-scale e-commerce platform.


Author(s):  
Yongyi Tang ◽  
Lin Ma ◽  
Lianqiang Zhou

Appearance and motion are two key components for depicting and characterizing video content. Currently, two-stream models achieve state-of-the-art performance on video classification. However, extracting motion information, specifically in the form of optical flow features, is extremely computationally expensive, especially for large-scale video classification. In this paper, we propose a motion hallucination network, namely MoNet, to imagine the optical flow features from the appearance features, with no reliance on optical flow computation. Specifically, MoNet models the temporal relationships of the appearance features and exploits the contextual relationships of the optical flow features with concurrent connections. Extensive experimental results demonstrate that the proposed MoNet can effectively and efficiently hallucinate the optical flow features, which together with the appearance features consistently improve video classification performance. Moreover, MoNet can help cut the computational and data-storage burdens of two-stream video classification almost in half. Our code is available at: https://github.com/YongyiTang92/MoNet-Features


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Haibo Pang ◽  
Qi Xuan ◽  
Meiqin Xie ◽  
Chengming Liu ◽  
Zhanbo Li

Target tracking is a significant topic in the field of computer vision. In this paper, a target tracking algorithm based on a deep Siamese network is studied. To address cases where tracking is not robust, such as drifting or missing the target, the accuracy and robustness of the algorithm are improved by enhancing the feature extraction part and the online update part. This paper adds an SE-block and a temporal attention mechanism (TAM) to the framework of the Siamese neural network. The SE-block refines the extracted features: different channels are given different weights according to their importance, which improves the discrimination of the network and the recognition ability of the tracker. The temporal attention mechanism updates the target state by adjusting the weights of samples from the current frame and historical frames, which addresses the model drift caused by similar backgrounds. We use cross-entropy loss to distinguish the targets in different sequences so that they lie farther apart in the feature domain and the features are easier to identify. We train and test the network on three benchmarks and compare it with several state-of-the-art tracking methods. The experimental results demonstrate that the proposed algorithm is superior to other methods in both visualized tracking results and evaluation criteria. The proposed algorithm can handle the occlusion problem effectively while maintaining real-time tracking performance.
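
The channel reweighting performed by an SE-block, as described above, can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions, not the tracker's code: each channel is stored as a flat list of activations, biases are omitted, and `w_reduce` and `w_expand` are hypothetical stand-ins for the two learned fully connected layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def se_block(feature_maps, w_reduce, w_expand):
    """Squeeze-and-excitation over C channels.

    feature_maps: list of C channels, each a flat list of spatial activations.
    w_reduce: (C/r x C) matrix for the bottleneck layer (assumed weights).
    w_expand: (C x C/r) matrix that restores one gate per channel.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_maps]
    # Excitation: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    gates = [sigmoid(g) for g in matvec(w_expand, hidden)]
    # Scale: multiply every activation in a channel by that channel's gate.
    scaled = [[g * a for a in channel] for g, channel in zip(gates, feature_maps)]
    return scaled, gates


feature_maps = [[1.0, 1.0], [2.0, 2.0]]
w_reduce = [[1.0, 1.0]]
w_expand = [[1.0], [-1.0]]
scaled, gates = se_block(feature_maps, w_reduce, w_expand)
```

With these toy weights the first channel is emphasized and the second is suppressed, which is the "different weights according to importance" behavior the abstract refers to.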


Author(s):  
Hehe Fan ◽  
Zhongwen Xu ◽  
Linchao Zhu ◽  
Chenggang Yan ◽  
Jianjun Ge ◽  
...  

We aim to significantly reduce the computational cost of classifying temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames at a predefined frequency over the entire video. In contrast, we propose an end-to-end deep reinforcement learning approach which enables an agent to classify videos by watching only a very small portion of the frames, as humans do. We make two main contributions. First, information is not equally distributed across video frames over time. An agent needs to watch more carefully when a clip is informative and skip frames that are redundant or irrelevant. The proposed approach enables the agent to adapt its sampling rate to the video content and skip most of the frames without loss of information. Second, the number of frames an agent must watch to reach a confident decision varies greatly from one video to another. We incorporate an adaptive stop network that measures a confidence score and generates a timely trigger to stop the agent from watching further, which improves efficiency without loss of accuracy. Our approach significantly reduces the computational cost on the large-scale YouTube-8M dataset, while the accuracy remains the same.
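
The idea of stopping once the decision is confident can be illustrated with a much simpler stand-in than the learned agent described above. The sketch below is a fixed-threshold early-exit loop, not the paper's reinforcement learning method or its adaptive stop network; the per-frame probabilities are assumed to come from some hypothetical frame classifier.

```python
def classify_with_early_stop(frame_probs, threshold=0.9):
    """Watch frames in order and stop once the running mean is confident.

    frame_probs: list of per-frame class probability lists (assumed inputs).
    threshold:   confidence needed to stop early (illustrative value).
    Returns (predicted class index, number of frames watched).
    """
    num_classes = len(frame_probs[0])
    running = [0.0] * num_classes
    best, watched = 0, 0
    for watched, probs in enumerate(frame_probs, start=1):
        # Accumulate evidence and average it over the frames seen so far.
        running = [r + p for r, p in zip(running, probs)]
        mean = [r / watched for r in running]
        best = max(range(num_classes), key=lambda c: mean[c])
        if mean[best] >= threshold:
            break  # confident enough: skip the remaining frames
    return best, watched


frame_probs = [[0.6, 0.4], [0.95, 0.05], [0.99, 0.01], [0.5, 0.5]]
label, frames_used = classify_with_early_stop(frame_probs, threshold=0.8)
```

A learned stop policy replaces the fixed threshold with a network that decides when to stop, and it can also choose which frame to watch next instead of scanning in order.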


2019 ◽  
Vol 29 (11n12) ◽  
pp. 1727-1740 ◽  
Author(s):  
Hongming Zhu ◽  
Yi Luo ◽  
Qin Liu ◽  
Hongfei Fan ◽  
Tianyou Song ◽  
...  

Multistep flow prediction is an essential task for car-sharing systems. An accurate flow prediction model can help system operators pre-allocate cars to meet user demand. However, this task is challenging due to the complex spatial and temporal relations among stations. Existing works consider only temporal relations (e.g. using LSTM) or spatial relations (e.g. using CNN) independently. In this paper, we propose an attention to multi-graph convolutional sequence-to-sequence model (AMGC-Seq2Seq), which is a novel deep learning model for multistep flow prediction. The proposed model uses the encoder–decoder architecture, in which the encoder encodes spatial and temporal relations simultaneously. The encoded information is then passed to the decoder to generate multistep outputs. In this work, specific multiple graphs are constructed to reflect spatial relations from different aspects, and we model them using the proposed multi-graph convolution. An attention mechanism is also used to capture the important relations from previous information. Experiments on a large-scale real-world car-sharing dataset demonstrate the effectiveness of our approach over state-of-the-art methods.

