Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

2020
Vol 34 (07)
pp. 11701-11708
Author(s):
Dezhao Luo
Chang Liu
Yu Zhou
Dongbao Yang
Can Ma
...

We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatio-temporal representations. VCP first generates “blanks” by withholding video clips and then creates “options” by applying spatio-temporal operations to the withheld clips. Finally, it fills the blanks with the “options” and learns representations by predicting the categories of the operations applied to the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in a uniform and interpretable manner. With VCP, we train spatio-temporal representation models (3D-CNNs) and apply them to action recognition and video retrieval tasks. Experiments on commonly used benchmarks show that the trained models outperform state-of-the-art self-supervised models by significant margins.
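As a rough illustration of the proxy-task formulation, the PyTorch-style sketch below applies a randomly chosen spatio-temporal operation ("option") to a clip and trains a small 3D CNN to classify which operation was used. The operation set and the tiny backbone are illustrative stand-ins, not the paper's exact choices.

```python
# Minimal sketch of a VCP-style proxy task. The operation list and backbone
# are assumptions for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn

OPS = [
    lambda x: x,                                  # identity
    lambda x: torch.flip(x, dims=(1,)),           # temporal reverse, x: (C, T, H, W)
    lambda x: torch.rot90(x, 1, dims=(2, 3)),     # 90-degree spatial rotation
    lambda x: x[:, torch.randperm(x.shape[1])],   # temporal shuffle
]

class Tiny3DCNN(nn.Module):
    """Toy stand-in for a 3D-CNN backbone such as C3D or R3D."""
    def __init__(self, num_ops):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, num_ops)

    def forward(self, x):                          # x: (N, C, T, H, W)
        return self.head(self.features(x).flatten(1))

model = Tiny3DCNN(num_ops=len(OPS))
clip = torch.rand(3, 8, 64, 64)                    # one clip: (C, T, H, W)
label = torch.randint(len(OPS), (1,))              # which "option" to apply
filled = OPS[label.item()](clip).unsqueeze(0)      # fill the blank with the option
loss = nn.functional.cross_entropy(model(filled), label)
loss.backward()                                    # representations come from this signal
```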

2021
Vol 15 (6)
pp. 1-21
Author(s):
Huandong Wang
Yong Li
Mu Du
Zhenhui Li
Depeng Jin

Both app developers and service providers have strong motivations to understand when and where certain apps are used. However, this has been a challenging problem due to highly skewed and noisy app usage data. Moreover, existing studies regard apps as independent items and thus fail to capture the hidden semantics in app usage traces. In this article, we propose App2Vec, a powerful representation learning model that learns semantic embeddings of apps while taking spatio-temporal context into account. Based on the obtained semantic embeddings, we develop a probabilistic model, built on a Bayesian mixture model and the Dirichlet process, to capture the when, where, and what semantics of app usage and to predict future usage. We evaluate our model on two different app usage datasets, involving over 1.7 million users and 2,000+ apps. Evaluation results show that App2Vec outperforms state-of-the-art algorithms in app usage prediction by over 17.0%.
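The embedding step can be pictured as a skip-gram-style objective in which apps are co-trained with discretized time and location tokens. The sketch below is a minimal, assumption-laden illustration: the shared-vocabulary token scheme, bucket sizes, and full-softmax loss are ours, not App2Vec's exact formulation.

```python
# Skip-gram-style sketch of app embeddings with spatio-temporal context.
# Token scheme (apps + hour-of-day + cell-id in one vocabulary) is assumed.
import torch
import torch.nn as nn

# toy usage trace: (app_id, hour_bucket, cell_id)
trace = [(3, 9, 1), (7, 9, 1), (3, 20, 2), (5, 20, 2)]
n_apps, n_hours, n_cells, dim = 10, 24, 4, 16

vocab = n_apps + n_hours + n_cells          # shared vocabulary of all token types
emb_in = nn.Embedding(vocab, dim)
emb_out = nn.Embedding(vocab, dim)
opt = torch.optim.Adam(list(emb_in.parameters()) + list(emb_out.parameters()), lr=0.01)

def pairs(trace):
    """Emit (center, context) pairs: each app with its time and place tokens,
    and with the next app in the trace."""
    out = []
    for i, (app, hour, cell) in enumerate(trace):
        out += [(app, n_apps + hour), (app, n_apps + n_hours + cell)]
        if i + 1 < len(trace):
            out.append((app, trace[i + 1][0]))
    return out

for center, context in pairs(trace):
    c = emb_in(torch.tensor([center]))           # (1, dim)
    score = c @ emb_out.weight.t()               # logits over the whole vocabulary
    loss = nn.functional.cross_entropy(score, torch.tensor([context]))
    opt.zero_grad(); loss.backward(); opt.step()

app_vectors = emb_in.weight[:n_apps]             # learned semantic embeddings of apps
```

A downstream predictor (in the paper, a Bayesian mixture model with a Dirichlet process prior) would then operate on these vectors rather than on raw app IDs.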


2009
Vol 42 (2)
pp. 267-282
Author(s):
W. Ren
S. Singh
M. Singh
Y.S. Zhu

2020
Vol 34 (07)
pp. 11966-11973
Author(s):
Hao Shao
Shengju Qian
Yu Liu

For a long time, the vision community has tried to learn spatio-temporal representations by combining convolutional neural networks with various temporal models, such as Markov chains, optical flow, RNNs, and temporal convolutions. However, these pipelines consume enormous computing resources because spatial and temporal information are learned alternately. A natural question is whether the temporal information can be embedded into the spatial domain so that both can be learned jointly in a single pass. In this work, we answer this question by presenting a simple yet powerful operator, the temporal interlacing network (TIN). Instead of learning temporal features separately, TIN fuses the two kinds of information by interlacing spatial representations from the past to the future, and vice versa. A differentiable interlacing target can be learned to control the interlacing process. In this way, a heavy temporal model is replaced by a simple interlacing operator. We theoretically prove that, with a learnable interlacing target, TIN performs equivalently to the regularized temporal convolution network (r-TCN), yet gains 4% more accuracy with 6x less latency on 6 challenging benchmarks. These results push the state-of-the-art performance of video understanding by a considerable margin. Not surprisingly, an ensemble of the proposed TIN won 1st place in the ICCV19 Multi-Moments in Time challenge. Code is made available to facilitate further research.
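The core idea, shifting channel groups along time under a learnable offset, can be sketched as follows. The two-group split, single scalar offset, and linear interpolation are simplifying assumptions; TIN itself predicts offsets per group from the input rather than holding one global parameter.

```python
# Minimal sketch of a temporal interlacing operator: half the channels borrow
# features from the past, half from the future, with a learnable fractional
# offset made differentiable through linear interpolation along time.
import torch
import torch.nn as nn

class Interlace(nn.Module):
    def __init__(self):
        super().__init__()
        self.offset = nn.Parameter(torch.tensor(0.5))   # learnable interlacing target

    def shift(self, x, off):                 # x: (N, C, T, H, W), fractional shift
        t = torch.arange(x.shape[2], dtype=x.dtype) + off
        lo = t.floor().clamp(0, x.shape[2] - 1).long()
        hi = (lo + 1).clamp(max=x.shape[2] - 1)
        w = (t - t.floor()).view(1, 1, -1, 1, 1)        # gradient flows to offset here
        return (1 - w) * x[:, :, lo] + w * x[:, :, hi]

    def forward(self, x):
        c = x.shape[1] // 2
        past = self.shift(x[:, :c], -self.offset)       # interlace from the past
        future = self.shift(x[:, c:], self.offset)      # interlace from the future
        return torch.cat([past, future], dim=1)

x = torch.rand(2, 8, 16, 7, 7)               # (N, C, T, H, W) feature map
y = Interlace()(x)                           # same shape, temporally interlaced
```

Because the operator is just an indexed interpolation, it adds essentially no parameters, which is where the latency advantage over a full temporal model comes from.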


Author(s):  
Liang-Hua Chen
Kuo-Hao Chin
Hong-Yuan Mark Liao

The usefulness of a video database depends on whether videos of interest can be easily located. In this paper, we propose a video retrieval algorithm based on the integration of several visual cues. In contrast to key-frame-based shot representations, our approach analyzes all frames within a shot to construct a compact representation of the video shot. In the video matching step, a similarity measure integrating color and motion features is defined to locate occurrences of similar video clips in the database. Our approach is therefore able to fully exploit the spatio-temporal information contained in video. Experimental results indicate that the proposed approach is effective and outperforms existing techniques.
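A hedged sketch of this kind of fusion is given below: per-shot color and motion histograms aggregated over all frames, combined by a weighted histogram intersection. The specific features (a single-channel color histogram, frame-difference motion statistics) and the weight `alpha` are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch of a shot-level similarity that fuses color and motion cues.
import numpy as np

def shot_signature(frames, bins=16):
    """frames: (T, H, W, 3) uint8. Aggregates ALL frames, not just a key frame.
    Uses one color channel for brevity."""
    color = np.histogram(frames[..., 0], bins=bins, range=(0, 255))[0].astype(float)
    diffs = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2, 3))
    motion = np.histogram(diffs, bins=bins, range=(0, 255))[0].astype(float)
    return color / max(color.sum(), 1e-8), motion / max(motion.sum(), 1e-8)

def similarity(sig_a, sig_b, alpha=0.6):
    """Weighted sum of histogram intersections over color and motion."""
    color = np.minimum(sig_a[0], sig_b[0]).sum()
    motion = np.minimum(sig_a[1], sig_b[1]).sum()
    return alpha * color + (1 - alpha) * motion

a = shot_signature(np.random.randint(0, 256, (12, 32, 32, 3), dtype=np.uint8))
b = shot_signature(np.random.randint(0, 256, (12, 32, 32, 3), dtype=np.uint8))
print(similarity(a, b))    # higher score = more likely the same clip occurs
```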


Author(s):  
Dahun Kim
Donghyeon Cho
In So Kweon

Self-supervised tasks such as colorization, inpainting, and jigsaw puzzles have been utilized for visual representation learning on still images when labeled images are scarce or entirely absent. Recently, this worthwhile line of research has been extended to the video domain, where the cost of human labeling is even higher. However, most existing methods are still based on 2D CNN architectures that cannot directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task called Space-Time Cubic Puzzles to train 3D CNNs using large-scale video datasets. This task requires a network to arrange permuted 3D spatio-temporal crops. By completing Space-Time Cubic Puzzles, the network learns both the spatial appearance and the temporal relations of video frames, which is our final goal. In experiments, we demonstrate that our learned 3D representation transfers well to action recognition tasks and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.
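Conceptually, the task cuts a clip into spatio-temporal crops, permutes them, and asks the network to recover the permutation. The sketch below uses four crops and a toy shared encoder; the crop layout and backbone are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of a Space-Time Cubic Puzzles-style pretext task: permute
# spatio-temporal crops of a clip and classify the permutation.
import itertools
import torch
import torch.nn as nn

PERMS = list(itertools.permutations(range(4)))   # 4 crops -> 24 classes

def make_puzzle(clip):                           # clip: (C, T, H, W)
    c, t, h, w = clip.shape
    crops = [clip[:, :t//2, :h//2], clip[:, :t//2, h//2:],
             clip[:, t//2:, :h//2], clip[:, t//2:, h//2:]]
    label = torch.randint(len(PERMS), (1,))      # which permutation was applied
    return [crops[i] for i in PERMS[label.item()]], label

class PuzzleNet(nn.Module):
    """Shared 3D-CNN encoder per crop, concatenated, then a permutation head."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(3, 16, 3, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.head = nn.Linear(16 * 4, len(PERMS))

    def forward(self, crops):
        feats = torch.cat([self.enc(c.unsqueeze(0)) for c in crops], dim=1)
        return self.head(feats)

clip = torch.rand(3, 8, 32, 32)
crops, label = make_puzzle(clip)
loss = nn.functional.cross_entropy(PuzzleNet()(crops), label)
loss.backward()    # solving the puzzle forces spatial + temporal reasoning
```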

