Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

This paper proposes a novel pretext task for self-supervised video representation learning that exploits spatiotemporal continuity in videos. It is motivated by the fact that videos are spatiotemporal by nature, so a representation learned by detecting spatiotemporal continuity/discontinuity is beneficial for downstream video content analysis tasks. A natural choice of such a pretext task is to construct spatiotemporal (3D) jigsaw puzzles and learn to solve them. However, as we demonstrate in the experiments, this task turns out to be intractable. We thus propose Constrained Spatiotemporal Jigsaw (CSJ), whereby the 3D jigsaws are formed in a constrained manner to ensure that large continuous spatiotemporal cuboids exist. This provides sufficient cues for the model to reason about the continuity. Instead of solving the puzzles directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable. The four tasks aim to learn representations sensitive to spatiotemporal continuity at both the local and global levels. Extensive experiments show that our CSJ achieves state-of-the-art results on various benchmarks.
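To make the "natural choice" mentioned in the abstract concrete, the following is a minimal sketch of an unconstrained spatiotemporal (3D) jigsaw: a clip is cut into a grid of cuboids along time, height and width, and the cuboids are shuffled. It does not implement the CSJ constraints or the four surrogate tasks, which the abstract does not specify. The function names, the nested-list video layout `[t][y][x]`, and the `grid` and `seed` parameters are assumptions made for illustration only.

```python
import random


def split_into_cuboids(video, grid):
    """Split a video (nested lists indexed [t][y][x]) into cuboids.

    grid = (gt, gy, gx) is the number of pieces along time, height, width.
    Cuboids are returned in raster order (time, then height, then width).
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    gt, gy, gx = grid
    if T % gt or H % gy or W % gx:
        raise ValueError("grid must evenly divide the video dimensions")
    st, sy, sx = T // gt, H // gy, W // gx
    cuboids = []
    for it in range(gt):
        for iy in range(gy):
            for ix in range(gx):
                cuboids.append([
                    [row[ix * sx:(ix + 1) * sx]
                     for row in frame[iy * sy:(iy + 1) * sy]]
                    for frame in video[it * st:(it + 1) * st]
                ])
    return cuboids


def assemble(cuboids, grid):
    """Inverse of split_into_cuboids: place cuboids back in raster order."""
    gt, gy, gx = grid
    st, sy = len(cuboids[0]), len(cuboids[0][0])
    video = []
    for it in range(gt):
        for t in range(st):
            frame = []
            for iy in range(gy):
                for y in range(sy):
                    row = []
                    for ix in range(gx):
                        row.extend(cuboids[(it * gy + iy) * gx + ix][t][y])
                    frame.append(row)
            video.append(frame)
    return video


def make_jigsaw(video, grid, seed=None):
    """Shuffle the cuboids of a video.

    Returns (shuffled_video, perm), where position p of the shuffled video
    holds the original cuboid perm[p]. A model solving the puzzle would
    have to predict perm (hypothetical training target, not from the paper).
    """
    cuboids = split_into_cuboids(video, grid)
    perm = list(range(len(cuboids)))
    random.Random(seed).shuffle(perm)
    return assemble([cuboids[i] for i in perm], grid), perm


# Toy 4x4x4 "video" whose values encode their own (t, y, x) position.
video = [[[t * 100 + y * 10 + x for x in range(4)]
          for y in range(4)] for t in range(4)]
shuffled, perm = make_jigsaw(video, grid=(2, 2, 2), seed=0)
```

With a `(2, 2, 2)` grid there are already 8! = 40320 possible permutations, and the count grows factorially with the number of pieces, which gives some intuition for why the abstract calls the unconstrained task intractable.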

Constructing a bowling information system with video content analysis

Proceedings of the first ACM international workshop on Multimedia databases - MMDB 2003, DOI 10.1145/951676.951681, 2003

Author(s): Wen Wen Hsieh, Arbee L.P. Chen

Keyword(s): Information System, Content Analysis, Video Content, Video Content Analysis

An expert fuzzy system to detect dangerous circumstances due to children in the traffic areas from the video content analysis

Expert Systems with Applications, DOI 10.1016/j.eswa.2012.02.051, 2012, Vol 39 (10), pp. 9108-9117, Cited By ~ 3

Author(s): M.D. Ruiz-Lozano, J. Medina, M. Delgado, J.L. Castro

Keyword(s): Content Analysis, Fuzzy System, Video Content, Video Content Analysis

Multimodal Local-Global Attention Network for Affective Video Content Analysis

IEEE Transactions on Circuits and Systems for Video Technology, DOI 10.1109/tcsvt.2020.3014889, 2021, pp. 1-1

Author(s): Yangjun Ou, Zhenzhong Chen, Feng Wu

Keyword(s): Content Analysis, Video Content, Attention Network, Video Content Analysis

Real-time video content analysis tool for consumer media storage system

IEEE Transactions on Consumer Electronics, DOI 10.1109/tce.2006.1706483, 2006, Vol 52 (3), pp. 870-878, Cited By ~ 15

Author(s): Jungong Han, D. Farin, P.H.N. de With, Weilun Lao

Keyword(s): Content Analysis, Real Time, Storage System, Analysis Tool, Video Content, Video Content Analysis, Media Storage

One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

Proceedings of the ACM on Measurement and Analysis of Computing Systems, DOI 10.1145/3491046, 2021, Vol 5 (3), pp. 1-34

Author(s): Bingqian Lu, Jianyi Yang, Weiwen Jiang, Yiyu Shi, Shaolei Ren

Keyword(s): State Of The Art, Autonomous Driving, Pareto Optimal, Video Content, Fast Evaluation, Video Content Analysis, Search Spaces, Neural Architecture, Real World Applications, Prohibitive Cost

Convolutional neural networks (CNNs) are used in numerous real-world applications such as vision-based autonomous driving and video content analysis. To run CNN inference on various target devices, hardware-aware neural architecture search (NAS) is crucial. A key requirement of efficient hardware-aware NAS is the fast evaluation of inference latencies in order to rank different architectures. While state-of-the-art approaches commonly build a latency predictor for each target device, this is a very time-consuming process that scales poorly in the presence of extremely diverse devices. In this work, we address the scalability challenge by exploiting latency monotonicity: the architecture latency rankings on different devices are often correlated. When strong latency monotonicity exists, we can re-use architectures searched for one proxy device on new target devices, without losing optimality. In the absence of strong latency monotonicity, we propose an efficient proxy adaptation technique to significantly boost the latency monotonicity. Finally, we validate our approach and conduct experiments with devices of different platforms on multiple mainstream search spaces, including MobileNet-V2, MobileNet-V3, NAS-Bench-201, ProxylessNAS and FBNet. Our results highlight that, by using just one proxy device, we can find almost the same Pareto-optimal architectures as the existing per-device NAS, while avoiding the prohibitive cost of building a latency predictor for each device.
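As an illustration of what "correlated latency rankings" can mean in practice, the sketch below scores latency monotonicity between a proxy device and a target device with Spearman's rank correlation. The choice of Spearman's coefficient, the function names, and the latency numbers are assumptions for illustration; the abstract states only that rankings on different devices are often correlated.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical latencies (ms) of the same five architectures on two devices.
proxy_latency = [12.0, 18.5, 25.1, 31.0, 40.2]
target_latency = [30.0, 41.0, 39.5, 66.0, 90.0]

score = spearman(proxy_latency, target_latency)
print(f"latency monotonicity (Spearman): {score:.3f}")
```

A score close to 1 means the two devices rank the architectures almost identically, which is the situation in which the abstract says architectures searched on the proxy device can be reused on the target device. Only the rankings matter here, so the absolute latencies on the two devices may differ by any monotone scaling.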
