MST-CSS (Multi-Spectro-Temporal Curvature Scale Space), a Novel Spatio-Temporal Representation for Content-Based Video Retrieval

2010 ◽ Vol 20 (8) ◽ pp. 1080-1094 ◽ Author(s): A. Dyana, S. Das
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

2020 ◽ Vol 34 (07) ◽ pp. 11701-11708 ◽ Author(s): Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, ...

We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatio-temporal representations. VCP first generates “blanks” by withholding video clips and then creates “options” by applying spatio-temporal operations to the withheld clips. Finally, it fills the blanks with the “options” and learns representations by predicting the categories of the operations applied to the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in a uniform and interpretable manner. With VCP, we train spatio-temporal representation models (3D-CNNs) and apply them to action recognition and video retrieval tasks. Experiments on commonly used benchmarks show that the trained models outperform state-of-the-art self-supervised models by significant margins.
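
The abstract describes VCP only at a high level. As a rough illustration of the underlying pattern (apply one of a small set of spatio-temporal operations to a clip, then train a 3D-CNN to classify which operation was applied), here is a minimal PyTorch sketch. The operation set, the tiny backbone, and all names such as apply_operation and OpClassifier are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a VCP-style pretext task: apply one of several
# spatio-temporal operations to a clip and train a 3D-CNN to predict
# which operation was used. Operation set, backbone, and names are
# illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn

def apply_operation(clip: torch.Tensor, op_id: int) -> torch.Tensor:
    """clip: (C, T, H, W). Returns the transformed clip."""
    if op_id == 0:                        # identity (the "original" option)
        return clip
    if op_id == 1:                        # temporal shuffle: permute frames
        perm = torch.randperm(clip.size(1))
        return clip[:, perm]
    if op_id == 2:                        # temporal reversal
        return torch.flip(clip, dims=[1])
    if op_id == 3:                        # spatial rotation by 90 degrees
        return torch.rot90(clip, k=1, dims=[2, 3])
    raise ValueError(f"unknown op {op_id}")

class OpClassifier(nn.Module):
    """Tiny 3D-CNN that classifies which operation produced a clip."""
    def __init__(self, num_ops: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, num_ops)

    def forward(self, x):                 # x: (B, C, T, H, W)
        feats = self.backbone(x).flatten(1)
        return self.head(feats)

# One self-supervised training step on a random stand-in batch.
model = OpClassifier()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
clips = torch.randn(8, 3, 16, 64, 64)             # stand-in for real clips
labels = torch.randint(0, 4, (8,))                # sample an operation per clip
inputs = torch.stack([apply_operation(c, int(l)) for c, l in zip(clips, labels)])
loss = nn.functional.cross_entropy(model(inputs), labels)
optim.zero_grad(); loss.backward(); optim.step()
```

The supervision signal is free: the label is the operation identifier itself, so the network is forced to notice temporal order and spatial layout to tell the options apart.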


Videos are recorded and uploaded daily to sites such as YouTube and Facebook from devices such as mobile phones and digital cameras, often with little or no metadata (semantic tags) attached. This makes it extremely difficult to retrieve similar videos from metadata alone, without content-based semantic search. Content-based video retrieval is the problem of retrieving the videos most similar to a given query video, and it has a wide range of applications such as video browsing, content filtering, and video indexing. Traditional video-level features are built on hand-engineered key-frame features, which do not exploit the rich dynamics present in a video. In this paper we propose a fast content-based video retrieval framework using compact spatio-temporal features learned by deep learning. Specifically, a deep CNN combined with an LSTM is deployed to learn spatio-temporal representations of videos. For fast retrieval, binary codes are generated by a hash-learning component in the framework. For fast and effective learning of the hash codes, the proposed framework is trained in two stages: the first stage learns the video dynamics, and the second stage learns a compact code from the temporal variations captured in the first stage. The proposed method is evaluated on the UCF101 dataset and compared against other hashing methods. The results show that our approach improves retrieval performance over existing methods.
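
The two-stage design is described only in prose above. The sketch below shows one plausible instantiation: a per-frame CNN feeding an LSTM models the video dynamics (stage one), then the encoder is frozen and a hashing head is trained on top, with a tanh relaxation binarized by sign at retrieval time. All module names, sizes, and the freezing scheme are assumptions for illustration, not the authors' exact architecture.

```python
# Hypothetical sketch of the two-stage idea: stage 1 trains a CNN+LSTM
# to model video dynamics; stage 2 freezes it and learns a compact
# binary hash code on top. Sizes and details are illustrative
# assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Per-frame CNN followed by an LSTM over time (stage 1)."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32, feat_dim, batch_first=True)

    def forward(self, video):             # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        frame_feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.lstm(frame_feats)
        return h[-1]                      # (B, feat_dim) temporal summary

class HashHead(nn.Module):
    """Stage 2: map temporal features to relaxed codes in (-1, 1)."""
    def __init__(self, feat_dim: int = 128, bits: int = 48):
        super().__init__()
        self.fc = nn.Linear(feat_dim, bits)

    def forward(self, feats):
        return torch.tanh(self.fc(feats))   # tanh relaxation of binary bits

encoder, head = VideoEncoder(), HashHead()
video = torch.randn(4, 16, 3, 64, 64)        # stand-in batch of 16-frame videos

# Stage 1 would train `encoder` on its own objective; stage 2 freezes it
# and trains only the hash head.
for p in encoder.parameters():
    p.requires_grad_(False)

codes = head(encoder(video))                 # (4, 48), values in (-1, 1)
binary = torch.sign(codes)                   # final compact binary code
# Retrieval: rank database videos by Hamming distance between codes.
hamming = (binary[0] != binary).sum(dim=1)   # distances from the first video
```

The tanh relaxation keeps the code differentiable during training, while the sign step at indexing time yields bit vectors whose Hamming distances can be compared with fast bitwise operations.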


2018 ◽ Vol 17 (1) ◽ pp. 57-72 ◽ Author(s): Damiano Malafronte, Ernesto De Vito, Francesca Odone
