Spatio-temporal salient feature extraction for perceptual content based video retrieval

Author(s):  
Sameh Megrhi ◽  
Wided Souidene ◽  
Azeddine Beghdadi

Videos are recorded and uploaded daily to sites such as YouTube and Facebook from devices such as mobile phones and digital cameras, often with little or no associated metadata (semantic tags). This makes it extremely difficult to retrieve similar videos from metadata alone, without content-based semantic search. Content-based video retrieval is the problem of retrieving the videos most similar to a given query video, and it has a wide range of applications, such as video browsing, content filtering, and video indexing. Traditional video-level features are built from hand-engineered key-frame-level features, which do not exploit the rich dynamics present in a video. In this paper we propose a fast content-based video retrieval framework using compact spatio-temporal features learned by deep learning. Specifically, a deep CNN combined with an LSTM is deployed to learn spatio-temporal representations of a video. For fast retrieval, binary codes are generated by a hash-learning component of the framework. For fast and effective learning of the hash codes, the proposed framework is trained in two stages: the first stage learns the video dynamics, and the second stage learns a compact code from the temporal variations learned in the first stage. The proposed method is evaluated on the UCF101 dataset and compared with other hashing methods. The results show that our approach improves on the performance of existing methods.
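The retrieval stage described above can be sketched in miniature. The abstract does not give the learned hashing function, so this sketch substitutes random-hyperplane hashing (an LSH-style stand-in) to binarize feature vectors that stand in for the CNN+LSTM spatio-temporal descriptors; all names and dimensions are illustrative assumptions, not the paper's method.

```python
import random

random.seed(0)

def random_hyperplanes(dim, n_bits):
    # LSH-style random projections: an illustrative stand-in for the
    # learned hashing component described in the paper (assumption).
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(feature, planes):
    # Binarize a real-valued descriptor: one bit per hyperplane,
    # set by the sign of the projection.
    return tuple(1 if sum(f * w for f, w in zip(feature, p)) >= 0 else 0
                 for p in planes)

def hamming(a, b):
    # Hamming distance between two binary codes.
    return sum(x != y for x, y in zip(a, b))

# Toy "video descriptors" standing in for CNN+LSTM outputs (assumption).
dim, n_bits = 16, 8
planes = random_hyperplanes(dim, n_bits)
db = {name: [random.gauss(0, 1) for _ in range(dim)]
      for name in ["v1", "v2", "v3"]}
codes = {name: hash_code(feat, planes) for name, feat in db.items()}

# Query: a slightly perturbed copy of v1; its binary code should be
# nearest to v1's in Hamming space, giving a fast bitwise lookup.
query = [f + random.gauss(0, 0.05) for f in db["v1"]]
qcode = hash_code(query, planes)
best = min(codes, key=lambda name: hamming(qcode, codes[name]))
```

Matching in Hamming space is what makes retrieval fast here: comparing short binary codes is far cheaper than comparing the full real-valued spatio-temporal descriptors.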


Technological advances have brought about a revolution in multimedia, and video recording has grown by leaps and bounds. Retrieving a video from a huge database with existing text-based search is cumbersome: it demands considerable human effort, and its retrieval efficiency is poor. In view of these challenges, retrieval based on video content prevails over conventional methods. Content here means the actual video information, i.e. video features. The performance of Content-Based Video Retrieval (CBVR) depends on feature extraction and the matching of similar features. Because feature selection in existing algorithms is not effective, retrieval takes longer and efficiency suffers. Combined color and motion features are proposed for feature extraction, and a Spatio-Temporal Scale-Invariant Feature Transform is used for shot boundary detection. Since the color feature characterizes the visual video content and the motion feature the temporal content, these two features are significant for effective video retrieval. The performance of the CBVR system has been evaluated on the TRECVID dataset, and the retrieved videos demonstrate the effectiveness of the proposed algorithm.
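The color-plus-motion descriptor described above can be illustrated with a minimal sketch: a coarse per-channel color histogram (visual content) concatenated with mean frame-difference energy (temporal content). This is a simplified stand-in under assumed bin counts and toy frames, not the paper's exact feature design.

```python
import math

def color_histogram(frame, bins=4):
    # Coarse per-channel color histogram: the visual-content feature
    # (simplified stand-in for the paper's color descriptor).
    hist = [0.0] * (bins * 3)
    n = len(frame)
    for (r, g, b) in frame:
        for c, v in enumerate((r, g, b)):
            hist[c * bins + min(v * bins // 256, bins - 1)] += 1.0 / n
    return hist

def motion_energy(prev, curr):
    # Mean absolute intensity difference between consecutive frames:
    # the temporal-content feature (stand-in for the motion descriptor).
    return sum(abs(sum(p) - sum(q)) / 3 for p, q in zip(prev, curr)) / len(curr)

def video_descriptor(frames):
    # Average color histogram concatenated with average motion energy.
    hists = [color_histogram(f) for f in frames]
    avg_hist = [sum(col) / len(hists) for col in zip(*hists)]
    motions = [motion_energy(a, b) for a, b in zip(frames, frames[1:])]
    return avg_hist + [sum(motions) / len(motions)]

def distance(d1, d2):
    # Euclidean distance used for similar-feature matching.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

# Toy 4-pixel "frames": a static red clip vs. a red clip with a
# bright pixel that moves one position per frame.
static = [[(200, 0, 0)] * 4] * 3
moving = [[(200, 0, 0)] * i + [(255, 255, 255)] + [(200, 0, 0)] * (3 - i)
          for i in range(3)]
d_static = video_descriptor(static)
d_moving = video_descriptor(moving)
```

The two clips share nearly identical color statistics, so the motion component is what separates them; this is why combining both features retrieves more relevant videos than color alone.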

