Object Priors for Classifying and Localizing Unseen Actions

Author(s):  
Pascal Mettes ◽  
William Thong ◽  
Cees G. M. Snoek

This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples. Where existing work relies on transferring global attribute or object information from seen to unseen action videos, we seek to classify and spatio-temporally localize unseen actions in videos from image-based object information only. We propose three spatial object priors, which encode local person and object detectors along with their spatial relations. On top we introduce three semantic object priors, which extend semantic matching through word embeddings with three simple functions that tackle semantic ambiguity, object discrimination, and object naming. A video embedding combines the spatial and semantic object priors. It enables us to introduce a new video retrieval task that retrieves action tubes in video collections based on user-specified objects, spatial relations, and object size. Experimental evaluation on five action datasets shows the importance of spatial and semantic object priors for unseen actions. We find that persons and objects have preferred spatial relations that benefit unseen action localization, while using multiple languages and simple object filtering directly improves semantic matching, leading to state-of-the-art results for both unseen action classification and localization.
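As a rough sketch of how semantic object priors might score an unseen action from image-based object evidence (an illustration only, not the authors' exact formulation): each object's detection confidence is weighted by the word-embedding similarity between the object name and the action name, and only the most related objects are kept as a simple stand-in for object filtering. The function names, the top-k rule, and the weighted sum below are assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def semantic_action_score(action_vec, object_vecs, object_scores, top_k=5):
    """Score one video for an unseen action using only object evidence.

    action_vec   : word embedding of the action name
    object_vecs  : {object_name: word embedding of the object name}
    object_scores: {object_name: mean detection confidence in the video}
    top_k        : keep only the objects semantically closest to the action
                   (a crude form of object discrimination/filtering)
    """
    sims = {name: cosine(action_vec, vec) for name, vec in object_vecs.items()}
    kept = sorted(sims, key=sims.get, reverse=True)[:top_k]
    # Aggregate: similarity-weighted sum of detection confidences.
    return sum(sims[name] * object_scores.get(name, 0.0) for name in kept)
```

Classifying an unseen video then amounts to computing this score for every candidate action name and picking the highest-scoring one.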

Author(s):  
Wenzhe Wang ◽  
Mengdan Zhang ◽  
Runnan Chen ◽  
Guanyu Cai ◽  
Penghao Zhou ◽  
...  

Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods exploit these cues by aggregating them into holistic high-level semantics and matching them with text representations in a global view. In contrast to this global alignment, the local alignment between the detailed semantics encoded in multi-modal cues and individual phrases remains poorly explored. In this paper, we therefore leverage hierarchical video-text alignment to fully exploit the detailed, diverse characteristics of multi-modal cues for fine-grained alignment with local phrase-level semantics, while also capturing high-level semantic correspondence. Specifically, multi-step attention is learned for progressively comprehensive local alignment, and a holistic transformer is used to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
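A minimal sketch of the two alignment levels described above, under assumed feature shapes and with a single attention step (the authors' multi-step attention and holistic transformer are not reproduced here): global alignment compares mean-pooled video and text features, while local alignment lets each phrase attend over the multi-modal cue features before comparison. The weighting `alpha` is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def global_score(cue_feats, phrase_feats):
    # Global alignment: cosine similarity between mean-pooled representations.
    v = F.normalize(cue_feats.mean(dim=0), dim=-1)      # (d,)
    t = F.normalize(phrase_feats.mean(dim=0), dim=-1)   # (d,)
    return (v * t).sum()

def local_score(cue_feats, phrase_feats):
    # Local alignment: each phrase attends over the multi-modal cue features,
    # then is compared with its attended (aligned) cue summary.
    d = cue_feats.size(-1)
    attn = torch.softmax(phrase_feats @ cue_feats.t() / d ** 0.5, dim=-1)
    aligned = attn @ cue_feats                           # (n_phrases, d)
    return F.cosine_similarity(aligned, phrase_feats, dim=-1).mean()

def hierarchical_score(cue_feats, phrase_feats, alpha=0.5):
    # cue_feats:    (n_cues, d) features from appearance, motion, audio, ...
    # phrase_feats: (n_phrases, d) features of the parsed text phrases
    return alpha * global_score(cue_feats, phrase_feats) + \
           (1 - alpha) * local_score(cue_feats, phrase_feats)
```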


2020 ◽  
Vol 34 (07) ◽  
pp. 12524-12531
Author(s):  
Ruicong Xu ◽  
Li Niu ◽  
Jianfu Zhang ◽  
Liqing Zhang

The activity image-to-video retrieval task aims to retrieve videos containing an activity similar to that of the query image, which is challenging because videos generally contain many background segments irrelevant to the activity. In this paper, we use the R-C3D model to represent a video as a bag of activity proposals, which filters out background segments to some extent. However, noisy proposals remain in each bag. We therefore propose an Activity Proposal-based Image-to-Video Retrieval (APIVR) approach, which incorporates multi-instance learning into a cross-modal retrieval framework to address the proposal noise issue. Specifically, we propose a Graph Multi-Instance Learning (GMIL) module with a graph convolutional layer, and integrate this module with classification loss, adversarial loss, and triplet loss in our cross-modal retrieval framework. Moreover, we propose a geometry-aware triplet loss based on point-to-subspace distance to preserve the structural information of activity proposals. Extensive experiments on three widely used datasets verify the effectiveness of our approach.
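The geometry-aware triplet loss can be sketched as below, assuming "point-to-subspace distance" means the distance from a query feature to its orthogonal projection onto the subspace spanned by a bag of proposal features; the QR-based projection and the margin value are illustrative choices, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def point_to_subspace_distance(query, proposals):
    """Distance from a query feature (d,) to the subspace spanned by a bag
    of activity-proposal features (k, d)."""
    q, _ = torch.linalg.qr(proposals.t())   # (d, k) orthonormal basis of the bag
    projection = q @ (q.t() @ query)        # project the query onto the subspace
    return torch.norm(query - projection)

def geometry_aware_triplet_loss(query, pos_bag, neg_bag, margin=0.2):
    # Pull the query toward the proposal subspace of the matching video
    # and push it away from the proposal subspace of a non-matching video.
    d_pos = point_to_subspace_distance(query, pos_bag)
    d_neg = point_to_subspace_distance(query, neg_bag)
    return F.relu(d_pos - d_neg + margin)
```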


Author(s):  
Zein Al Abidin Ibrahim ◽  
Siba Haidar ◽  
Ihab Sbeity

The production of video content has increased and expanded dramatically, creating a need for accurate video classification. In our work, we use deep learning as a means to accelerate the video retrieval task by classifying videos into categories. We classify a video based on the text extracted from it. We trained our model using fastText, a library for efficient text classification and representation learning, and tested it on 15,000 videos. Experimental results show that our approach is efficient and performs well. Our technique scales to very large datasets and produces a model that can classify any video into a specific category very quickly.
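Since the classifier is trained with fastText's supervised mode, a minimal usage sketch looks like the following; the file names, hyperparameters, and example category are hypothetical, not the authors' settings.

```python
import fasttext

# Training file: one example per line, "__label__<category> <text extracted from the video>"
# e.g. "__label__cooking the chef chops the onions and stirs the pan"
model = fasttext.train_supervised(
    input="video_text_train.txt",   # hypothetical path to the extracted-text corpus
    epoch=25,                       # illustrative hyperparameters
    lr=0.5,
    wordNgrams=2,
)

labels, probs = model.predict("a goalkeeper dives to save the penalty kick")
print(labels[0], probs[0])          # predicted category and its confidence
model.save_model("video_classifier.bin")
```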


2009 ◽  
Vol 34 (10) ◽  
pp. 1243-1249
Author(s):  
Hua-Bei LI ◽  
Wei-Ming HU ◽  
Guan LUO

Author(s):  
G. M. Cohen ◽  
J. S. Grasso ◽  
M. L. Domeier ◽  
P. T. Mangonon

Any explanation of vestibular micromechanics must include the roles of the otolithic and cupular membranes. However, micromechanical models of vestibular function have been hampered by unresolved questions about the microarchitectures of these membranes and their connections to stereocilia and supporting cells. Otolithic membranes are notoriously difficult to preserve because of severe shrinkage and loss of soluble components. We have empirically developed fixation procedures that reduce shrinkage artifacts and more accurately depict the spatial relations between the otolithic membranes and the ciliary bundles and supporting cells.

We used White Leghorn chicks, ranging in age from newly hatched to one week. The inner ears were fixed for 3-24 h in 1.5-1.75% glutaraldehyde in 150 mM KCl, buffered with potassium phosphate, pH 7.3; when postfixation was performed, it was for 30 min in 1% OsO4, alone or mixed with 1% K4Fe(CN)6. The otolithic organs (saccule, utricle, lagenar macula) were embedded in Araldite 502. Semithin sections (1 μm) were stained with toluidine blue.


Author(s):  
Raksha Anand ◽  
John Hart ◽  
Patricia S. Moore ◽  
Sandra B. Chapman

Purpose: Frontotemporal lobar degeneration (FTLD) encompasses a group of neurodegenerative disorders characterized by gradual and progressive decline in behavior and/or language. Identifying the subtypes of FTLD can be challenging with traditional assessment tools. Growing empirical evidence suggests that language measures might be useful in differentiating FTLD subtypes. Method: In this paper, we examined the performance of five individuals with FTLD (two with frontotemporal dementia, two with semantic dementia, and one with progressive nonfluent aphasia) and 10 cognitively normal older adults on measures of semantic binding (Semantic Object Retrieval Test and semantic problem solving) and abstracted meaning (generation of interpretive statement and proverb interpretation). Results and Conclusion: A differential profile of impairment was observed in the three FTLD subtypes on these four measures. Further examination of these measures in larger groups will establish their clinical utility in differentiating the FTLD subtypes.

