Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering

2020, Vol 34 (07), pp. 11101-11108
Author(s): Jianwen Jiang, Ziqiang Chen, Haojie Lin, Xibin Zhao, Yue Gao

Understanding questions and finding clues for answers are key to video question answering. Compared with image question answering, video question answering (Video QA) requires finding clues accurately in both the spatial and the temporal dimension simultaneously, and is thus more challenging. However, the relationship between spatio-temporal information and the question has not been well exploited in most existing Video QA methods. To tackle this problem, we propose a Question-Guided Spatio-Temporal Contextual Attention Network (QueST). In QueST, we divide the semantic features generated from the question into two separate parts, a spatial part and a temporal part, which respectively guide the construction of contextual attention in the spatial and temporal dimensions. Under the guidance of the corresponding contextual attention, visual features can be better exploited in both dimensions. To evaluate the effectiveness of the proposed method, experiments are conducted on the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets. Experimental results and comparisons with state-of-the-art methods show that our method achieves superior performance.
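To make the division of question semantics concrete, the following is a minimal sketch (in PyTorch) of how a pooled question embedding could be split into spatial and temporal guides that attend over region-level and frame-level video features; all layer sizes, tensor shapes, and module names are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    def __init__(self, q_dim: int, v_dim: int, hid: int = 256):
        super().__init__()
        # Split the question representation into a "spatial" and a "temporal" guide.
        self.q_spatial = nn.Linear(q_dim, hid)
        self.q_temporal = nn.Linear(q_dim, hid)
        self.v_proj = nn.Linear(v_dim, hid)

    def forward(self, q: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # q:     (B, q_dim)        pooled question embedding
        # video: (B, T, R, v_dim)  T frames, R regions per frame
        v = self.v_proj(video)                                   # (B, T, R, hid)
        # Spatial attention: score each region against the spatial question part.
        s_scores = torch.einsum('bh,btrh->btr', self.q_spatial(q), v)
        s_att = torch.softmax(s_scores, dim=-1).unsqueeze(-1)    # (B, T, R, 1)
        frame_feats = (s_att * v).sum(dim=2)                     # (B, T, hid)
        # Temporal attention: score each frame against the temporal question part.
        t_scores = torch.einsum('bh,bth->bt', self.q_temporal(q), frame_feats)
        t_att = torch.softmax(t_scores, dim=-1).unsqueeze(-1)    # (B, T, 1)
        return (t_att * frame_feats).sum(dim=1)                  # (B, hid) fused clue vector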

Author(s): Jie Lei, Licheng Yu, Tamara Berg, Mohit Bansal

Author(s): Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, Yueting Zhuang

Open-ended video question answering is a challenging problem in visual information retrieval: given a question, the system automatically generates a natural language answer from the referenced video content. However, existing visual question answering work focuses on static images and may be ineffective when applied directly to video question answering because of the temporal dynamics of video content. In this paper, we consider open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network that learns a joint representation of the dynamic video contents according to the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset. Extensive experiments show the effectiveness of our method.
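As a hedged sketch of the decoder side of such an encoder-decoder framework, the snippet below shows a GRU-based answer decoder conditioned on a question-aware video context vector; the class name, dimensions, and teacher-forcing loop are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn as nn

class AnswerDecoder(nn.Module):
    def __init__(self, vocab_size: int, ctx_dim: int, emb: int = 300, hid: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.gru = nn.GRUCell(emb + ctx_dim, hid)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, ctx: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # ctx:    (B, ctx_dim)  question-conditioned video representation from the encoder
        # tokens: (B, L)        ground-truth answer tokens (teacher forcing)
        h = ctx.new_zeros(tokens.size(0), self.gru.hidden_size)
        logits = []
        for t in range(tokens.size(1)):
            x = torch.cat([self.embed(tokens[:, t]), ctx], dim=-1)
            h = self.gru(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)   # (B, L, vocab_size) word scores per step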


Information, 2021, Vol 12 (3), pp. 136
Author(s): Shuang Liu, Nannan Tan, Yaqian Ge, Niko Lukač

Question answering over knowledge graphs is an extremely challenging task in natural language processing. Most existing Chinese Knowledge Base Question Answering (KBQA) systems can only return knowledge stored in the knowledge base through extractive methods. However, such processing does not conform to natural reading habits and cannot solve the out-of-vocabulary (OOV) problem. In this paper, a new generative question answering method based on a knowledge graph is proposed, consisting of three parts: knowledge vocabulary construction, data pre-processing, and answer generation. For vocabulary construction, BiLSTM-CRF is used to identify entities in the source text; the triples containing those entities are retrieved, word frequencies are counted, and the vocabulary is built from them. In the data pre-processing part, the pre-trained language model BERT, combined with word-frequency semantic features, is adopted to obtain word vectors. In the answer generation part, the vocabulary constructed from the knowledge graph is combined with a pointer-generator network (PGN) that points to the corresponding entities when generating answers. Experimental results show that the proposed method achieves superior performance on the WebQA dataset compared with other methods.
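The pointer-generator idea of mixing generation over a fixed vocabulary with copying of knowledge-graph entities can be sketched roughly as follows (PyTorch); the head design, entity encoding, and id layout of the extended vocabulary are assumptions made for illustration only.

import torch
import torch.nn as nn

class PointerGeneratorHead(nn.Module):
    def __init__(self, hid: int, vocab_size: int):
        super().__init__()
        # vocab_size is assumed to cover the extended vocabulary, entities included.
        self.gen = nn.Linear(hid, vocab_size)
        self.p_gen = nn.Linear(hid, 1)   # gate between generating and copying

    def forward(self, dec_state: torch.Tensor, ent_states: torch.Tensor,
                ent_ids: torch.Tensor) -> torch.Tensor:
        # dec_state:  (B, hid)     decoder hidden state at the current step
        # ent_states: (B, E, hid)  encoded candidate entities from the knowledge graph
        # ent_ids:    (B, E)       their int64 ids in the extended vocabulary
        gen_dist = torch.softmax(self.gen(dec_state), dim=-1)            # (B, V)
        copy_att = torch.softmax(
            torch.einsum('bh,beh->be', dec_state, ent_states), dim=-1)   # (B, E)
        p = torch.sigmoid(self.p_gen(dec_state))                         # (B, 1)
        final = p * gen_dist                                             # generation part
        # Scatter-add the copy probabilities onto the entity ids (covers OOV entities).
        final = final.scatter_add(1, ent_ids, (1 - p) * copy_att)
        return final                                                     # (B, V) mixed distribution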


Author(s): Zhou Zhao, Xinghua Jiang, Deng Cai, Jun Xiao, Xiaofei He, ...

Conversational video question answering is a challenging task in visual information retrieval: it generates an accurate answer from the referenced video contents according to the visual conversation context and the given question. However, existing visual question answering methods mainly tackle single-turn video question answering and may be ineffective when applied directly to multi-turn video question answering, because they do not sufficiently model the sequential conversation context. In this paper, we study multi-turn video question answering from the viewpoint of multi-step hierarchical attention context network learning. We first propose a hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential structure of the conversation context. We then develop a multi-stream spatio-temporal attention network for learning the joint representation of the dynamic video contents and the context-aware question embedding. We next devise a hierarchical attention context network learning method with a multi-step reasoning process for multi-turn video question answering. We construct two large-scale multi-turn video question answering datasets. Extensive experiments show the effectiveness of our method.
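A minimal sketch of context-aware question understanding, in the spirit of attending over previous conversation turns, might look like the following; the module name, the GRU turn encoder, and the additive fusion are assumptions, not the paper's hierarchical network.

import torch
import torch.nn as nn

class ContextAwareQuestion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.turn_gru = nn.GRU(dim, dim, batch_first=True)   # encodes the sequence of past turns
        self.attn = nn.Linear(dim, dim)

    def forward(self, question: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # question: (B, dim)     embedding of the current question
        # history:  (B, K, dim)  embeddings of the K previous QA turns
        ctx, _ = self.turn_gru(history)                                  # (B, K, dim)
        scores = torch.einsum('bd,bkd->bk', self.attn(question), ctx)    # attend over past turns
        att = torch.softmax(scores, dim=-1).unsqueeze(-1)                # (B, K, 1)
        context = (att * ctx).sum(dim=1)                                 # (B, dim)
        return question + context   # context-aware question representation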


2019, Vol 127 (10), pp. 1385-1412
Author(s): Yunseok Jang, Yale Song, Chris Dongjoo Kim, Youngjae Yu, Youngjin Kim, ...

Author(s): Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran

Video Question Answering (Video QA) is a powerful testbed for developing new AI capabilities. The task requires learning to reason about objects, relations, and events across the visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior, and their interactions. Toward this goal, we propose an object-oriented reasoning approach in which a video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of the video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture, called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks. This neural model maintains consistent object lifelines in the form of a hierarchically nested spatio-temporal graph. Within this graph, dynamic interactive object-oriented representations are built up along the video sequence, hierarchically abstracted in a bottom-up manner, and converge toward the key information for the correct answer. The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-art results on these tasks. Analysis of the model's behavior indicates that object-oriented reasoning is a reliable, interpretable, and efficient approach to Video QA.
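A rough, assumption-laden sketch of one query-conditioned object-interaction step, loosely in the spirit of object-oriented reasoning over a set of object features, is shown below; the unit design and dimensions are illustrative and are not the HOSTR implementation.

import torch
import torch.nn as nn

class ObjectInteractionUnit(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.msg = nn.Linear(2 * dim, dim)

    def forward(self, objects: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # objects: (B, N, dim)  object features within one clip
        # query:   (B, dim)     question embedding
        # Pairwise interaction scores, modulated by the query.
        k = objects * self.q_proj(query).unsqueeze(1)                    # (B, N, dim)
        adj = torch.softmax(torch.bmm(k, objects.transpose(1, 2)), -1)   # (B, N, N)
        neighbors = torch.bmm(adj, objects)                              # aggregate messages
        # Residual update of each object with its query-relevant neighborhood.
        return objects + torch.relu(self.msg(torch.cat([objects, neighbors], -1)))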


2019
Author(s): Hongyin Luo, Mitra Mohtarami, James Glass, Karthik Krishnamurthy, Brigitte Richardson

Sensors, 2021, Vol 21 (9), pp. 3099
Author(s): V. Javier Traver, Judith Zorío, Luis A. Leiva

Temporal salience considers how visual attention varies over time. Although visual salience has been widely studied from a spatial perspective, its temporal dimension has been mostly ignored, despite arguably being of utmost importance for understanding how attention evolves over time on dynamic contents. To address this gap, we propose Glimpse, a novel measure that computes temporal salience from the observer-spatio-temporal consistency of raw gaze data. The measure is conceptually simple, training-free, and provides a semantically meaningful quantification of visual attention over time. As an extension, we also explore scoring algorithms that estimate temporal salience from spatial salience maps predicted by existing computational models; however, these approaches generally fall short of the proposed gaze-based measure. Glimpse could serve as the basis for several downstream tasks such as video segmentation or summarization. Glimpse's software and data are publicly available.
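As an illustrative proxy (not the exact Glimpse formulation), a per-frame gaze-consistency score can be computed from the spatial spread of observers' gaze points: the lower the spread, the higher the temporal salience. The array shapes, the inverse-spread score, and the normalization below are assumptions.

import numpy as np

def temporal_salience(gaze: np.ndarray) -> np.ndarray:
    """gaze: array of shape (T, O, 2) with (x, y) fixations of O observers over T frames."""
    centroid = gaze.mean(axis=1, keepdims=True)                        # (T, 1, 2) mean gaze per frame
    spread = np.linalg.norm(gaze - centroid, axis=-1).mean(axis=1)     # (T,) mean distance to centroid
    salience = 1.0 / (1.0 + spread)                                    # consistent gaze -> score near 1
    # Normalize to [0, 1] for comparability across videos.
    return (salience - salience.min()) / (salience.max() - salience.min() + 1e-8)

# Example: 100 frames, 12 observers, random gaze positions in a 1920x1080 frame
scores = temporal_salience(np.random.rand(100, 12, 2) * np.array([1920.0, 1080.0]))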

