Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network

Author(s):  
Zhou Zhao ◽  
Xinghua Jiang ◽  
Deng Cai ◽  
Jun Xiao ◽  
Xiaofei He ◽  
...  

Conversational video question answering is a challenging task in visual information retrieval: it generates an accurate answer from the referenced video content according to the visual conversation context and the given question. However, existing visual question answering methods mainly tackle single-turn video question answering and transfer poorly to the multi-turn setting, because they do not sufficiently model the sequential conversation context. In this paper, we study multi-turn video question answering from the viewpoint of multi-step hierarchical attention context network learning. We first propose a hierarchical attention context network for context-aware question understanding, which models the hierarchically sequential structure of the conversation context. We then develop a multi-stream spatio-temporal attention network that learns a joint representation of the dynamic video contents and the context-aware question embedding. We next devise a hierarchical attention context network learning method with a multi-step reasoning process for multi-turn video question answering. We construct two large-scale multi-turn video question answering datasets, and extensive experiments show the effectiveness of our method.
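The two-level context attention described in this abstract can be sketched with plain dot-product attention: attend over the words of each past turn, then over the resulting turn vectors, and fuse the attended context with the question. This is a minimal sketch under stated assumptions (additive fusion, single-head dot-product attention, toy dimensions), not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: a weighted sum of `keys`, weighted by
    each key's similarity to `query`."""
    weights = softmax(keys @ query)          # (n,)
    return weights @ keys                    # (d,)

def context_aware_question(question, turns):
    """Hierarchical attention over conversation history: word-level
    attention within each past turn, then turn-level attention over the
    resulting turn vectors, fused with the current question embedding."""
    turn_vecs = np.stack([attend(question, t) for t in turns])  # (T, d)
    context = attend(question, turn_vecs)                       # (d,)
    return question + context   # additive fusion (an assumption)

rng = np.random.default_rng(0)
question = rng.standard_normal(8)                     # current question, d=8
history = [rng.standard_normal((5, 8)) for _ in range(3)]  # 3 turns x 5 words
emb = context_aware_question(question, history)
```

The context-aware embedding `emb` would then feed the video-side attention network in place of the raw question embedding.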

Author(s):  
Zhou Zhao ◽  
Qifan Yang ◽  
Deng Cai ◽  
Xiaofei He ◽  
Yueting Zhuang

Open-ended video question answering is a challenging problem in visual information retrieval: it automatically generates a natural language answer from the referenced video content according to the question. However, existing visual question answering works focus on static images and transfer poorly to video question answering because of the temporal dynamics of video content. In this paper, we consider open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video contents according to the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset, and extensive experiments show the effectiveness of our method.
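The hierarchical spatio-temporal attention this abstract describes can be reduced to two stacked attention steps: spatial attention over regions within each frame, then temporal attention over the resulting frame vectors, both conditioned on the question. The sketch below makes illustrative assumptions (dot-product scoring, shared dimensionality, no reasoning RNN or decoder) and is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(video, question):
    """video: (F, R, d) region features for F frames with R regions each;
    question: (d,) question embedding.

    Spatial step: within each frame, weight regions by relevance to the
    question. Temporal step: weight the resulting frame vectors the same
    way, yielding one question-conditioned video representation."""
    s_weights = softmax(video @ question, axis=1)            # (F, R)
    frame_vecs = np.einsum('fr,frd->fd', s_weights, video)   # (F, d)
    t_weights = softmax(frame_vecs @ question, axis=0)       # (F,)
    return t_weights @ frame_vecs                            # (d,)

rng = np.random.default_rng(1)
video = rng.standard_normal((6, 4, 8))    # 6 frames, 4 regions, d=8
question = rng.standard_normal(8)
joint = spatio_temporal_attention(video, question)
```

In the encoder-decoder framing, `joint` would initialise the decoder that generates the open-ended answer token by token.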


Author(s):  
Zhou Zhao ◽  
Zhu Zhang ◽  
Shuwen Xiao ◽  
Zhou Yu ◽  
Jun Yu ◽  
...  

Open-ended long-form video question answering is a challenging problem in visual information retrieval: it automatically generates a natural language answer from the referenced long-form video content according to the question. However, existing video question answering works mainly focus on short-form videos and lack a semantic representation of long-form video contents. In this paper, we consider long-form video question answering from the viewpoint of adaptive hierarchical reinforced encoder-decoder network learning. We propose an adaptive hierarchical encoder network that learns the joint representation of the long-form video contents according to the question, using adaptive video segmentation. We then develop a reinforced decoder network to generate the natural language answer for open-ended video question answering. We construct a large-scale long-form video question answering dataset, and extensive experiments show the effectiveness of our method.
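The adaptive segmentation idea can be illustrated with a greedy boundary rule: start a new segment whenever consecutive frame features stop resembling each other, then encode hierarchically (pool frames into segments, segments into a video vector). The cosine threshold and mean pooling below are hypothetical stand-ins for the paper's learned segmentation and encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def adaptive_segments(frames, threshold=0.5):
    """Greedy boundary detection: open a new segment whenever the cosine
    similarity between consecutive frame features falls below `threshold`."""
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if cosine(prev, cur) < threshold:
            segments.append(np.stack(current))
            current = []
        current.append(cur)
    segments.append(np.stack(current))
    return segments

def hierarchical_encode(frames, threshold=0.5):
    """Mean-pool frames within each adaptive segment, then mean-pool the
    segment vectors into one long-form video representation."""
    segs = adaptive_segments(frames, threshold)
    seg_vecs = np.stack([s.mean(axis=0) for s in segs])
    return seg_vecs.mean(axis=0), len(segs)

# two visually distinct "shots": orthogonal feature directions
shot_a = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (4, 1))
shot_b = np.tile(np.array([0.0, 1.0, 0.0, 0.0]), (4, 1))
frames = np.concatenate([shot_a, shot_b])
vec, n_segments = hierarchical_encode(frames, threshold=0.5)
```

On this toy input the similarity drops to zero exactly at the shot change, so the rule recovers the two shots as two segments.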


Author(s):  
Jie Lei ◽  
Licheng Yu ◽  
Tamara Berg ◽  
Mohit Bansal

2019 ◽  
Vol 8 (2) ◽  
pp. 54 ◽  
Author(s):  
Luis Rodríguez-Pupo ◽  
Carlos Granell ◽  
Sven Casteleyn

In large-scale context-aware applications, a central design concern is capturing, managing and acting upon location and context data. The ability to understand the collected data and define meaningful contextual events, based on one or more incoming (contextual) data streams, for both single and multiple users, is critical for applications to exhibit location- and context-aware behaviour. In this article, we describe a context-aware, data-intensive metrics platform (focusing primarily on its geospatial support) that allows exactly this: to define and execute metrics which capture meaningful spatio-temporal and contextual events relevant for the application realm. The platform (1) supports metrics definition and execution; (2) provides facilities for real-time, in-application actions upon metrics execution results; and (3) allows post-hoc analysis and visualisation of collected data and results. It thereby offers contextual and geospatial data management and analytics as a service, and allows context-aware application developers to focus on their core application logic. We explain the core platform and its ecosystem of supporting applications and tools, elaborate on the most important conceptual features, and discuss the implementation, realised through a distributed, microservice-based cloud architecture. Finally, we highlight possible application fields and present a real-world case study in the realm of psychological health.
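A concrete example of the kind of spatio-temporal metric such a platform executes is geofence dwell time over an incoming stream of location fixes. The metric below (haversine distance plus a fence test on consecutive fixes) is a hypothetical illustration of the concept, not an API of the described platform.

```python
import math
from dataclasses import dataclass

@dataclass
class Fix:
    t: float      # unix timestamp, seconds
    lat: float    # degrees
    lon: float    # degrees

def haversine_m(a, b):
    """Great-circle distance in metres between two fixes."""
    R = 6371000.0
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dp, dl = p2 - p1, math.radians(b.lon - a.lon)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(h))

def dwell_time_s(stream, centre, radius_m):
    """A minimal geofence metric: total seconds spent within `radius_m`
    of `centre`, summed over consecutive fixes that both lie inside."""
    total = 0.0
    for prev, cur in zip(stream, stream[1:]):
        if (haversine_m(prev, centre) <= radius_m
                and haversine_m(cur, centre) <= radius_m):
            total += cur.t - prev.t
    return total

centre = Fix(0.0, 0.0, 0.0)
stream = [Fix(0.0, 0.0, 0.0),        # inside the fence
          Fix(60.0, 0.0, 0.0001),    # ~11 m away, still inside
          Fix(120.0, 0.0, 1.0)]      # ~111 km away, outside
dwell = dwell_time_s(stream, centre, radius_m=50.0)
```

In the platform's terms, such a metric would run server-side against the user's context stream and trigger an in-application action when the dwell time crosses a threshold.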


2020 ◽  
Vol 34 (07) ◽  
pp. 11101-11108
Author(s):  
Jianwen Jiang ◽  
Ziqiang Chen ◽  
Haojie Lin ◽  
Xibin Zhao ◽  
Yue Gao

Understanding questions and finding clues for answers are key to video question answering. Compared with image question answering, video question answering (Video QA) requires finding clues accurately in both the spatial and temporal dimensions simultaneously, and is thus more challenging. However, the relationship between spatio-temporal information and the question is still not well exploited by most existing Video QA methods. To tackle this problem, we propose a Question-Guided Spatio-Temporal Contextual Attention Network (QueST). In QueST, we divide the semantic features generated from the question into two separate parts, a spatial part and a temporal part, which respectively guide the construction of contextual attention in the spatial and temporal dimensions. Under the guidance of the corresponding contextual attention, visual features can be better exploited in both dimensions. To evaluate the proposed method, we conduct experiments on the TGIF-QA, MSRVTT-QA and MSVD-QA datasets. Experimental results and comparisons with state-of-the-art methods show that our method achieves superior performance.
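The distinctive step in this abstract is the split of the question feature into a spatial part and a temporal part, each guiding attention in its own dimension. The sketch below shows only that split with toy dot-product attention; the halving of the feature vector and all dimensions are simplifying assumptions, not QueST's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(video, q_spatial, q_temporal):
    """video: (F, R, d) region features. The spatial question part scores
    regions within each frame; the temporal part scores the resulting
    frame vectors, so each dimension has its own question-derived guide."""
    s_w = softmax(video @ q_spatial, axis=1)          # (F, R) region weights
    frame_vecs = np.einsum('fr,frd->fd', s_w, video)  # (F, d)
    t_w = softmax(frame_vecs @ q_temporal, axis=0)    # (F,) frame weights
    return t_w @ frame_vecs                           # (d,)

rng = np.random.default_rng(2)
video = rng.standard_normal((5, 3, 8))     # 5 frames, 3 regions, d=8
q = rng.standard_normal(16)                # question feature, split in two
q_spatial, q_temporal = q[:8], q[8:]
fused = question_guided_attention(video, q_spatial, q_temporal)
```

Keeping the two guides separate lets a "where"-heavy question sharpen the region weights without disturbing the frame weights, and vice versa for "when"-heavy questions.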


2020 ◽  
Vol 57 (4) ◽  
pp. 102265
Author(s):  
Yunan Ye ◽  
Shifeng Zhang ◽  
Yimeng Li ◽  
Xufeng Qian ◽  
Siliang Tang ◽  
...  

2019 ◽  
Vol 127 (10) ◽  
pp. 1385-1412
Author(s):  
Yunseok Jang ◽  
Yale Song ◽  
Chris Dongjoo Kim ◽  
Youngjae Yu ◽  
Youngjin Kim ◽  
...  
