Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network

Author(s):  
Zhou Zhao ◽  
Xinghua Jiang ◽  
Deng Cai ◽  
Jun Xiao ◽  
Xiaofei He ◽  
...  

Conversational video question answering is a challenging task in visual information retrieval: it generates an accurate answer from the referenced video content according to the visual conversation context and the given question. However, existing visual question answering methods mainly tackle single-turn video question answering and transfer poorly to the multi-turn setting, because they do not sufficiently model the sequential conversation context. In this paper, we study multi-turn video question answering from the viewpoint of multi-step hierarchical attention context network learning. We first propose a hierarchical attention context network for context-aware question understanding, which models the hierarchically sequential structure of the conversation context. We then develop a multi-stream spatio-temporal attention network that learns a joint representation of the dynamic video contents and the context-aware question embedding. We next devise a hierarchical attention context network learning method with a multi-step reasoning process for multi-turn video question answering. We construct two large-scale multi-turn video question answering datasets, and extensive experiments show the effectiveness of our method.
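The two-level context attention described in this abstract can be sketched with plain dot-product attention: attend over the words of each past turn, then over the resulting turn vectors, and fuse the attended context with the question. This is a minimal sketch under stated assumptions (additive fusion, single-head dot-product attention, toy dimensions), not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: a weighted sum of `keys`, weighted by
    each key's similarity to `query`."""
    weights = softmax(keys @ query)          # (n,)
    return weights @ keys                    # (d,)

def context_aware_question(question, turns):
    """Hierarchical attention over conversation history: word-level
    attention within each past turn, then turn-level attention over the
    resulting turn vectors, fused with the current question embedding."""
    turn_vecs = np.stack([attend(question, t) for t in turns])  # (T, d)
    context = attend(question, turn_vecs)                       # (d,)
    return question + context   # additive fusion (an assumption)

rng = np.random.default_rng(0)
question = rng.standard_normal(8)                     # current question, d=8
history = [rng.standard_normal((5, 8)) for _ in range(3)]  # 3 turns x 5 words
emb = context_aware_question(question, history)
```

The context-aware embedding `emb` would then feed the video-side attention network in place of the raw question embedding.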

Author(s):  
Zhou Zhao ◽  
Qifan Yang ◽  
Deng Cai ◽  
Xiaofei He ◽  
Yueting Zhuang

Open-ended video question answering is a challenging problem in visual information retrieval: it automatically generates a natural language answer from the referenced video content according to the question. However, existing visual question answering works focus on static images and transfer poorly to video question answering because of the temporal dynamics of video content. In this paper, we consider open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video contents according to the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset, and extensive experiments show the effectiveness of our method.
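The hierarchical spatio-temporal attention this abstract describes can be reduced to two stacked attention steps: spatial attention over regions within each frame, then temporal attention over the resulting frame vectors, both conditioned on the question. The sketch below makes illustrative assumptions (dot-product scoring, shared dimensionality, no reasoning RNN or decoder) and is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(video, question):
    """video: (F, R, d) region features for F frames with R regions each;
    question: (d,) question embedding.

    Spatial step: within each frame, weight regions by relevance to the
    question. Temporal step: weight the resulting frame vectors the same
    way, yielding one question-conditioned video representation."""
    s_weights = softmax(video @ question, axis=1)            # (F, R)
    frame_vecs = np.einsum('fr,frd->fd', s_weights, video)   # (F, d)
    t_weights = softmax(frame_vecs @ question, axis=0)       # (F,)
    return t_weights @ frame_vecs                            # (d,)

rng = np.random.default_rng(1)
video = rng.standard_normal((6, 4, 8))    # 6 frames, 4 regions, d=8
question = rng.standard_normal(8)
joint = spatio_temporal_attention(video, question)
```

In the encoder-decoder framing, `joint` would initialise the decoder that generates the open-ended answer token by token.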


Author(s):  
Zhou Zhao ◽  
Zhu Zhang ◽  
Shuwen Xiao ◽  
Zhou Yu ◽  
Jun Yu ◽  
...  

Open-ended long-form video question answering is a challenging problem in visual information retrieval: it automatically generates a natural language answer from the referenced long-form video content according to the question. However, existing video question answering works mainly focus on short-form videos and lack a semantic representation of long-form video contents. In this paper, we consider long-form video question answering from the viewpoint of adaptive hierarchical reinforced encoder-decoder network learning. We propose an adaptive hierarchical encoder network that learns the joint representation of the long-form video contents according to the question, using adaptive video segmentation. We then develop a reinforced decoder network to generate the natural language answer for open-ended video question answering. We construct a large-scale long-form video question answering dataset, and extensive experiments show the effectiveness of our method.
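The adaptive segmentation idea can be illustrated with a greedy boundary rule: start a new segment whenever consecutive frame features stop resembling each other, then encode hierarchically (pool frames into segments, segments into a video vector). The cosine threshold and mean pooling below are hypothetical stand-ins for the paper's learned segmentation and encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def adaptive_segments(frames, threshold=0.5):
    """Greedy boundary detection: open a new segment whenever the cosine
    similarity between consecutive frame features falls below `threshold`."""
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if cosine(prev, cur) < threshold:
            segments.append(np.stack(current))
            current = []
        current.append(cur)
    segments.append(np.stack(current))
    return segments

def hierarchical_encode(frames, threshold=0.5):
    """Mean-pool frames within each adaptive segment, then mean-pool the
    segment vectors into one long-form video representation."""
    segs = adaptive_segments(frames, threshold)
    seg_vecs = np.stack([s.mean(axis=0) for s in segs])
    return seg_vecs.mean(axis=0), len(segs)

# two visually distinct "shots": orthogonal feature directions
shot_a = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (4, 1))
shot_b = np.tile(np.array([0.0, 1.0, 0.0, 0.0]), (4, 1))
frames = np.concatenate([shot_a, shot_b])
vec, n_segments = hierarchical_encode(frames, threshold=0.5)
```

On this toy input the similarity drops to zero exactly at the shot change, so the rule recovers the two shots as two segments.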


Author(s):  
Jie Lei ◽  
Licheng Yu ◽  
Tamara Berg ◽  
Mohit Bansal

2019 ◽  
Vol 8 (2) ◽  
pp. 54 ◽  
Author(s):  
Luis Rodríguez-Pupo ◽  
Carlos Granell ◽  
Sven Casteleyn

In large-scale context-aware applications, a central design concern is capturing, managing and acting upon location and context data. The ability to understand the collected data and define meaningful contextual events, based on one or more incoming (contextual) data streams, for both single and multiple users, is critical for applications to exhibit location- and context-aware behaviour. In this article, we describe a context-aware, data-intensive metrics platform (focusing primarily on its geospatial support) that allows exactly this: to define and execute metrics which capture meaningful spatio-temporal and contextual events relevant for the application realm. The platform (1) supports metrics definition and execution; (2) provides facilities for real-time, in-application actions upon metrics execution results; and (3) allows post-hoc analysis and visualisation of collected data and results. It thereby offers contextual and geospatial data management and analytics as a service, and allows context-aware application developers to focus on their core application logic. We explain the core platform and its ecosystem of supporting applications and tools, elaborate on the most important conceptual features, and discuss the implementation, realised through a distributed, microservice-based cloud architecture. Finally, we highlight possible application fields and present a real-world case study in the realm of psychological health.
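A concrete example of the kind of spatio-temporal metric such a platform executes is geofence dwell time over an incoming stream of location fixes. The metric below (haversine distance plus a fence test on consecutive fixes) is a hypothetical illustration of the concept, not an API of the described platform.

```python
import math
from dataclasses import dataclass

@dataclass
class Fix:
    t: float      # unix timestamp, seconds
    lat: float    # degrees
    lon: float    # degrees

def haversine_m(a, b):
    """Great-circle distance in metres between two fixes."""
    R = 6371000.0
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dp, dl = p2 - p1, math.radians(b.lon - a.lon)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(h))

def dwell_time_s(stream, centre, radius_m):
    """A minimal geofence metric: total seconds spent within `radius_m`
    of `centre`, summed over consecutive fixes that both lie inside."""
    total = 0.0
    for prev, cur in zip(stream, stream[1:]):
        if (haversine_m(prev, centre) <= radius_m
                and haversine_m(cur, centre) <= radius_m):
            total += cur.t - prev.t
    return total

centre = Fix(0.0, 0.0, 0.0)
stream = [Fix(0.0, 0.0, 0.0),        # inside the fence
          Fix(60.0, 0.0, 0.0001),    # ~11 m away, still inside
          Fix(120.0, 0.0, 1.0)]      # ~111 km away, outside
dwell = dwell_time_s(stream, centre, radius_m=50.0)
```

In the platform's terms, such a metric would run server-side against the user's context stream and trigger an in-application action when the dwell time crosses a threshold.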


2020 ◽  
Vol 34 (07) ◽  
pp. 11101-11108
Author(s):  
Jianwen Jiang ◽  
Ziqiang Chen ◽  
Haojie Lin ◽  
Xibin Zhao ◽  
Yue Gao

Understanding questions and finding clues for answers are key to video question answering. Compared with image question answering, video question answering (Video QA) requires finding clues accurately in both the spatial and temporal dimensions simultaneously, and is thus more challenging. However, the relationship between spatio-temporal information and the question is still not well exploited by most existing Video QA methods. To tackle this problem, we propose a Question-Guided Spatio-Temporal Contextual Attention Network (QueST). In QueST, we divide the semantic features generated from the question into two separate parts, a spatial part and a temporal part, which respectively guide the construction of contextual attention in the spatial and temporal dimensions. Under the guidance of the corresponding contextual attention, visual features can be better exploited in both dimensions. To evaluate the proposed method, we conduct experiments on the TGIF-QA, MSRVTT-QA and MSVD-QA datasets. Experimental results and comparisons with state-of-the-art methods show that our method achieves superior performance.
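The distinctive step in this abstract is the split of the question feature into a spatial part and a temporal part, each guiding attention in its own dimension. The sketch below shows only that split with toy dot-product attention; the halving of the feature vector and all dimensions are simplifying assumptions, not QueST's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(video, q_spatial, q_temporal):
    """video: (F, R, d) region features. The spatial question part scores
    regions within each frame; the temporal part scores the resulting
    frame vectors, so each dimension has its own question-derived guide."""
    s_w = softmax(video @ q_spatial, axis=1)          # (F, R) region weights
    frame_vecs = np.einsum('fr,frd->fd', s_w, video)  # (F, d)
    t_w = softmax(frame_vecs @ q_temporal, axis=0)    # (F,) frame weights
    return t_w @ frame_vecs                           # (d,)

rng = np.random.default_rng(2)
video = rng.standard_normal((5, 3, 8))     # 5 frames, 3 regions, d=8
q = rng.standard_normal(16)                # question feature, split in two
q_spatial, q_temporal = q[:8], q[8:]
fused = question_guided_attention(video, q_spatial, q_temporal)
```

Keeping the two guides separate lets a "where"-heavy question sharpen the region weights without disturbing the frame weights, and vice versa for "when"-heavy questions.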


2020 ◽  
Vol 57 (4) ◽  
pp. 102265
Author(s):  
Yunan Ye ◽  
Shifeng Zhang ◽  
Yimeng Li ◽  
Xufeng Qian ◽  
Siliang Tang ◽  
...  

2019 ◽  
Vol 127 (10) ◽  
pp. 1385-1412
Author(s):  
Yunseok Jang ◽  
Yale Song ◽  
Chris Dongjoo Kim ◽  
Youngjae Yu ◽  
Youngjin Kim ◽  
...  
