Temporal Attention and Consistency Measuring for Video Question Answering

Author(s): Lingyu Zhang, Richard J. Radke

Author(s): Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, Yueting Zhuang

Open-ended video question answering is a challenging problem in visual information retrieval, in which a natural-language answer is automatically generated from the referenced video content according to the question. However, existing visual question answering work focuses on static images and may not transfer effectively to video question answering because of the temporal dynamics of video content. In this paper, we consider the problem of open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network that learns a joint representation of the dynamic video content conditioned on the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset, and extensive experiments demonstrate the effectiveness of our method.
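The abstract describes question-conditioned spatio-temporal attention over video frames. The following is a minimal PyTorch sketch of that general idea, not the authors' released implementation: all names and dimensions (e.g., `SpatioTemporalAttention`, `frame_dim`, `question_dim`, `num_regions`) are illustrative assumptions, and the reasoning recurrent decoder is omitted.

```python
# Hypothetical sketch of question-conditioned spatio-temporal attention.
# Spatial attention pools regions within each frame; temporal attention
# then pools frames into a single question-aware video representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalAttention(nn.Module):
    def __init__(self, frame_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        # Spatial attention: score each region of each frame against the question.
        self.spatial_proj = nn.Linear(frame_dim + question_dim, hidden_dim)
        self.spatial_score = nn.Linear(hidden_dim, 1)
        # Temporal attention: score each spatially pooled frame against the question.
        self.temporal_proj = nn.Linear(frame_dim + question_dim, hidden_dim)
        self.temporal_score = nn.Linear(hidden_dim, 1)

    def forward(self, video_feats, question_vec):
        # video_feats: (batch, num_frames, num_regions, frame_dim)
        # question_vec: (batch, question_dim)
        B, T, R, D = video_feats.shape

        # Spatial attention over regions within each frame.
        q_spatial = question_vec[:, None, None, :].expand(B, T, R, -1)
        spatial_logits = self.spatial_score(
            torch.tanh(self.spatial_proj(torch.cat([video_feats, q_spatial], dim=-1)))
        )                                                    # (B, T, R, 1)
        spatial_attn = F.softmax(spatial_logits, dim=2)
        frame_feats = (spatial_attn * video_feats).sum(dim=2)  # (B, T, D)

        # Temporal attention over the pooled frame features.
        q_temporal = question_vec[:, None, :].expand(B, T, -1)
        temporal_logits = self.temporal_score(
            torch.tanh(self.temporal_proj(torch.cat([frame_feats, q_temporal], dim=-1)))
        )                                                    # (B, T, 1)
        temporal_attn = F.softmax(temporal_logits, dim=1)
        video_vec = (temporal_attn * frame_feats).sum(dim=1)  # (B, D)
        return video_vec  # joint question-conditioned video representation
```

In a full encoder-decoder setup of the kind described, this joint representation would be fed to a recurrent decoder that generates the open-ended answer token by token.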


2019
Author(s): Hongyin Luo, Mitra Mohtarami, James Glass, Karthik Krishnamurthy, Brigitte Richardson

2018, Vol 314, pp. 386-393
Author(s): Wenqing Chu, Hongyang Xue, Zhou Zhao, Deng Cai, Chengwei Yao
