Fusing Temporally Distributed Multi-Modal Semantic Clues for Video Question Answering

2021 IEEE International Conference on Multimedia and Expo (ICME) ◽

10.1109/icme51207.2021.9428225 ◽

2021 ◽

Author(s):

Fuwei Zhang ◽

Ruomei Wang ◽

Songhua Xu ◽

Fan Zhou

Keyword(s):

Question Answering ◽

Modal Semantic ◽

Video Question Answering


Integrating Video Retrieval and Moment Detection in a Unified Corpus for Video Question Answering

10.21437/interspeech.2019-1736 ◽

2019 ◽

Author(s):

Hongyin Luo ◽

Mitra Mohtarami ◽

James Glass ◽

Karthik Krishnamurthy ◽

Brigitte Richardson

Keyword(s):

Question Answering ◽

Video Retrieval ◽

Video Question Answering


Temporal Attention and Consistency Measuring for Video Question Answering

Proceedings of the 2020 International Conference on Multimodal Interaction ◽

10.1145/3382507.3418886 ◽

2020 ◽

Author(s):

Lingyu Zhang ◽

Richard J. Radke

Keyword(s):

Question Answering ◽

Temporal Attention ◽

Video Question Answering


BVideoQA: Online English/Chinese bilingual video question answering

Journal of the American Society for Information Science and Technology ◽

10.1002/asi.21002 ◽

2009 ◽

Vol 60 (3) ◽

pp. 509-525 ◽

Author(s):

Yue-Shi Lee ◽

Yu-Chieh Wu ◽

Jie-Chi Yang

Keyword(s):

Question Answering ◽

Video Question Answering


Multi-interaction Network with Object Relation for Video Question Answering

Proceedings of the 27th ACM International Conference on Multimedia ◽

10.1145/3343031.3351065 ◽

2019 ◽

Author(s):

Weike Jin ◽

Zhou Zhao ◽

Mao Gu ◽

Jun Yu ◽

Jun Xiao ◽

...

Keyword(s):

Question Answering ◽

Interaction Network ◽

Object Relation ◽

Video Question Answering


Explore Multi-Step Reasoning in Video Question Answering

Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild - CoVieW'18 ◽

10.1145/3265987.3265996 ◽

2018 ◽

Author(s):

Yahong Han

Keyword(s):

Question Answering ◽

Video Question Answering


Hierarchical Conditional Relation Networks for Video Question Answering

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) ◽

10.1109/cvpr42600.2020.00999 ◽

2020 ◽

Author(s):

Thao Minh Le ◽

Vuong Le ◽

Svetha Venkatesh ◽

Truyen Tran

Keyword(s):

Question Answering ◽

Video Question Answering


Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) ◽

10.1109/cvpr.2019.00210 ◽

2019 ◽

Author(s):

Chenyou Fan ◽

Xiaofan Zhang ◽

Shu Zhang ◽

Wensheng Wang ◽

Chi Zhang ◽

...

Keyword(s):

Question Answering ◽

Attention Model ◽

Multimodal Attention ◽

Video Question Answering


The forgettable-watcher model for video question answering

Neurocomputing ◽

10.1016/j.neucom.2018.06.069 ◽

2018 ◽

Vol 314 ◽

pp. 386-393 ◽

Author(s):

Wenqing Chu ◽

Hongyang Xue ◽

Zhou Zhao ◽

Deng Cai ◽

Chengwei Yao

Keyword(s):

Question Answering ◽

Video Question Answering


Hierarchical Relational Attention for Video Question Answering

2018 25th IEEE International Conference on Image Processing (ICIP) ◽

10.1109/icip.2018.8451103 ◽

2018 ◽

Author(s):

Muhammad Iqbal Hasan Chowdhury ◽

Kien Nguyen ◽

Sridha Sridharan ◽

Clinton Fookes

Keyword(s):

Question Answering ◽

Video Question Answering


Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018658 ◽

2019 ◽

Vol 33 ◽

pp. 8658-8665 ◽

Author(s):

Xiangpeng Li ◽

Jingkuan Song ◽

Lianli Gao ◽

Xianglong Liu ◽

Wenbing Huang ◽

...

Keyword(s):

Question Answering ◽

State Of The Art ◽

Computation Time ◽

Comparable Result ◽

Video Encoding ◽

Visual Question Answering ◽

Proposed Model ◽

Ablation Study ◽

Video Question Answering

Most recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a Positional Self-Attention that calculates the response at each position by attending to all positions within the same sequence and then adds representations of absolute positions. PSAC can therefore exploit the global dependencies of the question and the temporal information in the video, and allows question encoding and video encoding to be executed in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize the co-attention mechanism by simultaneously modeling “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results on four tasks of the benchmark dataset show that our model significantly outperforms the state of the art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
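The positional self-attention step described in the abstract (attend to all positions of a sequence, then add absolute position representations) can be illustrated with a minimal sketch. This is not the PSAC implementation: the function names are hypothetical, the sinusoidal form of the position encoding is an assumption borrowed from common Transformer practice, and the learned query/key/value projections that a real model would use are omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sinusoidal_positions(length, dim):
    """Absolute position encodings (sinusoidal form is an assumption)."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def positional_self_attention(seq):
    """seq: list of equal-length feature vectors (video frames or word embeddings).

    Simplification: queries, keys, and values are the raw inputs;
    no learned projections are applied.
    """
    length, dim = len(seq), len(seq[0])
    scale = math.sqrt(dim)
    pos = sinusoidal_positions(length, dim)
    out = []
    for i in range(length):
        # Score position i against every position in the same sequence.
        scores = [
            sum(a * b for a, b in zip(seq[i], seq[j])) / scale
            for j in range(length)
        ]
        weights = softmax(scores)
        # The response at position i is the attention-weighted sum of all positions.
        attended = [
            sum(weights[j] * seq[j][d] for j in range(length))
            for d in range(dim)
        ]
        # Then add the representation of the absolute position.
        out.append([attended[d] + pos[i][d] for d in range(dim)])
    return out


frames = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
encoded = positional_self_attention(frames)
```

Because each output row depends only on the full input sequence and not on the previous output row, the loop over positions has no sequential dependency, which is what lets this kind of encoder run in parallel where an RNN cannot.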
