Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Author(s):  
Xiangpeng Li ◽  
Jingkuan Song ◽  
Lianli Gao ◽  
Xianglong Liu ◽  
Wenbing Huang ◽  
...  

Most recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in the machine translation task, we propose a Positional Self-Attention that calculates the response at each position by attending to all positions within the same sequence and then adds representations of the absolute positions. PSAC can therefore exploit the global dependencies of the question and the temporal information in the video, and it allows question and video encoding to proceed in parallel. Furthermore, in addition to attending to the video features relevant to the given question (i.e., video attention), we utilize a co-attention mechanism that simultaneously models “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results on four tasks of the benchmark dataset show that our model significantly outperforms the state of the art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
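
The positional self-attention described above can be illustrated with a minimal PyTorch sketch: learned absolute-position embeddings are added to the sequence, and every position attends to every other position in a single parallel step. This is not the authors' released code; the module name, dimensions, and sequence length below are illustrative assumptions.

```python
# Minimal sketch of positional self-attention over a video/question sequence.
# Assumes PyTorch; names and dimensions are illustrative, not the paper's code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalSelfAttention(nn.Module):
    def __init__(self, d_model=512, max_len=128):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)    # absolute position embeddings
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)                        # add absolute position information
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        attn = F.softmax(scores, dim=-1)                 # each position attends to all positions
        return attn @ v                                  # computed in parallel, no recurrence

frames = torch.randn(2, 32, 512)                         # e.g. 32 frame features per clip
out = PositionalSelfAttention()(frames)                  # (2, 32, 512)
```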

Author(s):  
Zihao Zhu ◽  
Jing Yu ◽  
Yujing Wang ◽  
Yajing Sun ◽  
Yue Hu ◽  
...  

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise into reasoning about the final answer. How to capture question-oriented and information-complementary evidence remains a key challenge. In this paper, we depict an image as a multi-modal heterogeneous graph containing multiple layers of information corresponding to visual, semantic, and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture the evidence from different layers that is most relevant to the given question. Specifically, intra-modal graph convolution selects evidence from each modality, and cross-modal graph convolution aggregates relevant information across different graph layers. By stacking this process multiple times, our model performs iterative reasoning across the three modalities and predicts the optimal answer by analyzing all question-oriented evidence. We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments.
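
A rough sketch of one reasoning step over such a multi-layer graph is given below, assuming (for simplicity) that the visual, semantic, and factual layers share an aligned node set. All class and variable names are placeholders, not the paper's implementation.

```python
# Rough sketch of one intra-modal + cross-modal graph convolution step over
# three aligned graph layers (visual, semantic, factual). PyTorch; node
# alignment across layers and all names are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroGraphStep(nn.Module):
    def __init__(self, dim=256, modalities=3):
        super().__init__()
        self.intra = nn.ModuleList(nn.Linear(dim, dim) for _ in range(modalities))
        self.cross = nn.ModuleList(nn.Linear(dim, dim) for _ in range(modalities))
        self.q_attn = nn.Linear(dim, dim)

    def forward(self, feats, adjs, q):
        # feats: list of (n_nodes, dim) per modality; adjs: list of (n_nodes, n_nodes); q: (dim,)
        # Intra-modal convolution: propagate evidence within each layer.
        intra = [F.relu(adj @ self.intra[m](x)) for m, (x, adj) in enumerate(zip(feats, adjs))]
        out = []
        for m, x in enumerate(intra):
            # Cross-modal step: question-guided aggregation from the other layers.
            others = torch.stack([intra[k] for k in range(len(intra)) if k != m])  # (M-1, n, dim)
            gate = torch.softmax(others @ self.q_attn(q), dim=0).unsqueeze(-1)     # relevance to q
            out.append(x + self.cross[m]((gate * others).sum(0)))
        return out

feats = [torch.randn(10, 256) for _ in range(3)]
adjs = [torch.eye(10) for _ in range(3)]
updated = HeteroGraphStep()(feats, adjs, torch.randn(256))
```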


Author(s):  
Xinmeng Li ◽  
Mamoun Alazab ◽  
Qian Li ◽  
Keping Yu ◽  
Quanjun Yin

Abstract Knowledge graph question answering is an important technology in intelligent human–robot interaction, which aims at automatically answering natural language questions over a given knowledge graph. For multi-relation questions, which are more varied and complex, the tokens of the question have different priorities for triple selection in the reasoning steps. Most existing models take the question as a whole and ignore this priority information. To solve this problem, we propose a question-aware memory network for multi-hop question answering, named QA2MN, which updates the attention over the question at each step of the reasoning process. In addition, we incorporate graph context information into the knowledge graph embedding model to increase its ability to represent entities and relations. We use it to initialize the QA2MN model and fine-tune it during training. We evaluate QA2MN on PathQuestion and WorldCup2014, two representative datasets for complex multi-hop question answering. The results demonstrate that QA2MN achieves state-of-the-art Hits@1 accuracy on both datasets, which validates the effectiveness of our model.
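
The idea of re-attending to the question at every hop can be sketched as follows in PyTorch. The hop count, dimensions, and names are illustrative assumptions rather than the published QA2MN code.

```python
# Minimal sketch of a question-aware memory hop: re-attend to question tokens at
# every reasoning step, then read from the triple memory. Names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAwareHop(nn.Module):
    def __init__(self, dim=200):
        super().__init__()
        self.q_score = nn.Linear(dim, 1)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, state, q_tokens, memory):
        # state: (dim,), q_tokens: (n_words, dim), memory: (n_triples, dim)
        # 1) refresh the attention over question words given the current state
        q_attn = F.softmax(self.q_score(q_tokens * state), dim=0)       # (n_words, 1)
        q_focus = (q_attn * q_tokens).sum(0)                            # (dim,)
        # 2) read the triples most relevant to the focused part of the question
        m_attn = F.softmax(memory @ q_focus, dim=0)                     # (n_triples,)
        read = m_attn @ memory                                          # (dim,)
        # 3) update the reasoning state for the next hop
        return torch.tanh(self.update(torch.cat([state + read, q_focus])))

hop = QuestionAwareHop()
state = torch.zeros(200)
for _ in range(2):                                                      # e.g. a two-hop question
    state = hop(state, torch.randn(8, 200), torch.randn(50, 200))
```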


2020 ◽  
Vol 34 (07) ◽  
pp. 13041-13049 ◽  
Author(s):  
Luowei Zhou ◽  
Hamid Palangi ◽  
Lei Zhang ◽  
Houdong Hu ◽  
Jason Corso ◽  
...  

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
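
The key mechanism, controlling the prediction context purely through self-attention masks on a shared transformer, can be sketched as below. The mask layout (image regions first, then text tokens, with 1 meaning "may attend") follows the common UniLM-style convention and is an assumption, not the released VLP code.

```python
# Sketch of the two self-attention masks that distinguish the objectives:
# bidirectional (everything sees everything) vs. seq2seq (text sees the image
# and only earlier text). PyTorch; segment sizes are illustrative assumptions.
import torch

def attention_masks(n_img, n_txt):
    n = n_img + n_txt
    bidirectional = torch.ones(n, n)                 # full context for bidirectional prediction
    seq2seq = torch.zeros(n, n)
    seq2seq[:, :n_img] = 1                           # every position may attend to image regions
    causal = torch.tril(torch.ones(n_txt, n_txt))    # text attends only to earlier text tokens
    seq2seq[n_img:, n_img:] = causal
    return bidirectional, seq2seq                    # 1 = allowed, 0 = masked out

bi, s2s = attention_masks(n_img=3, n_txt=4)
```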


2019 ◽  
Vol 2019 ◽  
pp. 1-12
Author(s):  
Canghong Jin ◽  
Zhiwei Lin ◽  
Minghui Wu

Human trajectory prediction is an essential task for various applications such as travel recommendation, location-sensitive advertisement, and traffic planning. Most existing approaches are based on sequential models and produce a prediction by mining behavior patterns. However, the effectiveness of pattern-based methods is not as good as expected in real-life conditions, such as sparse or missing data. Moreover, due to the technical limitations of sensors or the traffic situation at a given time, people going to the same place may produce different trajectories. Even for people traveling along the same route, the observed transit records are not exactly the same. Trajectories are therefore highly diverse, and extracting user intention from them is difficult. In this paper, we propose an augmented-intention recurrent neural network (AI-RNN) model to predict locations in diverse trajectories. We first propose three strategies for generating graph structures that capture travel context and then leverage graph convolutional networks to augment user travel intentions under the graph view. Finally, we use gated recurrent units with the augmented node vectors to predict human trajectories. We experiment with two representative real-life datasets and evaluate the performance of the proposed model by comparing its results with those of other state-of-the-art models. The results demonstrate that the AI-RNN model outperforms other methods in terms of top-k accuracy, especially in scenarios with low similarity.
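
A compact sketch of this pipeline, graph convolution over a travel-context graph to augment location vectors, followed by a GRU over the visited sequence, is shown below. The graph construction and every name are simplifying assumptions, not the authors' implementation.

```python
# Rough sketch of the AI-RNN idea under simplifying assumptions: location nodes
# are first enriched by a graph convolution over a travel-context graph, and a
# GRU over the augmented visited-location sequence scores the next location.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AIRNNSketch(nn.Module):
    def __init__(self, n_locations=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(n_locations, dim)
        self.gcn = nn.Linear(dim, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_locations)

    def forward(self, trajectories, adj):
        # trajectories: (batch, seq_len) location ids; adj: (n_locations, n_locations) context graph
        nodes = F.relu(self.gcn(adj @ self.emb.weight))       # graph-augmented location vectors
        x = nodes[trajectories]                               # look up augmented vectors per visit
        h, _ = self.gru(x)
        return self.out(h[:, -1])                             # scores over the next location

model = AIRNNSketch()
scores = model(torch.randint(0, 1000, (4, 10)), torch.eye(1000))   # (4, 1000)
```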


Author(s):  
Zhou Zhao ◽  
Qifan Yang ◽  
Deng Cai ◽  
Xiaofei He ◽  
Yueting Zhuang

Open-ended video question answering is a challenging problem in visual information retrieval, which automatically generates a natural language answer from the referenced video content according to the question. However, existing visual question answering works focus only on static images and may be ineffective for video question answering due to the temporal dynamics of video content. In this paper, we consider the problem of open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video contents according to the given question. We then develop an encoder-decoder learning method with reasoning recurrent neural networks for open-ended video question answering. We construct a large-scale video question answering dataset. Extensive experiments show the effectiveness of our method.
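
A minimal two-level (spatial, then temporal) question-guided attention can be sketched as follows; the hierarchical and decoder components of the full model are omitted, and all names and shapes are assumptions.

```python
# Minimal sketch of question-guided spatio-temporal attention: attend over
# regions within each frame, then over frames, to get one video vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)
        self.temporal = nn.Linear(dim, dim)

    def forward(self, regions, question):
        # regions: (n_frames, n_regions, dim), question: (dim,)
        s = F.softmax(regions @ self.spatial(question), dim=1)      # spatial weights (n_frames, n_regions)
        frames = (s.unsqueeze(-1) * regions).sum(1)                 # per-frame summaries (n_frames, dim)
        t = F.softmax(frames @ self.temporal(question), dim=0)      # temporal weights (n_frames,)
        return t @ frames                                           # question-aware video code (dim,)

video_code = SpatioTemporalAttention()(torch.randn(20, 36, 512), torch.randn(512))
```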


Author(s):  
Yang Liu ◽  
Quanxue Gao ◽  
Zhaohua Yang ◽  
Shujian Wang

Due to the importance and efficiency of learning the complex structures hidden in data, graph-based methods have been widely studied and have proven successful in unsupervised learning. Generally, most existing graph-based clustering methods require post-processing on the original data graph to extract the clustering indicators. However, these methods have two drawbacks: (1) the cluster structures are not explicit in the clustering results; (2) the final clustering performance is sensitive to the construction of the original data graph. To solve these problems, in this paper, a novel learning model is proposed that learns a graph based on the given data graph such that the newly obtained optimal graph is more suitable for the clustering task. We also propose an efficient algorithm to solve the model. Extensive experimental results illustrate that the proposed model outperforms other state-of-the-art clustering algorithms.
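
For context, the "post-processing on the original data graph" that such methods rely on is typically spectral: Laplacian eigenvectors serve as clustering indicators and are then fed to k-means. A sketch of that conventional baseline pipeline (not the proposed model) is given below; the kernel width and other choices are assumptions.

```python
# Sketch of the conventional pipeline the paper improves on: build a similarity
# graph, then post-process it (Laplacian eigenvectors + k-means) to get clusters.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Similarity graph from pairwise distances (Gaussian kernel, an assumption).
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    # Normalized graph Laplacian.
    d = W.sum(1)
    L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))
    # Post-processing: the k smallest eigenvectors act as clustering indicators...
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    # ...followed by k-means, which is where sensitivity to the input graph shows up.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

labels = spectral_clustering(np.random.rand(100, 5), k=3)
```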


2020 ◽  
Vol 53 (8) ◽  
pp. 5705-5745
Author(s):  
Sruthy Manmadhan ◽  
Binsu C. Kovoor

2020 ◽  
Vol 34 (07) ◽  
pp. 11021-11028 ◽  
Author(s):  
Deng Huang ◽  
Peihao Chen ◽  
Runhao Zeng ◽  
Qing Du ◽  
Mingkui Tan ◽  
...  

We address the challenging task of video question answering, which requires machines to answer questions about videos in natural language form. Previous state-of-the-art methods apply spatio-temporal attention mechanisms to video frame features without explicitly modeling the locations of, and relations among, the object interactions occurring in videos. However, the relations between object interactions and their location information are critical for both action recognition and question reasoning. In this work, we propose to represent the contents of a video as a location-aware graph by incorporating the location information of each object into the graph construction. Here, each node is associated with an object represented by its appearance and location features. Based on the constructed graph, we propose to use graph convolution to infer both the category and the temporal locations of an action. As the graph is built on objects, our method is able to focus on the foreground action content for better video question answering. Lastly, we leverage an attention mechanism to combine the output of the graph convolution with encoded question features for final answer reasoning. Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, our method significantly outperforms state-of-the-art methods on the TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
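
The location-aware graph construction can be sketched roughly as follows: node features concatenate appearance with normalized box coordinates, edges come from feature similarity, and one graph convolution propagates interaction information. The question fusion and answer head are omitted, and all names are assumptions.

```python
# Rough sketch of a location-aware object graph: each node concatenates an
# object's appearance feature with its normalized box, edges come from feature
# similarity, and a graph convolution propagates interaction information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareGraph(nn.Module):
    def __init__(self, app_dim=2048, loc_dim=4, dim=512):
        super().__init__()
        self.node = nn.Linear(app_dim + loc_dim, dim)   # appearance + (x, y, w, h)
        self.gcn = nn.Linear(dim, dim)

    def forward(self, appearance, boxes):
        # appearance: (n_obj, app_dim), boxes: (n_obj, 4) normalized to [0, 1]
        x = F.relu(self.node(torch.cat([appearance, boxes], dim=-1)))
        adj = F.softmax(x @ x.t() / x.size(-1) ** 0.5, dim=-1)   # similarity-based edges
        return F.relu(self.gcn(adj @ x))                         # location-aware node features

nodes = LocationAwareGraph()(torch.randn(10, 2048), torch.rand(10, 4))
```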


2021 ◽  
Author(s):  
Pufen Zhang ◽  
Hong Lan

Abstract In recent years, several visual question answering (VQA) methods that emphasize simultaneous understanding of both image and question context have been proposed. Despite their effectiveness, these methods fail to explore more comprehensive and generalized context learning tactics. To address this issue, we propose a novel Multiple Context Learning Network (MCLN) to model multiple contexts for VQA. Three kinds of contexts are investigated, namely the visual context, the textual context, and a special visual-textual context that is ignored by previous methods. Moreover, three corresponding context learning modules are proposed. These modules endow image and text representations with context-aware information based on a uniform context learning strategy, and together they form a multiple context learning layer (MCL). Such layers can be stacked in depth to describe high-level context information by associating intra-modal contexts with the inter-modal context. On the VQA v2.0 dataset, the proposed model achieves 71.05% and 71.48% on the test-dev and test-std sets respectively, outperforming the previous state-of-the-art methods. In addition, extensive ablation studies have been carried out to examine the effectiveness of the proposed method.
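
One multiple-context layer might look like the sketch below, with intra-modal self-attention for the visual and textual contexts and a cross-modal attention for the visual-textual context, stackable in depth. Head counts and names are illustrative assumptions, not the MCLN implementation.

```python
# Compact sketch of one multiple-context layer: intra-modal self-attention for
# the visual and textual contexts plus a cross-modal (visual-textual) attention.
import torch
import torch.nn as nn

class MultipleContextLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.visual_ctx = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_ctx = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_ctx = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v, t):
        v = v + self.visual_ctx(v, v, v)[0]     # visual context (intra-modal)
        t = t + self.text_ctx(t, t, t)[0]       # textual context (intra-modal)
        v = v + self.cross_ctx(v, t, t)[0]      # visual-textual context (inter-modal)
        return v, t

layers = nn.ModuleList(MultipleContextLayer() for _ in range(4))   # stacked in depth
v, t = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
for layer in layers:
    v, t = layer(v, t)
```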


2020 ◽  
Vol 34 (07) ◽  
pp. 11109-11116 ◽  
Author(s):  
Pin Jiang ◽  
Yahong Han

The dominant video question answering methods are based on fine-grained representations or model-specific attention mechanisms. They usually process the video and the question separately, then feed the representations of the different modalities into subsequent late-fusion networks. Although these methods use information from one modality to boost the other, they neglect to integrate both inter- and intra-modality correlations in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture in four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and made to interact simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct extensive ablation studies on the effectiveness of the network architecture. Experiments show the network to be superior in quality.
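
The idea of handling inter- and intra-modality correlations in one module can be sketched by placing shot and word nodes in a single graph and propagating over the joint adjacency, as below. Projections, dimensions, and names are assumptions rather than the paper's code.

```python
# Sketch of aligning both modalities in one graph: video-shot and question-word
# nodes share a single adjacency, so intra- and inter-modality correlations are
# propagated in the same step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroAlignment(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.shot_proj = nn.Linear(dim, dim)
        self.word_proj = nn.Linear(dim, dim)
        self.gcn = nn.Linear(dim, dim)

    def forward(self, shots, words):
        # shots: (n_shots, dim), words: (n_words, dim)
        nodes = torch.cat([self.shot_proj(shots), self.word_proj(words)], dim=0)
        adj = F.softmax(nodes @ nodes.t() / nodes.size(-1) ** 0.5, dim=-1)   # joint heterogeneous graph
        aligned = F.relu(self.gcn(adj @ nodes))                              # cross-modal reasoning step
        return aligned[: shots.size(0)], aligned[shots.size(0):]

shots, words = HeteroAlignment()(torch.randn(16, 512), torch.randn(12, 512))
```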

