Reasoning with Heterogeneous Graph Alignment for Video Question Answering

Pin Jiang; Yahong Han

doi:10.1609/aaai.v34i07.6767

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6767 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11109-11116 ◽

Cited By ~ 1

Author(s):

Pin Jiang ◽

Yahong Han

Keyword(s):

Network Architecture ◽

Question Answering ◽

The Other ◽

Late Fusion ◽

Fine Grained ◽

Graph Alignment ◽

Modal Reasoning ◽

Benchmark Datasets ◽

Ablation Study ◽

Video Question Answering

The dominant video question answering methods are based on fine-grained representations or model-specific attention mechanisms. They usually process video and question separately, then feed the representations of the different modalities into subsequent late fusion networks. Although these methods use information from one modality to boost the other, they neglect to integrate both inter- and intra-modality correlations in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture in four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and can interact simultaneously over the heterogeneous graph and be used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct an extensive ablation study to verify the effectiveness of the network architecture. Experiments show the network to be superior in quality.
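
The idea of a single graph over video shots and question words can be illustrated with a minimal sketch. This is not the authors' network: the function name `align_heterogeneous_graph`, the dot-product edge scores, and the single softmax-weighted update are assumptions chosen for illustration, and shot and word features are assumed to be already projected to the same dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_heterogeneous_graph(shots, words):
    """One alignment step over a joint graph of shot nodes and word nodes.

    Hypothetical simplification: edge weights are softmax-normalised dot
    products, so every row mixes intra-modality edges (shot-shot, word-word)
    and inter-modality edges (shot-word) in one module.
    """
    nodes = shots + words  # one node set for both modalities
    dim = len(nodes[0])
    edges = [
        softmax([sum(a * b for a, b in zip(n, m)) for m in nodes])
        for n in nodes
    ]
    # Each node becomes the weighted mean of all nodes it is connected to.
    updated = [
        [sum(w * m[d] for w, m in zip(row, nodes)) for d in range(dim)]
        for row in edges
    ]
    return updated[:len(shots)], updated[len(shots):], edges
```

Calling `align_heterogeneous_graph([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])` returns updated shot features, updated word features, and a 3 x 3 edge matrix whose rows each sum to 1.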

Feature Augmented Memory with Global Attention Network for VideoQA

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/139 ◽

2020 ◽

Author(s):

Jiayin Cai ◽

Chun Yuan ◽

Cheng Shi ◽

Lei Li ◽

Yangyang Cheng ◽

...

Keyword(s):

Question Answering ◽

Temporal Order ◽

State Of The Art ◽

Memory Capacity ◽

Coarse Grain ◽

Fine Grained ◽

Feature Representations ◽

Benchmark Datasets ◽

High Level ◽

Video Question Answering

Recently, Recurrent Neural Network (RNN) based methods and Self-Attention (SA) based methods have achieved promising performance in Video Question Answering (VideoQA). Despite the success of these works, RNN-based methods tend to forget the global semantic content due to the inherent drawbacks of the recurrent units themselves, while SA-based methods cannot precisely capture the dependencies of the local neighborhood, leading to insufficient modeling of temporal order. To tackle these problems, we propose a novel VideoQA framework which progressively refines the representations of videos and questions from fine to coarse grain in a sequence-sensitive manner. Specifically, our model improves the feature representations via the following two steps: (1) introducing two fine-grained feature-augmented memories to strengthen the information augmentation of video and text, which improves memory capacity by memorizing more relevant and targeted information; and (2) appending the self-attention and co-attention module to the memory output, so that the module can capture global interactions between high-level semantic information. Experimental results show that our approach achieves state-of-the-art performance on VideoQA benchmark datasets.

Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018658 ◽

2019 ◽

Vol 33 ◽

pp. 8658-8665 ◽

Cited By ~ 10

Author(s):

Xiangpeng Li ◽

Jingkuan Song ◽

Lianli Gao ◽

Xianglong Liu ◽

Wenbing Huang ◽

...

Keyword(s):

Question Answering ◽

State Of The Art ◽

Computation Time ◽

Comparable Result ◽

Video Encoding ◽

Visual Question Answering ◽

Proposed Model ◽

Ablation Study ◽

The Given ◽

Video Question Answering

Most of the recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in the machine translation task, we propose a Positional Self-Attention to calculate the response at each position by attending to all positions within the same sequence, and then add representations of absolute positions. Therefore, PSAC can exploit the global dependencies of the question and the temporal information in the video, and allows question and video encoding to be executed in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize the co-attention mechanism by simultaneously modeling “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results of four tasks on the benchmark dataset show that our model significantly outperforms the state of the art on three tasks and attains comparable results on the Count task. Our model requires less computation time and achieves better performance compared with the RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
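
The two operations named in the abstract (attend to all positions of the same sequence, then add absolute position representations) can be sketched as follows. This is a simplified illustration, not the PSAC implementation: it omits learned query/key/value projections, and the sinusoidal form of the position encoding and the function names are assumptions.

```python
import math


def sinusoidal_positions(length, dim):
    """Absolute position encodings (sinusoidal form is an assumption here)."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positional_self_attention(seq):
    """seq: list of equal-length feature vectors (frames or word embeddings)."""
    dim = len(seq[0])
    scale = math.sqrt(dim)
    attended = []
    for query in seq:
        # Every position attends to all positions in the same sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in seq]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim)]
        )
    # Absolute position representations are added after attention.
    positions = sinusoidal_positions(len(seq), dim)
    return [[a + p for a, p in zip(row, pos)] for row, pos in zip(attended, positions)]
```

Because no step depends on the output of the previous position, all rows can be computed independently, which is the property that lets such an encoder run in parallel where an RNN cannot.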

Enhancement of Target-Oriented Opinion Words Extraction with Multiview-Trained Machine Reading Comprehension Model

Computational Intelligence and Neuroscience ◽

10.1155/2021/6645871 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Jingyuan Zhang ◽

Zequn Zhang ◽

Zhi Guo ◽

Li Jin ◽

Kang Liu ◽

...

Keyword(s):

Reading Comprehension ◽

Question Answering ◽

Opinion Mining ◽

Common Knowledge ◽

Multiple Perspectives ◽

Fine Grained ◽

Proposed Model ◽

Meta Learning ◽

Benchmark Datasets ◽

Machine Reading

Target-oriented opinion words extraction (TOWE) seeks to identify opinion expressions oriented to a specific target, and it is a crucial step toward fine-grained opinion mining. Recent neural networks have achieved significant success in this task by building target-aware representations. However, two limitations of these methods still hinder the progress of TOWE. Mainstream approaches typically utilize position indicators to mark the given target, which is a naive strategy and lacks task-specific semantic meaning. Meanwhile, the annotated target-opinion pairs contain rich latent structural knowledge from multiple perspectives, but existing methods only exploit the TOWE view. To tackle these issues, we formulate the TOWE task as a question answering (QA) problem and leverage a machine reading comprehension (MRC) model trained with a multiview paradigm to extract targeted opinions. Specifically, we introduce a template-based pseudo-question generation method and utilize deep attention interaction to build target-aware context representations and extract related opinion words. To take advantage of latent structural correlations, we further cast the opinion-target structure into three distinct yet correlated views and leverage meta-learning to aggregate common knowledge among them to enhance the TOWE task. We evaluate the proposed model on four benchmark datasets, and our method achieves new state-of-the-art results. Extended experiments show that the pipeline method with our approach could surpass existing opinion pair extraction models, including joint methods that are usually believed to work better.

ADD: Attention-Based DeepFake Detection Approach

Big Data and Cognitive Computing ◽

10.3390/bdcc5040049 ◽

2021 ◽

Vol 5 (4) ◽

pp. 49

Author(s):

Aminollah Khormali ◽

Jiann-Shiun Yuan

Keyword(s):

Digital Media ◽

Network Architecture ◽

Data Augmentation ◽

Input Image ◽

Detection Methods ◽

Generative Adversarial Networks ◽

Fine Grained ◽

Detection Model ◽

Benchmark Datasets ◽

Main Components

Recent advancements in Generative Adversarial Networks (GANs) pose emerging yet serious privacy risks that threaten the integrity and trustworthiness of digital media, specifically digital video, through the synthesis of hyper-realistic images and videos, i.e., DeepFakes. The need to ascertain the trustworthiness of digital media calls for automatic yet accurate DeepFake detection algorithms. This paper presents an attention-based DeepFake detection (ADD) method that exploits the fine-grained and spatial locality attributes of artificially synthesized videos for enhanced detection. The ADD framework is composed of two main components, the face close-up and face shut-off data augmentation methods, and is applicable to any classifier based on a convolutional neural network architecture. ADD first locates potentially manipulated areas of the input image to extract representative features. Second, the detection model is forced to pay more attention to these forgery regions in the decision-making process through a particular focus on interpreting the sample in the learning phase. ADD’s performance is evaluated on two challenging datasets of DeepFake forensics, i.e., Celeb-DF (V2) and WildDeepFake. We demonstrate the generalization of ADD by evaluating four popular classifiers, namely VGGNet, ResNet, Xception, and MobileNet. The obtained results demonstrate that ADD can boost the detection performance of all four baseline classifiers significantly on both benchmark datasets. In particular, ADD with a ResNet backbone detects DeepFakes with more than 98.3% on Celeb-DF (V2), outperforming state-of-the-art DeepFake detection methods.

Knowledge Graph Question Answering Using Graph-Pattern Isomorphism

10.3233/ssw210038 ◽

2021 ◽

Author(s):

Daniel Vollmers ◽

Rricha Jalota ◽

Diego Moussallem ◽

Hardik Topiwala ◽

Axel-Cyrille Ngonga Ngomo ◽

...

Keyword(s):

Question Answering ◽

State Of The Art ◽

Machine Learning Algorithms ◽

Training Data ◽

Knowledge Graph ◽

Fine Grained ◽

Art Performance ◽

Training Examples ◽

Ablation Study ◽

Basic Graph

Knowledge Graph Question Answering (KGQA) systems are often based on machine learning algorithms, requiring thousands of question-answer pairs as training examples or natural language processing pipelines that need module fine-tuning. In this paper, we present a novel QA approach, dubbed TeBaQA. Our approach learns to answer questions based on graph isomorphisms from basic graph patterns of SPARQL queries. Learning basic graph patterns is efficient due to the small number of possible patterns. This novel paradigm reduces the amount of training data necessary to achieve state-of-the-art performance. TeBaQA also speeds up the domain adaptation process by transforming the QA system development task into a much smaller and easier data compilation task. In our evaluation, TeBaQA achieves state-of-the-art performance on QALD-8 and delivers comparable results on QALD-9 and LC-QuAD v1. Additionally, we performed a fine-grained evaluation on complex queries that deal with aggregation and superlative questions as well as an ablation study, highlighting future research challenges.

Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey

IEEE Access ◽

10.1109/access.2021.3058248 ◽

2021 ◽

Vol 9 ◽

pp. 43799-43823

Author(s):

Khushboo Khurana ◽

Umesh Deshpande

Keyword(s):

Question Answering ◽

Evaluation Metrics ◽

Video Captioning ◽

Benchmark Datasets ◽

Comprehensive Survey ◽

Video Question Answering

Step-Wise Hierarchical Alignment Network for Image-Text Matching

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/106 ◽

2021 ◽

Author(s):

Zhong Ji ◽

Kexin Chen ◽

Haoran Wang

Keyword(s):

Single Step ◽

Local Alignment ◽

Global Alignment ◽

Progressive Alignment ◽

Fine Grained ◽

Level Information ◽

Modal Reasoning ◽

Benchmark Datasets ◽

Semantic Alignment ◽

Text Matching

Image-text matching plays a central role in bridging the semantic gap between vision and language. The key to achieving precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover the visual-semantic interactions, and thus lack the ability to exploit multi-level information to locate the hierarchical fine-grained relevance. In contrast, in this work we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level, performed sequentially. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues to understand the hierarchical correlations between image and text. The experimental results on two benchmark datasets demonstrate the superiority of our proposed method.
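
Fragment-level (local-to-local) alignment can be pictured with a small sketch. The abstract does not give SHAN's scoring function, so the choice below (cosine similarity between image regions and words, best region per word, mean over words) and the name `local_alignment_score` are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_alignment_score(regions, words):
    """Hypothetical fragment-level score between an image and a sentence.

    For each word, keep the similarity of its best-matching region,
    then average those maxima over all words.
    """
    best = [max(cosine(w, r) for r in regions) for w in words]
    return sum(best) / len(best)
```

A later context-level step would compare pooled (global) features against these fragments and against each other; that part is not sketched here.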

Integrating Video Retrieval and Moment Detection in a Unified Corpus for Video Question Answering

10.21437/interspeech.2019-1736 ◽

2019 ◽

Author(s):

Hongyin Luo ◽

Mitra Mohtarami ◽

James Glass ◽

Karthik Krishnamurthy ◽

Brigitte Richardson

Keyword(s):

Question Answering ◽

Video Retrieval ◽

Video Question Answering

Drug-Drug Interaction Predicting by Neural Network Using Integrated Similarity

Scientific Reports ◽

10.1038/s41598-019-50121-3 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 10

Author(s):

Narjes Rohani ◽

Changiz Eslahchi

Keyword(s):

Neural Network ◽

Drug Interaction ◽

Side Effect ◽

Network Architecture ◽

Selection Process ◽

Superior Performance ◽

Multiple Drug ◽

Interaction Prediction ◽

Benchmark Datasets ◽

Drug Drug Interaction

Drug-Drug Interaction (DDI) prediction is one of the most critical issues in drug development and health. Proposing appropriate computational methods for predicting unknown DDIs with high precision is challenging. We propose "NDD: Neural network-based method for drug-drug interaction prediction" for predicting unknown DDIs using various information about drugs. Multiple drug similarities based on drug substructure, target, side effect, off-label side effect, pathway, transporter, and indication data are calculated. First, NDD uses a heuristic similarity selection process and then integrates the selected similarities with a nonlinear similarity fusion method to obtain high-level features. Afterward, it uses a neural network for interaction prediction. The similarity selection and similarity integration parts of NDD have been proposed in previous studies of other problems. Our novelty is to combine these parts with a new neural network architecture and apply these approaches in the context of DDI prediction. We compared NDD with six machine learning classifiers and six state-of-the-art graph-based methods on three benchmark datasets. NDD achieved superior performance in cross-validation, with AUPR ranging from 0.830 to 0.947, AUC from 0.954 to 0.994, and F-measure from 0.772 to 0.902. Moreover, cumulative evidence in case studies on numerous drug pairs further confirms the ability of NDD to predict unknown DDIs. The evaluations corroborate that NDD is an efficient method for predicting unknown DDIs. The data and implementation of NDD are available at https://github.com/nrohani/NDD.

Real-Time Instance Segmentation of Traffic Videos for Embedded Devices

Sensors ◽

10.3390/s21010275 ◽

2021 ◽

Vol 21 (1) ◽

pp. 275

Author(s):

Ruben Panero Martinez ◽

Ionut Schiopu ◽

Bruno Cornelis ◽

Adrian Munteanu

Keyword(s):

Real Time ◽

Network Architecture ◽

Training Procedure ◽

Segmentation Method ◽

Embedded Devices ◽

Network Training ◽

Assignment Algorithm ◽

Ablation Study ◽

Reduced Rate ◽

Instance Segmentation

The paper proposes a novel instance segmentation method for traffic videos devised for deployment on real-time embedded devices. A novel neural network architecture is proposed using a multi-resolution feature extraction backbone and improved network designs for the object detection and instance segmentation branches. A novel post-processing method is introduced to ensure a reduced rate of false detection by evaluating the quality of the output masks. An improved network training procedure is proposed based on a novel label assignment algorithm. An ablation study on the speed-vs.-performance trade-off further modifies the two branches and replaces the conventional ResNet-based performance-oriented backbone with a lightweight speed-oriented design. The proposed architectural variations achieve real-time performance when deployed on embedded devices. The experimental results demonstrate that the proposed instance segmentation method for traffic videos outperforms the You Only Look At CoefficienTs (YOLACT) algorithm, the state-of-the-art real-time instance segmentation method. The proposed architecture achieves qualitative results with 31.57 average precision on the COCO dataset, while its speed-oriented variations achieve speeds of up to 66.25 frames per second on the Jetson AGX Xavier module.
