Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering

2021, Vol 9, pp. 195–210
Author(s): Ben Bogin, Sanjay Subramanian, Matt Gardner, Jonathan Berant

Abstract: Answering questions that involve multi-step reasoning requires decomposing them and using the answers of intermediate steps to reach the final answer. However, state-of-the-art models in grounded question answering often do not explicitly perform decomposition, leading to difficulties in generalization to out-of-distribution examples. In this work, we propose a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner using a CKY-style parser. Our model induces latent trees, driven by end-to-end (answer) supervision only. We show that this inductive bias towards tree structures dramatically improves systematic generalization to out-of-distribution examples, compared to strong baselines on an arithmetic expressions benchmark as well as on CLOSURE, a dataset that focuses on systematic generalization for grounded question answering. On this challenging dataset, our model reaches an accuracy of 96.1%, significantly higher than prior models that almost perfectly solve the task on a random, in-distribution split.
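As a rough illustration of the bottom-up, CKY-style composition this abstract describes, the following minimal PyTorch sketch builds a representation for every question span from soft-weighted combinations of its sub-spans. The module names, dimensions, and soft split-scoring are assumptions for illustration, not the authors' implementation (which also computes grounded denotations per span).

```python
import torch
import torch.nn as nn

class CKYSpanComposer(nn.Module):
    """Toy bottom-up composer: each span representation is a soft mixture
    over its possible splits, filled in chart-style as in CKY parsing."""
    def __init__(self, dim: int):
        super().__init__()
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n, dim) word embeddings; chart[(i, j)] covers span [i, j)
        n, _ = tokens.shape
        chart = {(i, i + 1): tokens[i] for i in range(n)}
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                cands = torch.stack([
                    self.compose(torch.cat([chart[(i, k)], chart[(k, j)]]))
                    for k in range(i + 1, j)])                    # (splits, dim)
                w = torch.softmax(self.score(cands).squeeze(-1), dim=0)
                chart[(i, j)] = (w.unsqueeze(-1) * cands).sum(0)  # soft latent tree
        return chart[(0, n)]  # representation of the full question

composer = CKYSpanComposer(dim=16)
print(composer(torch.randn(5, 16)).shape)  # torch.Size([16])
```

Because the split weights are differentiable, tree induction of this kind can be driven by answer supervision alone, which is the inductive bias the abstract credits for the generalization gains.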

2020, Vol 34 (07), pp. 11021–11028
Author(s): Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, et al.

We addressed the challenging task of video question answering, which requires machines to answer questions about videos in natural language. Previous state-of-the-art methods apply spatio-temporal attention over video frame features without explicitly modeling the locations of, and relations among, the object interactions occurring in a video. However, these relations and their location information are critical for both action recognition and question reasoning. In this work, we propose to represent the contents of a video as a location-aware graph by incorporating the location information of each object into the graph construction. Each node is associated with an object represented by its appearance and location features. Based on the constructed graph, we propose to use graph convolution to infer both the category and the temporal locations of an action. As the graph is built on objects, our method is able to focus on the foreground action content for better video question answering. Lastly, we leverage an attention mechanism to combine the output of the graph convolution with encoded question features for final answer reasoning. Extensive experiments demonstrate the effectiveness of the proposed method; in particular, it significantly outperforms state-of-the-art methods on the TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
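To make the graph construction concrete, here is a hypothetical sketch of one location-aware graph convolution step: each object node concatenates appearance and box-location features before message passing. The dimensions and the uniform adjacency are placeholders, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LocationAwareGCN(nn.Module):
    """Toy graph layer over detected objects: each node concatenates
    appearance and box-location features before message passing."""
    def __init__(self, app_dim: int, loc_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(app_dim + loc_dim, out_dim)

    def forward(self, appearance, location, adj):
        # appearance: (N, app_dim) region features, e.g. from a detector
        # location:   (N, loc_dim) normalized box coordinates
        # adj:        (N, N) row-normalized adjacency over object nodes
        nodes = self.proj(torch.cat([appearance, location], dim=-1))
        return torch.relu(adj @ nodes)  # aggregate neighbors' messages

layer = LocationAwareGCN(app_dim=2048, loc_dim=4, out_dim=256)
N = 6
adj = torch.full((N, N), 1.0 / N)      # fully connected, uniform weights
out = layer(torch.randn(N, 2048), torch.rand(N, 4), adj)
print(out.shape)                        # torch.Size([6, 256])
```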


Author(s): Ghulam Ahmed Ansari, Amrita Saha, Vishwajeet Kumar, Mohan Bhambhani, Karthik Sankaranarayanan, et al.

Neural Program Induction (NPI) is a paradigm for decomposing high-level tasks, such as complex question answering over knowledge bases (KBQA), into executable programs by employing neural models. Typically, this involves two key phases: i) inferring input program variables from the high-level task description, and ii) generating the correct program sequence involving these variables. Here we focus on NPI for complex KBQA with only the final answer as supervision, without gold programs. This raises major challenges: i) noisy query annotation in the absence of supervision can lead to catastrophic forgetting during learning, and ii) the reward becomes extremely sparse owing to this noise. To deal with these, we propose a noise-resilient NPI model, the Stable Sparse Reward based Programmer (SSRP), which evades noise-induced instability through continual retrospection and comparison with its current learning behavior. On complex KBQA datasets, SSRP performs on par with hand-crafted rule-based models when provided with gold program input, and in noisy settings it outperforms state-of-the-art models by a significant margin even with a noisier query annotator.
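The training setup can be pictured with a heavily simplified policy-gradient sketch: a program policy receives only a sparse terminal reward (whether the executed program matched the gold answer), and successful programs are kept in a buffer for the kind of retrospection SSRP performs. Everything below (names, the replay scheme, the single-token update) is a hypothetical toy, not the SSRP algorithm.

```python
import torch
import torch.nn as nn

class ProgramPolicy(nn.Module):
    """Toy autoregressive policy over program tokens."""
    def __init__(self, vocab: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, prefix):               # prefix: (B, T) token ids
        h, _ = self.rnn(self.emb(prefix))
        return self.out(h[:, -1])            # logits for the next token

def reinforce_step(policy, optimizer, program, reward, replay_buffer):
    # Sparse reward: nonzero only if the executed program yields the answer.
    logits = policy(program[:, :-1])
    logp = torch.log_softmax(logits, dim=-1).gather(1, program[:, -1:])
    loss = -(reward * logp).mean()
    if reward > 0:
        replay_buffer.append(program)        # retrospect on past successes
    optimizer.zero_grad(); loss.backward(); optimizer.step()

policy = ProgramPolicy(vocab=50)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
program = torch.randint(0, 50, (1, 6))       # one sampled program
reinforce_step(policy, opt, program, reward=1.0, replay_buffer=[])
```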


2020, Vol 34 (05), pp. 7879–7886
Author(s): Darryl Hannan, Akshay Jain, Mohit Bansal

We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities: text, images, and tables. We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs. Our questions are ambiguous, in that the modality that contains the answer is not easily determined based solely upon the question. To demonstrate this ambiguity, we construct a modality selector (or disambiguator) network, and this model gets substantially lower accuracy on our challenge set, compared to existing datasets, indicating that our questions are more ambiguous. By analyzing this model, we investigate which words in the question are indicative of the modality. Next, we construct a simple baseline ManyModalQA model, which, based on the prediction from the modality selector, fires a corresponding pre-trained state-of-the-art unimodal QA model. We focus on providing the community with a new manymodal evaluation set and only provide a fine-tuning set, with the expectation that existing datasets and approaches will be transferred for most of the training, to encourage low-resource generalization without large, monolithic training sets for each new task. There is a significant gap between our baseline models and human performance; therefore, we hope that this challenge encourages research in end-to-end modality disambiguation and multimodal QA models, as well as transfer learning.
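The baseline's dispatch logic is straightforward to sketch: a selector picks a modality, then the matching pre-trained unimodal QA model is fired. The function names and the stand-in models below are placeholders, not the paper's code.

```python
from typing import Callable, Dict

MODALITIES = ("text", "image", "table")

def answer(question: str,
           select_modality: Callable[[str], str],
           qa_models: Dict[str, Callable[[str], str]]) -> str:
    """Route a question to the unimodal QA model the selector picks."""
    modality = select_modality(question)      # e.g. "table"
    assert modality in MODALITIES
    return qa_models[modality](question)      # fire the matching QA model

# Toy stand-ins for the selector and the three unimodal models:
demo_models = {m: (lambda q, m=m: f"[{m} answer to: {q}]") for m in MODALITIES}
print(answer("How many albums are listed?", lambda q: "table", demo_models))
```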


Author(s): Sanket Shah, Anand Mishra, Naganand Yadati, Partha Pratim Talukdar

Visual Question Answering (VQA) has emerged as an important problem spanning Computer Vision, Natural Language Processing and Artificial Intelligence (AI). In conventional VQA, one may ask questions about an image which can be answered purely based on its content. For example, given an image with people in it, a typical VQA question may inquire about the number of people in the image. More recently, there is growing interest in answering questions which require commonsense knowledge involving common nouns (e.g., cats, dogs, microphones) present in the image. In spite of this progress, the important problem of answering questions requiring world knowledge about named entities (e.g., Barack Obama, White House, United Nations) in the image has not been addressed in prior research. We address this gap in this paper, and introduce KVQA – the first dataset for the task of (world) knowledge-aware VQA. KVQA consists of 183K question-answer pairs involving more than 18K named entities and 24K images. Questions in this dataset require multi-entity, multi-relation, and multi-hop reasoning over large Knowledge Graphs (KG) to arrive at an answer. To the best of our knowledge, KVQA is the largest dataset for exploring VQA over KG. Further, we also provide baseline performances using state-of-the-art methods on KVQA.
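A toy example of the multi-entity, multi-hop KG reasoning such questions require; the entities and facts below are illustrative only, not drawn from KVQA.

```python
# Minimal knowledge graph as a dict of (entity, relation) -> entity.
KG = {
    ("Barack Obama", "spouse"): "Michelle Obama",
    ("Michelle Obama", "born_in"): "Chicago",
}

def hop(entity: str, relation: str) -> str:
    """Follow one relation edge in the toy KG."""
    return KG[(entity, relation)]

# "In which city was the spouse of the person in the image born?"
person = "Barack Obama"                        # from entity linking on the image
print(hop(hop(person, "spouse"), "born_in"))   # -> Chicago
```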


2020, Vol 34 (05), pp. 9701–9708
Author(s): Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, et al.

We present JEC-QA, the largest question answering dataset in the legal domain, collected from the National Judicial Examination of China. The examination is a comprehensive evaluation of professional skills for legal practitioners; college students are required to pass it to be certified as a lawyer or a judge. The dataset is challenging for existing question answering methods because both retrieving relevant materials and answering questions require logical reasoning. Because answering legal questions demands multiple reasoning abilities, state-of-the-art models achieve only about 28% accuracy on JEC-QA, while skilled and unskilled humans reach 81% and 64% accuracy respectively, indicating a huge gap between humans and machines on this task. We will release JEC-QA and our baselines to help improve the reasoning ability of machine comprehension models. The dataset is available at http://jecqa.thunlp.org/.


Author(s): Zihao Zhu, Jing Yu, Yujing Wang, Yajing Sun, Yue Hu, et al.

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning toward the final answer. How to capture question-oriented and information-complementary evidence remains a key challenge. In this paper, we depict an image as a multi-modal heterogeneous graph containing multiple layers of information corresponding to visual, semantic, and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture the evidence from different layers that is most relevant to the given question. Specifically, intra-modal graph convolution selects evidence from each modality, and cross-modal graph convolution aggregates relevant information across different graph layers. By stacking this process multiple times, our model performs iterative reasoning across the three modalities and predicts the optimal answer by analyzing all question-oriented evidence. We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments.
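One reasoning step of the described network can be caricatured as an intra-modal graph convolution per layer, followed by cross-modal attention over the other layers' nodes. The sketch below is an assumption-laden simplification, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ModalityAwareConv(nn.Module):
    """Toy reasoning step: intra-modal graph convolution per layer, then
    cross-modal attention aggregation of the other layers' nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.intra = nn.Linear(dim, dim)
        self.cross = nn.Linear(dim, dim)

    def forward(self, layers, adjs):
        # layers: list of (N_m, dim) node features; adjs: matching (N_m, N_m)
        intra = [torch.relu(a @ self.intra(x)) for x, a in zip(layers, adjs)]
        out = []
        for m, x in enumerate(intra):
            others = torch.cat([y for k, y in enumerate(intra) if k != m])
            attn = torch.softmax(x @ others.T / x.shape[-1] ** 0.5, dim=-1)
            out.append(x + self.cross(attn @ others))  # cross-modal evidence
        return out

step = ModalityAwareConv(dim=32)
feats = [torch.randn(n, 32) for n in (5, 4, 3)]        # visual/semantic/factual
adjs = [torch.eye(n) for n in (5, 4, 3)]
print([t.shape for t in step(feats, adjs)])
```

Stacking such steps is what would let evidence propagate back and forth between the visual, semantic, and factual layers over several reasoning iterations.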


Author(s): Fei Liu, Jing Liu, Zhiwei Fang, Richang Hong, Hanqing Lu

Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common defect of existing VQA approaches is that they consider only a very limited number of interactions, which may not be enough to model the latent, complex image-question relations necessary for accurately answering questions. Therefore, in this paper, we propose a novel DCAF (Densely Connected Attention Flow) framework for modeling dense interactions. It densely connects all pairwise layers of the network via Attention Connectors, capturing fine-grained interplay between image and question across all hierarchical levels. The proposed Attention Connector efficiently connects the multi-modal features at any two layers with symmetric co-attention and produces interaction-aware attention features. Experimental results on three publicly available datasets show that the proposed method achieves state-of-the-art performance.
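A minimal sketch of one such connector, assuming a single shared affinity matrix applied symmetrically in both directions; the actual Attention Connector is parameterized differently.

```python
import torch
import torch.nn as nn

class AttentionConnector(nn.Module):
    """Toy symmetric co-attention between image regions and question
    tokens, standing in for one densely-connected pairwise link."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, img, txt):
        # img: (R, dim) region features; txt: (T, dim) token features
        affinity = img @ txt.T * self.scale               # (R, T), shared both ways
        img_ctx = torch.softmax(affinity, dim=1) @ txt    # text attended per region
        txt_ctx = torch.softmax(affinity, dim=0).T @ img  # regions attended per token
        return img + img_ctx, txt + txt_ctx               # interaction-aware features

conn = AttentionConnector(dim=64)
a, b = conn(torch.randn(36, 64), torch.randn(12, 64))
print(a.shape, b.shape)  # torch.Size([36, 64]) torch.Size([12, 64])
```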


2021, Vol 39 (2), pp. 1–26
Author(s): Shen Gao, Xiuying Chen, Zhaochun Ren, Dongyan Zhao, Rui Yan

In e-commerce portals, generating answers for product-related questions has become a crucial task. In this article, we focus on the task of product-aware answer generation, which learns to generate an accurate and complete answer from large-scale unlabeled e-commerce reviews and product attributes. However, the safe answer problem (i.e., neural models tend to generate meaningless and universal answers) poses a significant challenge for text generation tasks, and the e-commerce question-answering task is no exception. To generate more meaningful answers, we propose a novel generative neural model, called the Meaningful Product Answer Generator (MPAG), which alleviates the safe answer problem by taking product reviews, product attributes, and a prototype answer into consideration. Product reviews and product attributes are used to provide meaningful content, while the prototype answer can yield a more diverse answer pattern. To this end, we propose a novel answer generator with a review reasoning module and a prototype answer reader. Our key idea is to obtain the correct question-aware information from a large-scale collection of reviews and learn how to write a coherent and meaningful answer from an existing prototype answer. More specifically, we propose a read-and-write memory consisting of selective writing units to conduct reasoning among the reviews. We then employ a prototype reader consisting of comprehensive matching to extract the answer skeleton from the prototype answer. Finally, we propose an answer editor to generate the final answer by taking the question and the above parts as input. Extensive experiments conducted on a real-world dataset collected from an e-commerce platform show that our model achieves state-of-the-art performance in terms of both automatic metrics and human evaluations; human evaluation also demonstrates that our model consistently generates specific and proper answers.
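The selective-writing idea can be sketched as a gated memory update: each review is written into attended memory slots only to the degree a question-conditioned gate allows. The slot count, gating, and addressing below are invented for illustration, not MPAG's actual units.

```python
import torch
import torch.nn as nn

class SelectiveWriteMemory(nn.Module):
    """Toy read-and-write memory: each review update is gated, so only
    question-relevant review content is written into the memory slots."""
    def __init__(self, slots: int, dim: int):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, dim) * 0.1)
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, reviews, question):
        # reviews: (R, dim) encoded reviews; question: (dim,) encoded question
        mem = self.memory
        for r in reviews:
            rel = torch.sigmoid(self.gate(torch.cat([r, question])))  # write gate
            attn = torch.softmax(mem @ r, dim=0).unsqueeze(-1)        # slot address
            mem = mem + rel * attn * r                                # selective write
        return mem

mem = SelectiveWriteMemory(slots=4, dim=32)
print(mem(torch.randn(10, 32), torch.randn(32)).shape)  # torch.Size([4, 32])
```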


2021, Vol 11 (15), pp. 6975
Author(s): Tao Zhang, Lun He, Xudong Li, Guoqing Feng

Lipreading aims to recognize the sentences being spoken by a talking face. In recent years, lipreading methods have achieved high accuracy on large datasets and made breakthrough progress. However, lipreading is still far from solved: existing methods tend to have high error rates on in-the-wild data and suffer from vanishing training gradients and slow convergence. To overcome these problems, we propose an efficient end-to-end sentence-level lipreading model that uses an encoder based on a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder. Notably, the proposed architecture incorporates the TCN as a feature learner to decode features. This partly eliminates the vanishing-gradient and underperformance issues of RNNs (LSTM, GRU), yielding notable performance improvements as well as faster convergence. Experiments show that training and convergence are 50% faster than the state-of-the-art method, and accuracy improves by 2.4% on the GRID dataset.
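A rough skeleton of the described pipeline follows, with the ResNet50 trunk collapsed into a single 3D conv to stay self-contained; all layer sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LipreadNet(nn.Module):
    """Sketch: 3D-conv front end over the mouth-region video, a dilated
    TCN-style temporal stack, and a linear head trained with CTC."""
    def __init__(self, n_chars: int, dim: int = 64):
        super().__init__()
        self.front = nn.Conv3d(3, dim, kernel_size=(5, 7, 7),
                               stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis
        self.tcn = nn.Sequential(                       # dilated temporal convs
            nn.Conv1d(dim, dim, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=2, dilation=2), nn.ReLU(),
        )
        self.head = nn.Linear(dim, n_chars + 1)         # +1 for the CTC blank

    def forward(self, video):                           # video: (B, 3, T, H, W)
        x = self.pool(torch.relu(self.front(video))).squeeze(-1).squeeze(-1)
        x = self.tcn(x).transpose(1, 2)                 # (B, T, dim)
        return self.head(x).log_softmax(-1)             # CTC expects log-probs

net = LipreadNet(n_chars=27)
logp = net(torch.randn(2, 3, 40, 64, 128))              # (B, T, n_chars+1)
ctc = nn.CTCLoss(blank=27)
targets = torch.randint(0, 27, (2, 12))
loss = ctc(logp.transpose(0, 1), targets,
           torch.full((2,), 40, dtype=torch.long),
           torch.full((2,), 12, dtype=torch.long))
print(logp.shape, loss.item())
```

The design point the abstract makes is visible here: the temporal stack is purely convolutional, so gradients flow through fixed-depth conv layers rather than through an unrolled recurrence.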


2021, Vol 11 (9), pp. 4241
Author(s): Jiahua Wu, Hyo Jong Lee

In bottom-up multi-person pose estimation, grouping joint candidates into the correctly structured instance of each person is challenging. In this paper, a new bottom-up method, the Partitioned CenterPose (PCP) Network, is proposed to better cluster the detected joints. To achieve this goal, we propose a novel approach called Partition Pose Representation (PPR), which integrates a person instance and its body joints based on joint offsets. PPR leverages the center of the human body and the offsets between that center point and the positions of the body's joints to encode human poses accurately. To strengthen the relationships between body joints, we divide the human body into five parts and generate a sub-PPR for each part. Based on this PPR, the PCP Network can detect people and their body joints simultaneously, then group all body joints according to joint offset. Moreover, an improved L1 loss is designed to measure joint offsets more accurately. Testing on the COCO keypoints and CrowdPose datasets shows that the performance of the proposed method is on par with existing state-of-the-art bottom-up methods in terms of both accuracy and speed.
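The joint-offset grouping reduces to simple arithmetic: a joint's position is its person's center plus a predicted offset, and the offsets are trained with an L1-type loss. Below is a plain-L1 toy version; the paper's improved variant re-weights this, and the exact scheme is not reproduced here.

```python
import torch

def decode_pose(centers, offsets):
    """Recover joints as center + offset, one pose per detected person.
    centers: (P, 2) person center coordinates
    offsets: (P, J, 2) predicted offset from each center to each joint"""
    return centers[:, None, :] + offsets       # (P, J, 2) joint positions

def offset_l1(pred_offsets, gt_offsets):
    """Plain L1 on offsets; a stand-in for the paper's improved L1 loss."""
    return (pred_offsets - gt_offsets).abs().mean()

centers = torch.tensor([[100.0, 80.0]])        # one person
offsets = torch.zeros(1, 17, 2)                # 17 COCO-style joints
offsets[0, 0] = torch.tensor([0.0, -40.0])
print(decode_pose(centers, offsets)[0, 0])     # head joint at (100., 40.)
print(offset_l1(offsets, torch.zeros(1, 17, 2)))
```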

