Vision and Language
Recently Published Documents

TOTAL DOCUMENTS: 151 (FIVE YEARS: 93)
H-INDEX: 14 (FIVE YEARS: 5)

2021 ◽ Author(s): Tahiya Chowdhury ◽ Qizhen Ding ◽ Ilan Mandel ◽ Wendy Ju ◽ Jorge Ortiz

2021 ◽ pp. 1-19 ◽ Author(s): Marcella Cornia ◽ Lorenzo Baraldi ◽ Rita Cucchiara

Image Captioning is the task of translating an input image into a textual description. As such, it connects Vision and Language in a generative fashion, with applications ranging from multi-modal search engines to assistance for visually impaired people. Although recent years have witnessed an increase in the accuracy of such models, this has also brought increasing complexity and new challenges in interpretability and visualization. In this work, we focus on Transformer-based image captioning models and provide qualitative and quantitative tools to increase interpretability and to assess the grounding and temporal alignment capabilities of such models. First, we employ attribution methods to visualize what the model concentrates on in the input image at each step of the generation. Further, we propose metrics to evaluate the temporal alignment between model predictions and attribution scores, which allows us to measure the grounding capabilities of the model and to spot hallucination flaws. Experiments are conducted on three different Transformer-based architectures, employing both traditional and Vision Transformer-based visual features.
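
As a concrete illustration of the step-wise analysis described above, here is a minimal sketch assuming a toy PyTorch captioner; TinyCaptioner, its sizes, and the gradient-times-input scoring are illustrative assumptions rather than the authors' method. For every generated token, it computes a relevance score over the image regions the decoder cross-attends to.

```python
# Minimal sketch (not the authors' released code) of step-wise attribution for a
# Transformer captioning model: gradient-times-input scores over image regions
# for each generated token. TinyCaptioner and all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=100, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, token_ids):
        tgt = self.embed(token_ids)                      # (B, T, d_model)
        hidden = self.decoder(tgt, memory=visual_feats)  # cross-attention to image regions
        return self.lm_head(hidden)                      # (B, T, vocab_size)

def stepwise_attribution(model, visual_feats, token_ids):
    """Gradient-x-input relevance of every image region for each generated token."""
    scores = []
    for t in range(1, token_ids.size(1)):
        feats = visual_feats.clone().requires_grad_(True)
        logits = model(feats, token_ids[:, :t])
        token_logit = logits[0, -1, token_ids[0, t]]     # score of the word produced at step t
        grad, = torch.autograd.grad(token_logit, feats)
        scores.append((grad * feats).sum(-1).squeeze(0).detach())  # one score per region
    return torch.stack(scores)                           # (T-1, n_regions)

model = TinyCaptioner()
model.eval()                               # disable dropout for deterministic attributions
regions = torch.randn(1, 49, 64)           # e.g. a 7x7 grid of visual features
caption = torch.randint(0, 100, (1, 6))    # token ids of a (hypothetical) generated caption
print(stepwise_attribution(model, regions, caption).shape)  # torch.Size([5, 49])
```

Aggregating such per-step scores over the regions that actually contain the mentioned object is one way to quantify grounding and temporal alignment; tokens whose attribution points at no relevant region are candidate hallucinations.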


2021 ◽ Author(s): Dong An ◽ Yuankai Qi ◽ Yan Huang ◽ Qi Wu ◽ Liang Wang ◽ ... ◽ Keyword(s):

2021 ◽ Author(s): Yuhao Cui ◽ Zhou Yu ◽ Chunqi Wang ◽ Zhongzhou Zhao ◽ Ji Zhang ◽ ...

2021 ◽ Author(s): Masoud Monajatipoor ◽ Mozhdeh Rouhsedaghat ◽ Liunian Harold Li ◽ Aichi Chien ◽ C.-C. Jay Kuo ◽ ...

2021 ◽ Author(s): Yifeng Jiang ◽ Michelle Guo ◽ Jiangshan Li ◽ Ioannis Exarchos ◽ Jiajun Wu ◽ ... ◽ Keyword(s):

2021 ◽ Vol 71 ◽ pp. 1183-1317 ◽ Author(s): Aditya Mogadala ◽ Marimuthu Kalimuthu ◽ Dietrich Klakow

Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to advances in sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible by deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, and evaluation measures, and by comparing the results obtained with corresponding state-of-the-art methods. Our effort goes beyond earlier surveys, which are either task-specific or concentrate on only one type of visual content, i.e., image or video. Furthermore, we outline some potential future directions for this field of research, in the hope that this survey will stimulate innovative thoughts and ideas to address the existing challenges and to build new applications.


Author(s): Shagun Uppal ◽ Sarthak Bhagat ◽ Devamanyu Hazarika ◽ Navonil Majumder ◽ Soujanya Poria ◽ ...

Author(s): Chenyu Gao ◽ Qi Zhu ◽ Peng Wang ◽ Qi Wu

Vision-and-Language (VL) pre-training has shown great potential on many related downstream tasks, such as Visual Question Answering (VQA), one of the most popular problems in the VL field. These pre-trained models (such as VisualBERT, ViLBERT, LXMERT, and UNITER) are built on the Transformer, which extends the classical attention mechanism to multiple layers and heads. To investigate why and how these models work so well on VQA, in this paper we explore the roles of individual heads and layers in Transformer models when handling 12 different types of questions. Specifically, we manually remove (chop) heads or layers from a pre-trained VisualBERT model one at a time and test it on questions of different levels to record its performance. The experiments reveal, as shown by the echelon shape of the resulting matrices, that different heads and layers are responsible for different question types, with higher-level layers activated by higher-level visual reasoning questions. Based on this observation, we design a dynamic chopping module that can automatically remove heads and layers of VisualBERT at an instance level when dealing with different questions. Our dynamic chopping module can reduce the parameters of the original model by 50% while lowering accuracy by less than 1% on the VQA task.
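
To make the head-removal idea concrete, here is a minimal sketch assuming a stand-alone self-attention layer; MaskableSelfAttention and its sizes are hypothetical, and this is not the paper's code. A binary head mask zeroes out chosen heads before the output projection, so a head's contribution can be "chopped" and the change in output measured.

```python
# Minimal sketch (an illustration, not the paper's dynamic chopping module) of head
# "chopping": a binary mask zeroes out selected attention heads before the output
# projection, removing their contribution to the layer output.
import torch
import torch.nn as nn

class MaskableSelfAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, head_mask=None):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, n_heads, T, d_head)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        context = attn @ v                                  # per-head context vectors
        if head_mask is not None:                           # "chop": zero out masked heads
            context = context * head_mask.view(1, self.n_heads, 1, 1)
        return self.out(context.transpose(1, 2).reshape(B, T, -1))

layer = MaskableSelfAttention()
x = torch.randn(2, 10, 64)
full = layer(x, head_mask=torch.ones(4))
chopped = layer(x, head_mask=torch.tensor([1.0, 1.0, 0.0, 1.0]))  # ablate head index 2
print((full - chopped).abs().mean())  # non-zero: head 2 no longer contributes
```

Sweeping such a mask over heads (and analogously skipping whole layers) for each question type is the kind of ablation behind the echelon-shaped result matrices; an instance-level chopping module would, roughly speaking, make this mask input-dependent and learned.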

