Explaining transformer-based image captioning models: An empirical analysis

Image Captioning is the task of translating an input image into a textual description. As such, it connects Vision and Language in a generative fashion, with applications that range from multi-modal search engines to help visually impaired people. Although recent years have witnessed an increase in accuracy in such models, this has also brought increasing complexity and challenges in interpretability and visualization. In this work, we focus on Transformer-based image captioning models and provide qualitative and quantitative tools to increase interpretability and assess the grounding and temporal alignment capabilities of such models. Firstly, we employ attribution methods to visualize what the model concentrates on in the input image, at each step of the generation. Further, we propose metrics to evaluate the temporal alignment between model predictions and attribution scores, which allows measuring the grounding capabilities of the model and spot hallucination flaws. Experiments are conducted on three different Transformer-based architectures, employing both traditional and Vision Transformer-based visual features.

Download Full-text

A Hindi Image Caption Generation Framework Using Deep Learning

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3432246 ◽

2021 ◽

Vol 20 (2) ◽

pp. 1-19

Author(s):

Santosh Kumar Mishra ◽

Rijul Dhir ◽

Sriparna Saha ◽

Pushpak Bhattacharyya

Keyword(s):

Computer Vision ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

English Language ◽

Image Captioning ◽

Textual Description ◽

Proposed Model ◽

Hindi Language ◽

The Given

Image captioning is the process of generating a textual description of an image that aims to describe the salient parts of the given image. It is an important problem, as it involves computer vision and natural language processing, where computer vision is used for understanding images, and natural language processing is used for language modeling. A lot of works have been done for image captioning for the English language. In this article, we have developed a model for image captioning in the Hindi language. Hindi is the official language of India, and it is the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in the Hindi language. A dataset is manually created by translating well known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in the Hindi language. These attention mechanisms are new for the Hindi language, as those have never been used for the Hindi language. The obtained results of the proposed model are compared with several baselines in terms of BLEU scores, and the results show that our model performs better than others. Manual evaluation of the obtained captions in terms of adequacy and fluency also reveals the effectiveness of our proposed approach. Availability of resources : The codes of the article are available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language ; The dataset will be made available: http://www.iitp.ac.in/∼ai-nlp-ml/resources.html .

Download Full-text

Improved Framework using Rider Optimization Algorithm for Precise Image Caption Generation

International Journal of Image and Graphics ◽

10.1142/s0219467822500218 ◽

2021 ◽

pp. 2250021

Author(s):

Chaitrali Prasanna Chaudhari ◽

Satish Devane

Keyword(s):

Language Processing ◽

Optimization Algorithm ◽

Short Term Memory ◽

Activation Function ◽

Input Image ◽

Image Captioning ◽

Word Generation ◽

Adopted Model ◽

Conventional Models ◽

Image Caption

“Image Captioning is the process of generating a textual description of an image”. It deploys both computer vision and natural language processing for caption generation. However, the majority of the image captioning systems offer unclear depictions regarding the objects like “man”, “woman”, “group of people”, “building”, etc. Hence, this paper intends to develop an intelligent-based image captioning model. The adopted model comprises of few steps like word generation, sentence formation, and caption generation. Initially, the input image is subjected to the Deep learning classifier called Convolutional Neural Network (CNN). Since the classifier is already trained in the relevant words that are related to all images, it can easily classify the associated words of the given image. Further, a set of sentences is formed with the generated words using Long-Short Term Memory (LSTM) model. The likelihood of the formed sentences is computed using the Maximum Likelihood (ML) function, and the sentences with higher probability are taken, which is further used for generating the visual representation of the scene in terms of image caption. As a major novelty, this paper aims to enhance the performance of CNN by optimally tuning its weight and activation function. This paper introduces a new enhanced optimization algorithm Rider with Randomized Bypass and Over-taker update (RR-BOU) for this optimal selection. In the proposed RR-BOU is the enhanced version of the Rider Optimization Algorithm (ROA). Finally, the performance of the proposed captioning model is compared over other conventional models with respect to statistical analysis.

Download Full-text

Variational Autoencoder-Based Multiple Image Captioning Using a Caption Attention Map

Applied Sciences ◽

10.3390/app9132699 ◽

2019 ◽

Vol 9 (13) ◽

pp. 2699 ◽

Cited By ~ 4

Author(s):

Boeun Kim ◽

Saim Shin ◽

Hyedong Jung

Keyword(s):

Generative Models ◽

Research Topic ◽

Video Data ◽

Image Feature ◽

Image Captioning ◽

Multiple Image ◽

Visually Impaired People ◽

Proposed Model ◽

Variational Autoencoder ◽

Latent Distribution

Image captioning is a promising research topic that is applicable to services that search for desired content in a large amount of video data and a situation explanation service for visually impaired people. Previous research on image captioning has been focused on generating one caption per image. However, to increase usability in applications, it is necessary to generate several different captions that contain various representations for an image. We propose a method to generate multiple captions using a variational autoencoder, which is one of the generative models. Because an image feature plays an important role when generating captions, a method to extract a Caption Attention Map (CAM) of the image is proposed, and CAMs are projected to a latent distribution. In addition, methods for the evaluation of multiple image captioning tasks are proposed that have not yet been actively researched. The proposed model outperforms in the aspect of diversity compared with the base model when the accuracy is comparable. Moreover, it is verified that the model using CAM generates detailed captions describing various content in the image.

Download Full-text

Bidirectional Multimodal Recurrent Neural Networks with Refined Visual Features for Image Captioning

Communications in Computer and Information Science - Internet Multimedia Computing and Service ◽

10.1007/978-981-10-8530-7_8 ◽

2018 ◽

pp. 75-84

Author(s):

Yanwu Shu ◽

Liyan Zhang ◽

Zechao Li ◽

Jinhui Tang

Keyword(s):

Neural Networks ◽

Recurrent Neural Networks ◽

Visual Features ◽

Image Captioning

Download Full-text

Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking

KI - Künstliche Intelligenz ◽

10.1007/s13218-020-00679-2 ◽

2020 ◽

Vol 34 (4) ◽

pp. 571-584

Author(s):

Rajarshi Biswas ◽

Michael Barz ◽

Daniel Sonntag

Keyword(s):

State Of The Art ◽

Input Image ◽

The State ◽

Beam Search ◽

Image Captioning ◽

Bottom Up ◽

Interactive Machine Learning ◽

Joint Embedding ◽

Bounding Boxes ◽

High Level

AbstractImage captioning is a challenging multimodal task. Significant improvements could be obtained by deep learning. Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting their attention mechanism using additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance, while it provides explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state-of-the-art in image captioning.

Download Full-text

Saliency Cuts: Salient Region Extraction based on Local Adaptive Thresholding for Image Information Recognition of the Visually Impaired

The International Arab Journal of Information Technology ◽

10.34028/iajit/17/5/4 ◽

2020 ◽

Vol 17 (5) ◽

pp. 713-720

Author(s):

Mukhriddin Mukhiddinov ◽

Rag-Gyo Jeong ◽

Jinsoo Cho

Keyword(s):

Visually Impaired ◽

Saliency Map ◽

Input Image ◽

Adaptive Thresholding ◽

Binary Mask ◽

Salient Object ◽

Full Resolution ◽

Visually Impaired People ◽

Region Extraction ◽

Edge Detection Method

In recent years, there has been an increased scope for assistive software and technologies, which help the visually impaired to perceive and recognize natural scene images. In this article, we propose a novel saliency cuts approach using local adaptive thresholding to obtain four regions from a given saliency map. The saliency cuts approach is an effective tool for salient object detection. First, we produce four regions for image segmentation using a saliency map as an input image and applying an automatic threshold operation. Second, the four regions are used to initialize an iterative version of the Grab Cut algorithm and to produce a robust and high-quality binary mask with a full resolution. Lastly, based on the binary mask and extracted salient object, outer boundaries and internal edges are detected by Canny edge detection method. Extensive experiments demonstrate that the proposed method correctly detects and extracts the main contents of the image sequences for delivering visually salient information to the visually impaired people compared to the results of existing salient object segmentation algorithms

Download Full-text

Image Caption Generator

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.c8383.0110321 ◽

2021 ◽

Vol 10 (3) ◽

pp. 87-92

Author(s):

Megha J Panicker ◽

Vikas Upadhayay ◽

Gunjan Sethi ◽

Vrinda Mathur

Keyword(s):

Computer Vision ◽

Language Processing ◽

Network Models ◽

Image Captioning ◽

Neural Network Models ◽

Visually Impaired People ◽

Video Frames ◽

Modern Era ◽

Deep Learning Model ◽

Image Caption

In the modern era, image captioning has become one of the most widely required tools. Moreover, there are inbuilt applications that generate and provide a caption for a certain image, all these things are done with the help of deep neural network models. The process of generating a description of an image is called image captioning. It requires recognizing the important objects, their attributes, and the relationships among the objects in an image. It generates syntactically and semantically correct sentences.In this paper, we present a deep learning model to describe images and generate captions using computer vision and machine translation. This paper aims to detect different objects found in an image, recognize the relationships between those objects and generate captions. The dataset used is Flickr8k and the programming language used was Python3, and an ML technique called Transfer Learning will be implemented with the help of the Xception model, to demonstrate the proposed experiment. This paper will also elaborate on the functions and structure of the various Neural networks involved. Generating image captions is an important aspect of Computer Vision and Natural language processing. Image caption generators can find applications in Image segmentation as used by Facebook and Google Photos, and even more so, its use can be extended to video frames. They will easily automate the job of a person who has to interpret images. Not to mention it has immense scope in helping visually impaired people.

Download Full-text

Exploring and Distilling Cross-Modal Information for Image Captioning

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/708 ◽

2019 ◽

Cited By ~ 2

Author(s):

Fenglin Liu ◽

Xuancheng Ren ◽

Yuanxin Liu ◽

Kai Lei ◽

Xu Sun

Keyword(s):

Image Understanding ◽

Great Difficulty ◽

Source Information ◽

Image Captioning ◽

Fine Grained ◽

Deep Image ◽

Word Selection ◽

Global And Local ◽

Accuracy Speed ◽

Vision And Language

Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet there is still great difficulty for the current methods to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach that explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 in offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.

Download Full-text

Depth Estimation and Semantic Segmentation from a Single RGB Image Using a Hybrid Convolutional Neural Network

Sensors ◽

10.3390/s19081795 ◽

2019 ◽

Vol 19 (8) ◽

pp. 1795 ◽

Cited By ~ 5

Author(s):

Xiao Lin ◽

Dalila Sánchez-Escobedo ◽

Josep R. Casas ◽

Montse Pardàs

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

State Of The Art ◽

Semantic Segmentation ◽

Depth Estimation ◽

Input Image ◽

Estimation Accuracy ◽

Single Task ◽

Qualitative And Quantitative ◽

Highly Correlated

Semantic segmentation and depth estimation are two important tasks in computer vision, and many methods have been developed to tackle them. Commonly these two tasks are addressed independently, but recently the idea of merging these two problems into a sole framework has been studied under the assumption that integrating two highly correlated tasks may benefit each other to improve the estimation accuracy. In this paper, depth estimation and semantic segmentation are jointly addressed using a single RGB input image under a unified convolutional neural network. We analyze two different architectures to evaluate which features are more relevant when shared by the two tasks and which features should be kept separated to achieve a mutual improvement. Likewise, our approaches are evaluated under two different scenarios designed to review our results versus single-task and multi-task methods. Qualitative and quantitative experiments demonstrate that the performance of our methodology outperforms the state of the art on single-task approaches, while obtaining competitive results compared with other multi-task methods.

Download Full-text

Image Captioning: Generating Textual Description Using Modified Beam Search

SSRN Electronic Journal ◽

10.2139/ssrn.3991811 ◽

2021 ◽

Author(s):

Bagesh Kumar ◽

Divyansh Rai ◽

Arpit Agarwal ◽

Suhaib Khan ◽

Shourya S ◽

...

Keyword(s):

Beam Search ◽

Image Captioning ◽

Textual Description

Download Full-text