scholarly journals Image Description using Attention Mechanism

2019 ◽  
Vol 8 (2S11) ◽  
pp. 3290-3293

Image Description involves generating a textual description of images which is essential for the problem of image understanding. The variable and ambiguous nature of possible image descriptions make this task challenging. There are different approaches for automated image captioning which explain the image contents along with a complete understanding of the image, rather than just simply classifying it into a particular object type. However, learning image contexts from the text and generating image descriptions similar to human's description requires to focus on important features of the image using attention mechanism. We provide an outline of the various recent works in image description models employing various attention mechanism. We present an analysis of the various approaches, datasets and evaluation metrics that are utilized for image description. We showcase a model using the encoder-decoder attention mechanism based on Flickr dataset and evaluate the performance using BLEU metrics.

Author(s):  
Santosh Kumar Mishra ◽  
Gaurav Rai ◽  
Sriparna Saha ◽  
Pushpak Bhattacharyya

Image captioning refers to the process of generating a textual description that describes objects and activities present in a given image. It connects two fields of artificial intelligence, computer vision, and natural language processing. Computer vision and natural language processing deal with image understanding and language modeling, respectively. In the existing literature, most of the works have been carried out for image captioning in the English language. This article presents a novel method for image captioning in the Hindi language using encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism with bahdanau attention and a gated recurrent unit for developing an image captioning model in the Hindi language. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on an image’s important channel while performing the convolution, which is basically to assign higher importance to specific channels over others. The channel attention mechanism has been shown to have great potential for improving the efficiency of deep convolution neural networks (CNNs). The proposed encoder–decoder architecture utilizes the recently introduced ECA-NET CNN to integrate the channel attention mechanism. Hindi is the fourth most spoken language globally, widely spoken in India and South Asia; it is India’s official language. By translating the well-known MSCOCO dataset from English to Hindi, a dataset for image captioning in Hindi is manually created. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results obtained illustrate that the method proposed outperforms other baselines. The proposed method has attained improvements of 0.59%, 2.51%, 4.38%, and 3.30% in terms of BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, with respect to the state-of-the-art. Qualities of the generated captions are further assessed manually in terms of adequacy and fluency to illustrate the proposed method’s efficacy.


2019 ◽  
Vol 11 (6) ◽  
pp. 612 ◽  
Author(s):  
Xiangrong Zhang ◽  
Xin Wang ◽  
Xu Tang ◽  
Huiyu Zhou ◽  
Chen Li

Image captioning generates a semantic description of an image. It deals with image understanding and text mining, which has made great progress in recent years. However, it is still a great challenge to bridge the “semantic gap” between low-level features and high-level semantics in remote sensing images, in spite of the improvement of image resolutions. In this paper, we present a new model with an attribute attention mechanism for the description generation of remote sensing images. Therefore, we have explored the impact of the attributes extracted from remote sensing images on the attention mechanism. The results of our experiments demonstrate the validity of our proposed model. The proposed method obtains six higher scores and one slightly lower, compared against several state of the art techniques, on the Sydney Dataset and Remote Sensing Image Caption Dataset (RSICD), and receives all seven higher scores on the UCM Dataset for remote sensing image captioning, indicating that the proposed framework achieves robust performance for semantic description in high-resolution remote sensing images.


2013 ◽  
Vol 47 ◽  
pp. 853-899 ◽  
Author(s):  
M. Hodosh ◽  
P. Young ◽  
J. Hockenmaier

The ability to associate images with natural language sentences that describe what is depicted in them is a hallmark of image understanding, and a prerequisite for applications such as sentence-based image search. In analogy to image search, we propose to frame sentence-based image annotation as the task of ranking a given pool of captions. We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. We introduce a number of systems that perform quite well on this task, even though they are only based on features that can be obtained with minimal supervision. Our results clearly indicate the importance of training on multiple captions per image, and of capturing syntactic (word order-based) and semantic features of these captions. We also perform an in-depth comparison of human and automatic evaluation metrics for this task, and propose strategies for collecting human judgments cheaply and on a very large scale, allowing us to augment our collection with additional relevance judgments of which captions describe which image. Our analysis shows that metrics that consider the ranked list of results for each query image or sentence are significantly more robust than metrics that are based on a single response per query. Moreover, our study suggests that the evaluation of ranking-based image description systems may be fully automated.


Author(s):  
Santosh Kumar Mishra ◽  
Rijul Dhir ◽  
Sriparna Saha ◽  
Pushpak Bhattacharyya

Image captioning is the process of generating a textual description of an image that aims to describe the salient parts of the given image. It is an important problem, as it involves computer vision and natural language processing, where computer vision is used for understanding images, and natural language processing is used for language modeling. A lot of works have been done for image captioning for the English language. In this article, we have developed a model for image captioning in the Hindi language. Hindi is the official language of India, and it is the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in the Hindi language. A dataset is manually created by translating well known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in the Hindi language. These attention mechanisms are new for the Hindi language, as those have never been used for the Hindi language. The obtained results of the proposed model are compared with several baselines in terms of BLEU scores, and the results show that our model performs better than others. Manual evaluation of the obtained captions in terms of adequacy and fluency also reveals the effectiveness of our proposed approach. Availability of resources : The codes of the article are available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language ; The dataset will be made available: http://www.iitp.ac.in/∼ai-nlp-ml/resources.html .


Author(s):  
Huimin Lu ◽  
Rui Yang ◽  
Zhenrong Deng ◽  
Yonglin Zhang ◽  
Guangwei Gao ◽  
...  

Chinese image description generation tasks usually have some challenges, such as single-feature extraction, lack of global information, and lack of detailed description of the image content. To address these limitations, we propose a fuzzy attention-based DenseNet-BiLSTM Chinese image captioning method in this article. In the proposed method, we first improve the densely connected network to extract features of the image at different scales and to enhance the model’s ability to capture the weak features. At the same time, a bidirectional LSTM is used as the decoder to enhance the use of context information. The introduction of an improved fuzzy attention mechanism effectively improves the problem of correspondence between image features and contextual information. We conduct experiments on the AI Challenger dataset to evaluate the performance of the model. The results show that compared with other models, our proposed model achieves higher scores in objective quantitative evaluation indicators, including BLEU , BLEU , METEOR, ROUGEl, and CIDEr. The generated description sentence can accurately express the image content.


2020 ◽  
Vol 167 ◽  
pp. 107329 ◽  
Author(s):  
Shiyang Yan ◽  
Yuan Xie ◽  
Fangyu Wu ◽  
Jeremy S. Smith ◽  
Wenjin Lu ◽  
...  

Author(s):  
Wei Zhao ◽  
Benyou Wang ◽  
Jianbo Ye ◽  
Min Yang ◽  
Zhou Zhao ◽  
...  

In this paper, we propose a Multi-task Learning Approach for Image Captioning (MLAIC ), motivated by the fact that humans have no difficulty performing such task because they possess capabilities of multiple domains. Specifically, MLAIC consists of three key components: (i) A multi-object classification model that learns rich category-aware image representations using a CNN image encoder; (ii) A syntax generation model that learns better syntax-aware LSTM based decoder; (iii) An image captioning model that generates image descriptions in text, sharing its CNN encoder and LSTM decoder with the object classification task and the syntax generation task, respectively. In particular, the image captioning model can benefit from the additional object categorization and syntax knowledge. To verify the effectiveness of our approach, we conduct extensive experiments on MS-COCO dataset. The experimental results demonstrate that our model achieves impressive results compared to other strong competitors.


2018 ◽  
Vol 8 (10) ◽  
pp. 1850 ◽  
Author(s):  
Zhibin Guan ◽  
Kang Liu ◽  
Yan Ma ◽  
Xu Qian ◽  
Tongkai Ji

Image caption generation is attractive research which focuses on generating natural language sentences to describe the visual content of a given image. It is an interdisciplinary subject combining computer vision (CV) and natural language processing (NLP). The existing image captioning methods are mainly focused on generating the final image caption directly, which may lose significant identification information of objects contained in the raw image. Therefore, we propose a new middle-level attribute-based language retouching (MLALR) method to solve this problem. Our proposed MLALR method uses the middle-level attributes predicted from the object regions to retouch the intermediate image description, which is generated by our language generation model. The advantage of our MLALR method is that it can correct descriptive errors in the intermediate image description and make the final image caption more accurate. Moreover, evaluation using benchmark datasets—MSCOCO, Flickr8K, and Flickr30K—validated the impressive performance of our MLALR method with evaluation metrics—BLEU, METEOR, ROUGE-L, CIDEr, and SPICE.


Sign in / Sign up

Export Citation Format

Share Document