Reference Based on Adaptive Attention Mechanism for Image Captioning*

Significant progress has been made in remote sensing image captioning by encoder-decoder frameworks. The conventional attention mechanism is prevalent in this task but still has some drawbacks. The conventional attention mechanism only uses visual information about the remote sensing images without considering using the label information to guide the calculation of attention masks. To this end, a novel attention mechanism, namely Label-Attention Mechanism (LAM), is proposed in this paper. LAM additionally utilizes the label information of high-resolution remote sensing images to generate natural sentences to describe the given images. It is worth noting that, instead of high-level image features, the predicted categories’ word embedding vectors are adopted to guide the calculation of attention masks. Representing the content of images in the form of word embedding vectors can filter out redundant image features. In addition, it can also preserve pure and useful information for generating complete sentences. The experimental results from UCM-Captions, Sydney-Captions and RSICD demonstrate that LAM can improve the model’s performance for describing high-resolution remote sensing images and obtain better S m scores compared with other methods. S m score is a hybrid scoring method derived from the AI Challenge 2017 scoring method. In addition, the validity of LAM is verified by the experiment of using true labels.

Download Full-text

The Role of Attention Mechanism and Multi-Feature in Image Captioning

Proceedings of the 3rd International Conference on Machine Learning and Soft Computing - ICMLSC 2019 ◽

10.1145/3310986.3311002 ◽

2019 ◽

Author(s):

Tien X. Dang ◽

Aran Oh ◽

In-Seop Na ◽

Soo-Hyung Kim

Keyword(s):

Attention Mechanism ◽

Image Captioning

Download Full-text

MemCap: Memorizing Style Knowledge for Image Captioning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6998 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12984-12992 ◽

Cited By ~ 2

Author(s):

Wentian Zhao ◽

Xinxiao Wu ◽

Xiaoxun Zhang

Keyword(s):

Language Model ◽

Attention Mechanism ◽

Visual Content ◽

Image Captioning ◽

Memory Module ◽

Training Corpus ◽

Memory Mechanism ◽

Linguistic Style ◽

Related Part ◽

With Memory

Generating stylized captions for images is a challenging task since it requires not only describing the content of the image accurately but also expressing the desired linguistic style appropriately. In this paper, we propose MemCap, a novel stylized image captioning method that explicitly encodes the knowledge about linguistic styles with memory mechanism. Rather than relying heavily on a language model to capture style factors in existing methods, our method resorts to memorizing stylized elements learned from training corpus. Particularly, we design a memory module that comprises a set of embedding vectors for encoding style-related phrases in training corpus. To acquire the style-related phrases, we develop a sentence decomposing algorithm that splits a stylized sentence into a style-related part that reflects the linguistic style and a content-related part that contains the visual content. When generating captions, our MemCap first extracts content-relevant style knowledge from the memory module via an attention mechanism and then incorporates the extracted knowledge into a language model. Extensive experiments on two stylized image captioning datasets (SentiCap and FlickrStyle10K) demonstrate the effectiveness of our method.

Download Full-text

Image Description using Attention Mechanism

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1555.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 3290-3293

Keyword(s):

Image Understanding ◽

Attention Mechanism ◽

Evaluation Metrics ◽

Image Description ◽

Image Captioning ◽

Object Type ◽

Complete Understanding ◽

Textual Description ◽

Image Descriptions

Image Description involves generating a textual description of images which is essential for the problem of image understanding. The variable and ambiguous nature of possible image descriptions make this task challenging. There are different approaches for automated image captioning which explain the image contents along with a complete understanding of the image, rather than just simply classifying it into a particular object type. However, learning image contexts from the text and generating image descriptions similar to human's description requires to focus on important features of the image using attention mechanism. We provide an outline of the various recent works in image description models employing various attention mechanism. We present an analysis of the various approaches, datasets and evaluation metrics that are utilized for image description. We showcase a model using the encoder-decoder attention mechanism based on Flickr dataset and evaluate the performance using BLEU metrics.

Download Full-text

Bi-Directional Co-Attention Network for Image Captioning

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3460474 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-20

Author(s):

Weitao Jiang ◽

Weixuan Wang ◽

Haifeng Hu

Keyword(s):

A Priori ◽

Attention Mechanism ◽

Superior Performance ◽

Significant Advance ◽

Visual Features ◽

Image Captioning ◽

Top Down ◽

Bottom Up ◽

Attention Network ◽

Benchmark Datasets

Image Captioning, which automatically describes an image with natural language, is regarded as a fundamental challenge in computer vision. In recent years, significant advance has been made in image captioning through improving attention mechanism. However, most existing methods construct attention mechanisms based on singular visual features, such as patch features or object features, which limits the accuracy of generated captions. In this article, we propose a Bidirectional Co-Attention Network (BCAN) that combines multiple visual features to provide information from different aspects. Different features are associated with predicting different words, and there are a priori relations between these multiple visual features. Based on this, we further propose a bottom-up and top-down bi-directional co-attention mechanism to extract discriminative attention information. Furthermore, most existing methods do not exploit an effective multimodal integration strategy, generally using addition or concatenation to combine features. To solve this problem, we adopt the Multivariate Residual Module (MRM) to integrate multimodal attention features. Meanwhile, we further propose a Vertical MRM to integrate features of the same category, and a Horizontal MRM to combine features of the different categories, which can balance the contribution of the bottom-up co-attention and the top-down co-attention. In contrast to the existing methods, the BCAN is able to obtain complementary information from multiple visual features via the bi-directional co-attention strategy, and integrate multimodal information via the improved multivariate residual strategy. We conduct a series of experiments on two benchmark datasets (MSCOCO and Flickr30k), and the results indicate that the proposed BCAN achieves the superior performance.

Download Full-text

Structure Preserving Convolutional Attention for Image Captioning

Applied Sciences ◽

10.3390/app9142888 ◽

2019 ◽

Vol 9 (14) ◽

pp. 2888 ◽

Cited By ~ 1

Author(s):

Shichen Lu ◽

Ruimin Hu ◽

Jing Liu ◽

Longteng Guo ◽

Fei Zheng

Keyword(s):

Spatial Structure ◽

Spatial Attention ◽

Large Scale ◽

Attention Mechanism ◽

Vector Representation ◽

Image Captioning ◽

Feature Maps ◽

Structure Preserving ◽

Convolution Operation

In the task of image captioning, learning the attentive image regions is necessary to adaptively and precisely focus on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that can preserve the spatial structure of the image by performing the convolution operation directly on the 2D feature maps. The proposed attention mechanism contains two components: convolutional spatial attention and cross-channel attention, aiming to determine the intended regions to describe the image along the spatial and channel dimensions, respectively. Both of the two attentions are calculated at each decoding step. In order to preserve the spatial structure, instead of operating on the vector representation of each image grid, the two attention components are both computed directly on the entire feature maps with convolution operations. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the outstanding performance of our proposed method.

Download Full-text

Description Generation for Remote Sensing Images Using Attribute Attention Mechanism

Remote Sensing ◽

10.3390/rs11060612 ◽

2019 ◽

Vol 11 (6) ◽

pp. 612 ◽

Cited By ~ 17

Author(s):

Xiangrong Zhang ◽

Xin Wang ◽

Xu Tang ◽

Huiyu Zhou ◽

Chen Li

Keyword(s):

Remote Sensing ◽

Remote Sensing Image ◽

Image Understanding ◽

Attention Mechanism ◽

Semantic Description ◽

Remote Sensing Images ◽

Image Captioning ◽

Proposed Model ◽

High Level ◽

The Impact

Image captioning generates a semantic description of an image. It deals with image understanding and text mining, which has made great progress in recent years. However, it is still a great challenge to bridge the “semantic gap” between low-level features and high-level semantics in remote sensing images, in spite of the improvement of image resolutions. In this paper, we present a new model with an attribute attention mechanism for the description generation of remote sensing images. Therefore, we have explored the impact of the attributes extracted from remote sensing images on the attention mechanism. The results of our experiments demonstrate the validity of our proposed model. The proposed method obtains six higher scores and one slightly lower, compared against several state of the art techniques, on the Sydney Dataset and Remote Sensing Image Caption Dataset (RSICD), and receives all seven higher scores on the UCM Dataset for remote sensing image captioning, indicating that the proposed framework achieves robust performance for semantic description in high-resolution remote sensing images.

Download Full-text