Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning

The task of image captioning involves the generation of a sentence that can describe an image appropriately, which is the intersection of computer vision and natural language. Although the research on remote sensing image captions has just started, it has great significance. The attention mechanism is inspired by the way humans think, which is widely used in remote sensing image caption tasks. However, the attention mechanism currently used in this task is mainly aimed at images, which is too simple to express such a complex task well. Therefore, in this paper, we propose a multi-level attention model, which is a closer imitation of attention mechanisms of human beings. This model contains three attention structures, which represent the attention to different areas of the image, the attention to different words, and the attention to vision and semantics. Experiments show that our model has achieved better results than before, which is currently state-of-the-art. In addition, the existing datasets for remote sensing image captioning contain a large number of errors. Therefore, in this paper, a lot of work has been done to modify the existing datasets in order to promote the research of remote sensing image captioning.

Download Full-text

Toward Remote Sensing Image Retrieval Under a Deep Image Captioning Perspective

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing ◽

10.1109/jstars.2020.3013818 ◽

2020 ◽

Vol 13 ◽

pp. 4462-4475 ◽

Cited By ~ 1

Author(s):

Genc Hoxha ◽

Farid Melgani ◽

Begum Demir

Keyword(s):

Remote Sensing ◽

Image Retrieval ◽

Remote Sensing Image ◽

Image Captioning ◽

Deep Image

Download Full-text

VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning

IEEE Access ◽

10.1109/access.2019.2942154 ◽

2019 ◽

Vol 7 ◽

pp. 137355-137364 ◽

Cited By ~ 1

Author(s):

Zhengyuan Zhang ◽

Wenkai Zhang ◽

Wenhui Diao ◽

Menglong Yan ◽

Xin Gao ◽

...

Keyword(s):

Remote Sensing ◽

Remote Sensing Image ◽

Image Captioning ◽

Attention Model

Download Full-text

Remote Sensing Image Captioning with Continuous Output Neural Models

10.1145/3474717.3483631 ◽

2021 ◽

Author(s):

Rita Ramos ◽

Bruno Martins

Keyword(s):

Remote Sensing ◽

Remote Sensing Image ◽

Neural Models ◽

Image Captioning ◽

Continuous Output

Download Full-text

Research On The Classification Of High Resolution Image Based On Object-oriented And Class Rule

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprsarchives-xl-7-w4-75-2015 ◽

2015 ◽

Vol XL-7/W4 ◽

pp. 75-80 ◽

Cited By ~ 1

Author(s):

C. K. Li ◽

W. Fang ◽

X. J. Dong

Keyword(s):

Remote Sensing ◽

High Resolution ◽

Image Classification ◽

Object Oriented ◽

Remote Sensing Data ◽

Image Data ◽

Remote Sensing Image ◽

Classification Method ◽

Multi Level ◽

Image Object

With the development of remote sensing technology, the spatial resolution, spectral resolution and time resolution of remote sensing data is greatly improved. How to efficiently process and interpret the massive high resolution remote sensing image data for ground objects, which with spatial geometry and texture information, has become the focus and difficulty in the field of remote sensing research. An object oriented and rule of the classification method of remote sensing data has presents in this paper. Through the discovery and mining the rich knowledge of spectrum and spatial characteristics of high-resolution remote sensing image, establish a multi-level network image object segmentation and classification structure of remote sensing image to achieve accurate and fast ground targets classification and accuracy assessment. Based on worldview-2 image data in the Zangnan area as a study object, using the object-oriented image classification method and rules to verify the experiment which is combination of the mean variance method, the maximum area method and the accuracy comparison to analysis, selected three kinds of optimal segmentation scale and established a multi-level image object network hierarchy for image classification experiments. The results show that the objectoriented rules classification method to classify the high resolution images, enabling the high resolution image classification results similar to the visual interpretation of the results and has higher classification accuracy. The overall accuracy and Kappa coefficient of the object-oriented rules classification method were 97.38%, 0.9673; compared with object-oriented SVM method, respectively higher than 6.23%, 0.078; compared with object-oriented KNN method, respectively more than 7.96%, 0.0996. The extraction precision and user accuracy of the building compared with object-oriented SVM method, respectively higher than 18.39%, 3.98%, respectively better than the object-oriented KNN method 21.27%, 14.97%.

Download Full-text

LAM: Remote Sensing Image Captioning with Label-Attention Mechanism

Remote Sensing ◽

10.3390/rs11202349 ◽

2019 ◽

Vol 11 (20) ◽

pp. 2349 ◽

Cited By ~ 2

Author(s):

Zhengyuan Zhang ◽

Wenhui Diao ◽

Wenkai Zhang ◽

Menglong Yan ◽

Xin Gao ◽

...

Keyword(s):

Remote Sensing ◽

High Resolution ◽

Remote Sensing Image ◽

Image Features ◽

Attention Mechanism ◽

Word Embedding ◽

Remote Sensing Images ◽

Image Captioning ◽

Scoring Method ◽

Label Information

Significant progress has been made in remote sensing image captioning by encoder-decoder frameworks. The conventional attention mechanism is prevalent in this task but still has some drawbacks. The conventional attention mechanism only uses visual information about the remote sensing images without considering using the label information to guide the calculation of attention masks. To this end, a novel attention mechanism, namely Label-Attention Mechanism (LAM), is proposed in this paper. LAM additionally utilizes the label information of high-resolution remote sensing images to generate natural sentences to describe the given images. It is worth noting that, instead of high-level image features, the predicted categories’ word embedding vectors are adopted to guide the calculation of attention masks. Representing the content of images in the form of word embedding vectors can filter out redundant image features. In addition, it can also preserve pure and useful information for generating complete sentences. The experimental results from UCM-Captions, Sydney-Captions and RSICD demonstrate that LAM can improve the model’s performance for describing high-resolution remote sensing images and obtain better S m scores compared with other methods. S m score is a hybrid scoring method derived from the AI Challenge 2017 scoring method. In addition, the validity of LAM is verified by the experiment of using true labels.

Download Full-text

Boosting Memory with a Persistent Memory Mechanism for Remote Sensing Image Captioning

Remote Sensing ◽

10.3390/rs12111874 ◽

2020 ◽

Vol 12 (11) ◽

pp. 1874

Author(s):

Kun Fu ◽

Yang Li ◽

Wenkai Zhang ◽

Hongfeng Yu ◽

Xian Sun

Keyword(s):

Remote Sensing ◽

Remote Sensing Image ◽

External Memory ◽

Image Captioning ◽

Time Step ◽

Current Time ◽

Input Information ◽

Memory Mechanism ◽

Persistent Memory

The encoder–decoder framework has been widely used in the remote sensing image captioning task. When we need to extract remote sensing images containing specific characteristics from the described sentences for research, rich sentences can improve the final extraction results. However, the Long Short-Term Memory (LSTM) network used in decoders still loses some information in the picture over time when the generated caption is long. In this paper, we present a new model component named the Persistent Memory Mechanism (PMM), which can expand the information storage capacity of LSTM with an external memory. The external memory is a memory matrix with a predetermined size. It can store all the hidden layer vectors of LSTM before the current time step. Thus, our method can effectively solve the above problem. At each time step, the PMM searches previous information related to the input information at the current time from the external memory. Then the PMM will process the captured long-term information and predict the next word with the current information. In addition, it updates its memory with the input information. This method can pick up the long-term information missed from the LSTM but useful to the caption generation. By applying this method to image captioning, our CIDEr scores on datasets UCM-Captions, Sydney-Captions, and RSICD increased by 3%, 5%, and 7%, respectively.

Download Full-text