Fine-Grained Image Captioning with Global-Local Discriminative Objective

2019 ◽

Author(s):

Siying Wu ◽

Zheng-Jun Zha ◽

Zilei Wang ◽

Houqiang Li ◽

Feng Wu

Keyword(s):

Natural Language ◽

State Of The Art ◽

Cross Entropy ◽

Image Captioning ◽

Value Network ◽

Entropy Loss ◽

Fine Grained ◽

Performance Improvements ◽

Single Sentence ◽

Multiple State

Image paragraph generation aims to describe an image with a paragraph in natural language. Compared to image captioning with a single sentence, paragraph generation provides a more expressive and fine-grained description for storytelling. Existing approaches mainly optimize the paragraph generator by minimizing a word-wise cross-entropy loss, which neglects the linguistic hierarchy of a paragraph and results in "sparse" supervision for generator learning. In this paper, we propose a novel Densely Supervised Hierarchical Policy-Value (DHPV) network for effective paragraph generation. We design new hierarchical supervisions consisting of hierarchical rewards and values at both the sentence and word levels. The joint exploration of hierarchical rewards and values provides dense supervision cues for learning an effective paragraph generator. We propose a new hierarchical policy-value architecture that exploits compositionality at the token-to-token and sentence-to-sentence levels simultaneously and can preserve semantic and syntactic constituent integrity. Extensive experiments on the Stanford image-paragraph benchmark demonstrate the effectiveness of the proposed DHPV approach, with performance improvements over multiple state-of-the-art methods.
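The word-wise cross-entropy objective that the abstract contrasts itself with can be illustrated with a minimal sketch. This is not the DHPV method or the authors' code; the function name, the toy vocabulary of three words, and all probabilities are assumptions made up for illustration. It only shows that this loss scores each reference word independently, with no term at the sentence or paragraph level.

```python
import math


def wordwise_cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the reference words.

    step_probs: one predicted probability distribution (list of floats)
                per generated position.
    target_ids: vocabulary index of the reference word at each position.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is needed per target word")
    # Each position contributes independently: -log p(reference word).
    losses = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)


# Hypothetical toy data: a 3-word vocabulary and a 2-word reference.
toy_probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
toy_targets = [0, 1]
loss = wordwise_cross_entropy(toy_probs, toy_targets)
```

Because the loss is an average over positions, nothing in it reflects whether the words form coherent sentences, which is the gap the abstract attributes to this objective.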

High-Quality Image Captioning With Fine-Grained and Semantic-Guided Visual Attention

IEEE Transactions on Multimedia ◽

10.1109/tmm.2018.2888822 ◽

2019 ◽

Vol 21 (7) ◽

pp. 1681-1693 ◽

Cited By ~ 8

Author(s):

Zongjian Zhang ◽

Qiang Wu ◽

Yang Wang ◽

Fang Chen

Keyword(s):

Visual Attention ◽

Image Captioning ◽

High Quality ◽

Fine Grained ◽

High Quality Image ◽

Quality Image

Exploring and Distilling Cross-Modal Information for Image Captioning

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/708 ◽

2019 ◽

Cited By ~ 2

Author(s):

Fenglin Liu ◽

Xuancheng Ren ◽

Yuanxin Liu ◽

Kai Lei ◽

Xu Sun

Keyword(s):

Image Understanding ◽

Great Difficulty ◽

Source Information ◽

Image Captioning ◽

Fine Grained ◽

Deep Image ◽

Word Selection ◽

Global And Local ◽

Accuracy Speed ◽

Vision And Language

Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still have great difficulty achieving deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. Globally, it provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations; locally, it extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 in offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.

c-RNN: A Fine-Grained Language Model for Image Captioning

Neural Processing Letters ◽

10.1007/s11063-018-9836-2 ◽

2018 ◽

Vol 49 (2) ◽

pp. 683-691

Author(s):

Gengshi Huang ◽

Haifeng Hu

Keyword(s):

Language Model ◽

Image Captioning ◽

Fine Grained

Cascade Attention Fusion for Fine-Grained Image Captioning Based on Multi-Layer LSTM

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp39728.2021.9413691 ◽

2021 ◽

Author(s):

Shuang Wang ◽

Yun Meng ◽

Yu Gu ◽

Lei Zhang ◽

Xiutiao Ye ◽

...

Keyword(s):

Image Captioning ◽

Fine Grained

Federated Learning for Vision-and-Language Grounding Problems

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6824 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11572-11579 ◽

Cited By ~ 1

Author(s):

Fenglin Liu ◽

Xian Wu ◽

Shen Ge ◽

Wei Fan ◽

Yuexian Zou

Keyword(s):

Transfer Learning ◽

Question Answering ◽

Image Captioning ◽

Relative Gain ◽

Fine Grained ◽

Learning Framework ◽

Recent Success ◽

Language Grounding ◽

Image Representations ◽

Vision And Language

Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), have attracted extensive interest from both academia and industry. However, despite the similarity of these tasks, efforts to obtain better results by combining the merits of their algorithms have not been well studied. Inspired by the recent success of federated learning, we propose a federated learning framework to obtain various types of image representations from different tasks, which are then fused together to form fine-grained image representations. The representations merge useful features from different vision-and-language grounding problems, and are thus much more powerful than the original representations alone in individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). The aimNet is validated on three federated learning settings: horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments with the aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate effective and universal improvements on all metrics over the baselines. In image captioning, we obtain 14% and 13% relative gains on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we also boost the performance of strong baselines by up to 3%.
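For readers unfamiliar with the term, a relative gain is the improvement expressed as a fraction of the baseline score, not an absolute difference in metric points. The sketch below shows the arithmetic only; the baseline and improved scores are hypothetical placeholders chosen so the result is 14%, and they are not values reported by the paper.

```python
def relative_gain(baseline, improved):
    """Return the improvement as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline


# Hypothetical scores for illustration only (not taken from the paper).
hypothetical_baseline = 100.0
hypothetical_improved = 114.0
gain = relative_gain(hypothetical_baseline, hypothetical_improved)
gain_percent = round(gain * 100)
```

With these placeholder numbers an absolute increase of 14 points corresponds to a 14% relative gain; with a different baseline, the same absolute increase would give a different percentage.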

Fine-Grained and Semantic-Guided Visual Attention for Image Captioning

2018 IEEE Winter Conference on Applications of Computer Vision (WACV) ◽

10.1109/wacv.2018.00190 ◽

2018 ◽

Cited By ~ 3

Author(s):

Zongjian Zhang ◽

Qiang Wu ◽

Yang Wang ◽

Fang Chen

Keyword(s):

Visual Attention ◽

Image Captioning ◽

Fine Grained

REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning

10.18653/v1/d19-1156 ◽

2019 ◽

Author(s):

Ming Jiang ◽

Junjie Hu ◽

Qiuyuan Huang ◽

Lei Zhang ◽

Jana Diesner ◽

...

Keyword(s):

Image Captioning ◽

Fine Grained

Sequential Dual Attention: Coarse-to-Fine-Grained Hierarchical Generation for Image Captioning

Symmetry ◽

10.3390/sym10110626 ◽

2018 ◽

Vol 10 (11) ◽

pp. 626 ◽

Cited By ~ 1

Author(s):

Zhibin Guan ◽

Kang Liu ◽

Yan Ma ◽

Xu Qian ◽

Tongkai Ji

Keyword(s):

Artificial Intelligence ◽

Visual Information ◽

Coarse Grained ◽

Image Captioning ◽

Fine Grained ◽

The Core ◽

Benchmark Datasets ◽

Coarse To Fine ◽

Image Caption Generation ◽

Image Caption

Image caption generation is a fundamental task that builds a bridge between an image and its textual description, and it is drawing increasing interest in artificial intelligence. Images and textual sentences are viewed as two different carriers of information, which are symmetric and unified in the same content of a visual scene. Existing image captioning methods rarely consider generating the final description sentence in a coarse-grained to fine-grained way, which is how humans understand surrounding scenes, and the generated sentence sometimes describes only coarse-grained image content. Therefore, we propose a coarse-to-fine-grained hierarchical generation method for image captioning, named SDA-CFGHG, to address these two problems. The core of our SDA-CFGHG method is a sequential dual attention that fuses visual information of different granularities sequentially. The advantage of our SDA-CFGHG method is that it performs image captioning in a coarse-to-fine-grained way, and the generated sentence can capture details of the raw image to some degree. Moreover, we validate the performance of our method on benchmark datasets (MS COCO, Flickr) with several popular evaluation metrics (CIDEr, SPICE, METEOR, ROUGE-L, and BLEU).

Panoptic Segmentation-Based Attention for Image Captioning

Applied Sciences ◽

10.3390/app10010391 ◽

2020 ◽

Vol 10 (1) ◽

pp. 391

Author(s):

Wenjie Cai ◽

Zheng Xiong ◽

Xianfang Sun ◽

Paul L. Rosin ◽

Longcun Jin ◽

...

Keyword(s):

Main Part ◽

State Of The Art ◽

Image Representation ◽

Experimental Results ◽

Competitive Performance ◽

Image Captioning ◽

Feature Vectors ◽

Fine Grained ◽

Art Methods

Image captioning is the task of generating textual descriptions of images. To obtain a better image representation, attention mechanisms have been widely adopted in image captioning. However, in existing models with detection-based attention, the rectangular attention regions are not fine-grained, as they contain irrelevant regions (e.g., background or overlapped regions) around the object, which leads the model to generate inaccurate captions. To address this issue, we propose panoptic segmentation-based attention that performs attention at the mask level (i.e., the shape of the main part of an instance). Our approach extracts feature vectors from the corresponding segmentation regions, which is more fine-grained than current attention mechanisms. Moreover, to process features of different classes independently, we propose a dual-attention module that is generic and can be applied to other frameworks. Experimental results show that our model can recognize overlapped objects and understand the scene better. Our approach achieves competitive performance against state-of-the-art methods. We have made our code available.
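The contrast between rectangular (box-level) and mask-level regions can be seen in a toy sketch. This is not the paper's model or its released code: the helper names `mean_pool` and `box_of`, the 2x2 feature grid, and the one-dimensional features are all assumptions for illustration, and simple average pooling stands in for whatever feature extraction the authors use.

```python
def mean_pool(features, keep):
    """Average the feature vectors at grid cells where keep[r][c] is True."""
    selected = [
        features[r][c]
        for r in range(len(features))
        for c in range(len(features[0]))
        if keep[r][c]
    ]
    if not selected:
        raise ValueError("the region selects no cells")
    dim = len(selected[0])
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]


def box_of(mask):
    """Return the tight bounding box of a mask as a boolean grid."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [
        [r0 <= r <= r1 and c0 <= c <= c1 for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


# Hypothetical 2x2 grid: object cells carry 1.0, background cells 0.0.
features = [[[1.0], [0.0]], [[0.0], [1.0]]]
mask = [[True, False], [False, True]]  # a diagonal object

mask_feature = mean_pool(features, mask)         # object cells only
box_feature = mean_pool(features, box_of(mask))  # box also covers background
```

In this toy case the bounding box of the diagonal object covers the whole grid, so the box-level feature is diluted by background cells while the mask-level feature is not.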

Densely Supervised Hierarchical Policy-Value Network for Image Paragraph Generation
