Integration of textual cues for fine-grained image captioning using deep CNN and LSTM

2019 ◽  
Vol 32 (24) ◽  
pp. 17899-17908 ◽  
Author(s):  
Neeraj Gupta ◽  
Anand Singh Jalal

Author(s):
Siying Wu ◽  
Zheng-Jun Zha ◽  
Zilei Wang ◽  
Houqiang Li ◽  
Feng Wu

Image paragraph generation aims to describe an image with a paragraph in natural language. Compared to image captioning with a single sentence, paragraph generation provides a more expressive and fine-grained description for storytelling. Existing approaches mainly optimize the paragraph generator toward minimizing a word-wise cross-entropy loss, which neglects the linguistic hierarchy of a paragraph and results in "sparse" supervision for generator learning. In this paper, we propose a novel Densely Supervised Hierarchical Policy-Value (DHPV) network for effective paragraph generation. We design new hierarchical supervisions consisting of hierarchical rewards and values at both the sentence and word levels. The joint exploration of hierarchical rewards and values provides dense supervision cues for learning an effective paragraph generator. We propose a new hierarchical policy-value architecture that exploits compositionality at the token-to-token and sentence-to-sentence levels simultaneously and preserves semantic and syntactic constituent integrity. Extensive experiments on the Stanford image-paragraph benchmark demonstrate the effectiveness of the proposed DHPV approach, with performance improvements over multiple state-of-the-art methods.
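
The abstract does not include code; the minimal Python sketch below only illustrates the general idea of dense hierarchical supervision, i.e., combining word-level and sentence-level rewards with critic values to produce a per-token learning signal. The metric callback `score_fn`, the function names, and the exact advantage formula are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): dense hierarchical supervision
# for a paragraph generator, assuming a sentence-level metric `score_fn`
# (e.g., CIDEr against reference sentences) and value estimates from a critic.
# All names here are hypothetical.

from typing import Callable, List


def word_level_rewards(sentence: List[str],
                       references: List[str],
                       score_fn: Callable[[str, List[str]], float]) -> List[float]:
    """Reward each word by how much it improves the partial-sentence score.

    This densifies the usual single end-of-sentence reward into one reward
    per generated token.
    """
    rewards, prev_score = [], 0.0
    for t in range(1, len(sentence) + 1):
        partial = " ".join(sentence[:t])
        score = score_fn(partial, references)
        rewards.append(score - prev_score)   # marginal contribution of word t
        prev_score = score
    return rewards


def hierarchical_advantages(word_rewards: List[float],
                            sentence_reward: float,
                            word_values: List[float],
                            sentence_value: float,
                            gamma: float = 1.0) -> List[float]:
    """Combine word- and sentence-level rewards with critic values.

    Each word's advantage mixes a local (word-level) temporal-difference
    term with a global sentence-level term, so the generator receives a
    dense learning signal instead of one sparse end-of-paragraph reward.
    """
    advantages = []
    for t, r_w in enumerate(word_rewards):
        next_value = word_values[t + 1] if t + 1 < len(word_values) else 0.0
        td_word = r_w + gamma * next_value - word_values[t]
        td_sent = sentence_reward - sentence_value
        advantages.append(td_word + td_sent)
    return advantages
```

In this sketch, every generated token is supervised at its own step, which is the "dense" property the abstract contrasts with word-wise cross-entropy training.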


2019 ◽  
Vol 21 (7) ◽  
pp. 1681-1693 ◽  
Author(s):  
Zongjian Zhang ◽  
Qiang Wu ◽  
Yang Wang ◽  
Fang Chen

2020 ◽  
pp. 1-1 ◽  
Author(s):  
Jie Wu ◽  
Tianshui Chen ◽  
Hefeng Wu ◽  
Zhi Yang ◽  
Guangchun Luo ◽  
...  

2020 ◽  
Author(s):  
Bharathwaaj Venkatesan ◽  
Ravinder Kaur Sond

Multi-modality image captioning


Author(s):  
Fenglin Liu ◽  
Xuancheng Ren ◽  
Yuanxin Liu ◽  
Kai Lei ◽  
Xu Sun

Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still have great difficulty achieving deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. Globally, it provides the aspect vector, a spatial and relational representation of the image based on caption contexts, through the extraction of salient region groupings and attribute collocations; locally, it extracts fine-grained regions and attributes with reference to the aspect vector for word selection. Our fully attentive model achieves a CIDEr score of 129.3 on the offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.
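
As a rough illustration of the global-and-local attention described above (not the released model), the PyTorch sketch below first distills a global "aspect vector" from region features conditioned on the caption context, then reuses it as a query to pick out fine-grained regions and attributes. The module structure and all names are hypothetical.

```python
# Illustrative sketch (hypothetical names): global aspect distillation followed
# by local, aspect-guided selection of fine-grained regions and attributes.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.global_query = nn.Linear(dim, dim)   # caption context -> global query
        self.local_query = nn.Linear(dim, dim)    # aspect vector -> local query

    def attend(self, query, keys):
        # Scaled dot-product attention of one query vector over a set of features.
        scores = torch.einsum("bd,bnd->bn", query, keys) / keys.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)
        return torch.einsum("bn,bnd->bd", weights, keys)

    def forward(self, context, regions, attributes):
        # Globally: distill an aspect vector from the visual regions,
        # using the current caption context as the query.
        aspect = self.attend(self.global_query(context), regions)
        # Locally: use the aspect vector to select fine-grained regions
        # and attribute embeddings for word prediction.
        q = self.local_query(aspect)
        fine_regions = self.attend(q, regions)
        fine_attrs = self.attend(q, attributes)
        return aspect, fine_regions, fine_attrs
```

The sketch assumes a caption-context vector of shape (B, D), region features of shape (B, N, D), and attribute embeddings of shape (B, K, D), all with the same feature dimension D.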


2018 ◽  
Vol 49 (2) ◽  
pp. 683-691
Author(s):  
Gengshi Huang ◽  
Haifeng Hu


2020 ◽  
Vol 34 (07) ◽  
pp. 11572-11579 ◽  
Author(s):  
Fenglin Liu ◽  
Xian Wu ◽  
Shen Ge ◽  
Wei Fan ◽  
Yuexian Zou

Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), have attracted extensive interest from both the academic and industrial worlds. However, given the similarity of these tasks, efforts to obtain better results by combining the merits of their algorithms have not been well studied. Inspired by the recent success of federated learning, we propose a federated learning framework that obtains various types of image representations from different tasks, which are then fused together to form fine-grained image representations. The representations merge useful features from different vision-and-language grounding problems and are thus much more powerful than the original representations alone on individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). aimNet is validated in three federated learning settings: horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments with the aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate effective and universal improvements on all metrics over the baselines. In image captioning, we obtain 14% and 13% relative gains on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we also boost the performance of strong baselines by up to 3%.
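
The abstract names aimNet's three stages but gives no implementation details; the sketch below only mirrors those stage names (aligning, integrating, mapping) with a generic gated fusion of two task-specific image representations. The layer choices, dimensions, and class name are assumptions for illustration.

```python
# Illustrative sketch (hypothetical, not the aimNet release): image features
# from two task-specific encoders (e.g., a captioning backbone and a VQA
# backbone) are aligned into a shared space, integrated with a learned gate,
# and mapped to a fused representation for a downstream task.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedImageRepresentation(nn.Module):
    def __init__(self, dim_caption: int, dim_vqa: int, dim_shared: int):
        super().__init__()
        # Aligning: project each task's features into a shared space.
        self.align_caption = nn.Linear(dim_caption, dim_shared)
        self.align_vqa = nn.Linear(dim_vqa, dim_shared)
        # Integrating: gate how much each source contributes per dimension.
        self.gate = nn.Linear(2 * dim_shared, dim_shared)
        # Mapping: project the fused features to the output representation.
        self.map_out = nn.Linear(dim_shared, dim_shared)

    def forward(self, feats_caption, feats_vqa):
        a = F.relu(self.align_caption(feats_caption))
        b = F.relu(self.align_vqa(feats_vqa))
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        fused = g * a + (1.0 - g) * b          # element-wise gated fusion
        return self.map_out(fused)
```

The element-wise gate lets the fused representation lean on whichever source encoder is more informative for a given feature dimension, which is one simple way to realize the "merge useful features from different tasks" idea described above.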

