Show, Recall, and Tell: Image Captioning with Recall Mechanism

Li Wang; Zechen Bai; Yonghua Zhang; Hongtao Lu

doi:10.1609/aaai.v34i07.6898

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6898 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12176-12183

Author(s):

Li Wang ◽

Zechen Bai ◽

Yonghua Zhang ◽

Hongtao Lu

Keyword(s):

State Of The Art ◽

Text Retrieval ◽

Text Summarization ◽

Cross Entropy ◽

Image Captioning ◽

Soft Switch ◽

Entropy Loss ◽

Human Conduct ◽

Art Methods ◽

The Way

Generating natural and accurate descriptions in image captioning has always been a challenge. In this paper, we propose a novel recall mechanism that imitates the way humans conduct captioning. Our recall mechanism has three parts: a recall unit, a semantic guide (SG), and a recalled-word slot (RWS). The recall unit is a text-retrieval module designed to retrieve recalled words for images. SG and RWS are designed to make the best use of the recalled words. The SG branch generates a recalled context, which guides the process of generating the caption. The RWS branch is responsible for copying recalled words into the caption. Inspired by the pointing mechanism in text summarization, we adopt a soft switch to balance the generated-word probabilities between SG and RWS. In the CIDEr optimization step, we also introduce an individual recalled-word reward (WR) to boost training. Our proposed methods (SG+RWS+WR) achieve BLEU-4 / CIDEr / SPICE scores of 36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr optimization on the MSCOCO Karpathy test split, which surpass the results of other state-of-the-art methods.
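The soft switch mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the switch value is computed or which branch it weights, so the code below assumes a fixed scalar in [0, 1] that weights the generation branch, and the function name `mix_with_soft_switch` and the toy distributions are hypothetical. It shows only the general pointer-style mixing idea, not the authors' implementation.

```python
def mix_with_soft_switch(p_vocab, p_copy, switch):
    """Blend two word distributions with a scalar soft switch in [0, 1].

    p_vocab: dict word -> probability from the generation branch (assumed to play the SG role)
    p_copy:  dict word -> probability over recalled words (assumed to play the RWS role)
    switch:  weight given to the generation branch (assumption: a fixed scalar;
             a real model would typically predict it at each decoding step)
    """
    if not 0.0 <= switch <= 1.0:
        raise ValueError("switch must lie in [0, 1]")
    words = set(p_vocab) | set(p_copy)
    # Words missing from one distribution contribute zero probability from it.
    return {
        w: switch * p_vocab.get(w, 0.0) + (1.0 - switch) * p_copy.get(w, 0.0)
        for w in words
    }


# Toy distributions, invented for illustration only.
p_vocab = {"a": 0.5, "dog": 0.3, "cat": 0.2}
p_copy = {"dog": 0.75, "frisbee": 0.25}
mixed = mix_with_soft_switch(p_vocab, p_copy, switch=0.6)
```

Because both inputs sum to one and the weights sum to one, the mixture is again a valid distribution. A recalled word that the generation branch never proposes ("frisbee" here) still receives probability mass through the copy branch.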

Densely Supervised Hierarchical Policy-Value Network for Image Paragraph Generation

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/137 ◽

2019 ◽

Author(s):

Siying Wu ◽

Zheng-Jun Zha ◽

Zilei Wang ◽

Houqiang Li ◽

Feng Wu

Keyword(s):

Natural Language ◽

State Of The Art ◽

Cross Entropy ◽

Image Captioning ◽

Value Network ◽

Entropy Loss ◽

Fine Grained ◽

Performance Improvements ◽

Single Sentence ◽

Multiple State

Image paragraph generation aims to describe an image with a paragraph in natural language. Compared to image captioning with a single sentence, paragraph generation provides a more expressive and fine-grained description for storytelling. Existing approaches mainly optimize the paragraph generator by minimizing word-wise cross-entropy loss, which neglects the linguistic hierarchy of a paragraph and results in "sparse" supervision for generator learning. In this paper, we propose a novel Densely Supervised Hierarchical Policy-Value (DHPV) network for effective paragraph generation. We design new hierarchical supervisions consisting of hierarchical rewards and values at both the sentence and word levels. The joint exploration of hierarchical rewards and values provides dense supervision cues for learning an effective paragraph generator. We propose a new hierarchical policy-value architecture that exploits compositionality at the token-to-token and sentence-to-sentence levels simultaneously and can preserve semantic and syntactic constituent integrity. Extensive experiments on the Stanford image-paragraph benchmark demonstrate the effectiveness of the proposed DHPV approach, with performance improvements over multiple state-of-the-art methods.

Truncation Cross Entropy Loss for Remote Sensing Image Captioning

IEEE Transactions on Geoscience and Remote Sensing ◽

10.1109/tgrs.2020.3010106 ◽

2020 ◽

pp. 1-12

Author(s):

Xuelong Li ◽

Xueting Zhang ◽

Wei Huang ◽

Qi Wang

Keyword(s):

Remote Sensing ◽

Remote Sensing Image ◽

Cross Entropy ◽

Image Captioning ◽

Entropy Loss

Introduction to the Migration from Legacy Applications to Service Provisioning

Migrating Legacy Applications ◽

10.4018/978-1-4666-2488-7.ch001 ◽

2012 ◽

pp. 1-11 ◽

Cited By ~ 1

Author(s):

Anca Daniela Ionita

Keyword(s):

State Of The Art ◽

Service Provisioning ◽

Business Perspective ◽

Art Methods ◽

Big Picture ◽

Service Oriented ◽

Legacy Applications ◽

The Way

This chapter presents the fundamental ideas related to migrating legacy applications to service-oriented systems, and provides an overview of the available approaches that are presented in this book. The goal is to provide a “big picture” while also analyzing each chapter and indicating the way it covers several essential concerns, such as state-of-the-art, methods, standards, tools, business perspective, practical experiments, strategies, and roadmaps.

Multimodal Image Captioning Through Combining Reinforced Cross Entropy Loss and Stochastic Deprecation

2019 IEEE International Conference on Multimedia and Expo (ICME) ◽

10.1109/icme.2019.00229 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xi Meng ◽

Hao Kong ◽

Dongqi Tang ◽

Tong Lu

Keyword(s):

Cross Entropy ◽

Image Captioning ◽

Entropy Loss

Panoptic Segmentation-Based Attention for Image Captioning

Applied Sciences ◽

10.3390/app10010391 ◽

2020 ◽

Vol 10 (1) ◽

pp. 391

Author(s):

Wenjie Cai ◽

Zheng Xiong ◽

Xianfang Sun ◽

Paul L. Rosin ◽

Longcun Jin ◽

...

Keyword(s):

Main Part ◽

State Of The Art ◽

Image Representation ◽

Experimental Results ◽

Competitive Performance ◽

Image Captioning ◽

Feature Vectors ◽

Fine Grained ◽

Art Methods

Image captioning is the task of generating textual descriptions of images. To obtain a better image representation, attention mechanisms have been widely adopted in image captioning. However, in existing models with detection-based attention, the rectangular attention regions are not fine-grained: they contain irrelevant regions (e.g., background or overlapped regions) around the object, which leads the model to generate inaccurate captions. To address this issue, we propose panoptic segmentation-based attention that performs attention at the mask level (i.e., the shape of the main part of an instance). Our approach extracts feature vectors from the corresponding segmentation regions, which is more fine-grained than current attention mechanisms. Moreover, in order to process features of different classes independently, we propose a dual-attention module that is generic and can be applied to other frameworks. Experimental results showed that our model could recognize overlapped objects and understand the scene better. Our approach achieved competitive performance against state-of-the-art methods. We have made our code available.
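As a rough illustration of extracting a feature vector from a segmentation region rather than a rectangular box, the sketch below averages the feature vectors at the positions selected by a binary mask. The abstract does not state which pooling operation is used, so averaging is an assumption, and the function name `masked_average_pool` and the toy data are hypothetical.

```python
def masked_average_pool(feature_map, mask):
    """Average the feature vectors at positions where mask is 1.

    feature_map: H x W grid (nested lists) of length-C feature vectors
    mask:        H x W grid of 0/1 values marking the segment (assumed binary)
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c, value in enumerate(vec):
                    total[c] += value
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]


# 2 x 2 feature map with 2 channels; the bottom-right cell stands in for background.
feature_map = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [9.0, 9.0]],
]
mask = [
    [1, 1],
    [1, 0],
]
region_feature = masked_average_pool(feature_map, mask)
```

A rectangular box around the same object would include the bottom-right cell and pull the pooled feature toward the background values. The mask excludes that cell.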

EXAM: A Framework of Learning Extreme and Moderate Embeddings for Person Re-ID

Journal of Imaging ◽

10.3390/jimaging7010006 ◽

2021 ◽

Vol 7 (1) ◽

pp. 6

Author(s):

Guanqiu Qi ◽

Gang Hu ◽

Xiaofei Wang ◽

Neal Mazur ◽

Zhiqin Zhu ◽

...

Keyword(s):

State Of The Art ◽

Feature Learning ◽

Feature Representation ◽

Cross Entropy ◽

Loss Functions ◽

Entropy Loss ◽

Max Pooling ◽

Discriminative Feature ◽

Bounding Boxes ◽

New Framework

Person re-identification (Re-ID) is challenging due to a host of factors, including the variety of human positions, difficulties in aligning bounding boxes, and complex backgrounds. This paper proposes a new framework called EXAM (EXtreme And Moderate feature embeddings) for Re-ID tasks. The framework uses discriminative feature learning, which requires attention-based guidance during training. Here "Extreme" refers to salient human features and "Moderate" refers to common human features. In this framework, these two types of embeddings are calculated by global max-pooling and average-pooling operations, respectively, and are then jointly supervised by multiple triplet and cross-entropy loss functions. The processes of deducing attention from learned embeddings and of discriminative feature learning are incorporated into this end-to-end framework and benefit from each other. Comparative experiments and ablation studies show that the proposed EXAM is effective and that its learned feature representation reaches state-of-the-art performance.
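The two pooling operations named in the abstract can be sketched as follows. This is only a minimal illustration of global max-pooling and global average-pooling over a flattened feature map; the function names and toy values are hypothetical, and the attention guidance and loss functions of the actual framework are not shown.

```python
def global_max_pool(positions):
    """Per-channel maximum over all spatial positions (the 'Extreme' embedding idea)."""
    return [max(channel) for channel in zip(*positions)]


def global_avg_pool(positions):
    """Per-channel mean over all spatial positions (the 'Moderate' embedding idea)."""
    return [sum(channel) / len(positions) for channel in zip(*positions)]


# Three spatial positions, each with a 2-channel feature vector (toy values).
positions = [
    [1.0, 2.0],
    [5.0, 1.0],
    [0.0, 3.0],
]
extreme = global_max_pool(positions)
moderate = global_avg_pool(positions)
```

Max-pooling keeps only the strongest response in each channel, which fits the description of salient features. Average-pooling summarizes responses across all positions, which fits the description of common features.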

On Parameter Adaptation in Softmax-Based Cross-Entropy Loss for Improved Convergence Speed and Accuracy in DNN-Based Speaker Recognition

10.21437/interspeech.2020-2264 ◽

2020 ◽

Author(s):

Magdalena Rybicka ◽

Konrad Kowalczyk

Keyword(s):

Speaker Recognition ◽

Convergence Speed ◽

Cross Entropy ◽

Parameter Adaptation ◽

Entropy Loss ◽

Speed And Accuracy

Multi-hop assortativities for network classification

Journal of Complex Networks ◽

10.1093/comnet/cny034 ◽

2018 ◽

Vol 7 (4) ◽

pp. 603-622 ◽

Cited By ~ 1

Author(s):

Leonardo Gutiérrez-Gómez ◽

Jean-Charles Delvenne

Keyword(s):

Machine Learning ◽

Scientific Collaboration ◽

State Of The Art ◽

Medical Engineering ◽

Research Field ◽

Classification Task ◽

Collaboration Network ◽

Structural Patterns ◽

Art Methods

Several social, medical, engineering and biological challenges rely on discovering the functionality of networks from their structure and node metadata, when it is available. For example, in chemoinformatics one might want to detect whether a molecule is toxic based on structure and atomic types, or discover the research field of a scientific collaboration network. Existing techniques rely on counting or measuring structural patterns that are known to show large variations from network to network, such as the number of triangles, or the assortativity of node metadata. We introduce the concept of multi-hop assortativity, which captures the similarity of the nodes situated at the extremities of a randomly selected path of a given length. We show that multi-hop assortativity unifies various existing concepts and offers a versatile family of 'fingerprints' to characterize networks. These fingerprints in turn make it possible to recover the functionalities of a network with the help of the machine learning toolbox. Our method is evaluated empirically on established social and chemoinformatic network benchmarks. Results reveal that our assortativity-based features are competitive, providing highly accurate results and often outperforming state-of-the-art methods for the network classification task.

Automatic Detection of Discrimination Actions from Social Images

Electronics ◽

10.3390/electronics10030325 ◽

2021 ◽

Vol 10 (3) ◽

pp. 325

Author(s):

Zhihao Wu ◽

Baopeng Zhang ◽

Tianchen Zhou ◽

Yan Li ◽

Jianping Fan

Keyword(s):

Action Recognition ◽

State Of The Art ◽

Automatic Detection ◽

Experimental Results ◽

Practical Approach ◽

Detection And Identification ◽

Art Methods ◽

Image Set ◽

Social Images ◽

Relationship Identification

In this paper, we developed a practical approach for automatic detection of discrimination actions from social images. Firstly, an image set is established, in which various discrimination actions and relations are manually labeled. To the best of our knowledge, this is the first work to create a dataset for discrimination action recognition and relationship identification. Secondly, a practical approach is developed to achieve automatic detection and identification of discrimination actions and relationships from social images. Thirdly, the task of relationship identification is seamlessly integrated with the task of discrimination action recognition into one single network called the Co-operative Visual Translation Embedding++ network (CVTransE++). We also compared our proposed method with numerous state-of-the-art methods, and our experimental results demonstrated that our proposed methods can significantly outperform state-of-the-art approaches.

A Deep Learning Approach to Predict Autism Spectrum Disorder Using Multisite Resting-State fMRI

Applied Sciences ◽

10.3390/app11083636 ◽

2021 ◽

Vol 11 (8) ◽

pp. 3636

Author(s):

Faria Zarin Subah ◽

Kaushik Deb ◽

Pranab Kumar Dhar ◽

Takeshi Koshiba

Keyword(s):

Autism Spectrum Disorder ◽

Resting State ◽

State Of The Art ◽

Resting State Fmri ◽

Autism Spectrum ◽

Spectrum Disorder ◽

Bootstrap Analysis ◽

Proposed Model ◽

Art Methods ◽

The Mean

Autism spectrum disorder (ASD) is a complex and degenerative neuro-developmental disorder. Most existing methods use functional magnetic resonance imaging (fMRI) to detect ASD with a very limited dataset, which provides high accuracy but results in poor generalization. To overcome this limitation and to enhance the performance of the automated autism diagnosis model, in this paper we propose an ASD detection model using functional connectivity features of resting-state fMRI data. Our proposed model uses two commonly used brain atlases, Craddock 200 (CC200) and Automated Anatomical Labelling (AAL), and two rarely used atlases, Bootstrap Analysis of Stable Clusters (BASC) and Power. A deep neural network (DNN) classifier is used to perform the classification task. Simulation results indicate that the proposed model outperforms state-of-the-art methods in terms of accuracy. The mean accuracy of the proposed model was 88%, whereas the mean accuracy of the state-of-the-art methods ranged from 67% to 85%. The sensitivity, F1-score, and area under the receiver operating characteristic curve (AUC) score of the proposed model were 90%, 87%, and 96%, respectively. Comparative analysis of various scoring strategies shows the superiority of the BASC atlas over the other aforementioned atlases in classifying ASD and control.
