scholarly journals Show, Recall, and Tell: Image Captioning with Recall Mechanism

2020 ◽  
Vol 34 (07) ◽  
pp. 12176-12183
Author(s):  
Li Wang ◽  
Zechen Bai ◽  
Yonghua Zhang ◽  
Hongtao Lu

Generating natural and accurate descriptions in image captioning has always been a challenge. In this paper, we propose a novel recall mechanism to imitate the way human conduct captioning. There are three parts in our recall mechanism : recall unit, semantic guide (SG) and recalled-word slot (RWS). Recall unit is a text-retrieval module designed to retrieve recalled words for images. SG and RWS are designed for the best use of recalled words. SG branch can generate a recalled context, which can guide the process of generating caption. RWS branch is responsible for copying recalled words to the caption. Inspired by pointing mechanism in text summarization, we adopt a soft switch to balance the generated-word probabilities between SG and RWS. In the CIDEr optimization step, we also introduce an individual recalled-word reward (WR) to boost training. Our proposed methods (SG+RWS+WR) achieve BLEU-4 / CIDEr / SPICE scores of 36.6 / 116.9 / 21.3 with cross-entropy loss and 38.7 / 129.1 / 22.4 with CIDEr optimization on MSCOCO Karpathy test split, which surpass the results of other state-of-the-art methods.

Author(s):  
Siying Wu ◽  
Zheng-Jun Zha ◽  
Zilei Wang ◽  
Houqiang Li ◽  
Feng Wu

Image paragraph generation aims to describe an image with a paragraph in natural language. Compared to image captioning with a single sentence, paragraph generation provides more expressive and fine-grained description for storytelling. Existing approaches mainly optimize paragraph generator towards minimizing word-wise cross entropy loss, which neglects linguistic hierarchy of paragraph and results in ``sparse" supervision for generator learning. In this paper, we propose a novel Densely Supervised Hierarchical Policy-Value (DHPV) network for effective paragraph generation. We design new hierarchical supervisions consisting of hierarchical rewards and values at both sentence and word levels. The joint exploration of hierarchical rewards and values provides dense supervision cues for learning effective paragraph generator. We propose a new hierarchical policy-value architecture which exploits compositionality at token-to-token and sentence-to-sentence levels simultaneously and can preserve the semantic and syntactic constituent integrity. Extensive experiments on the Stanford image-paragraph benchmark have demonstrated the effectiveness of the proposed DHPV approach with performance improvements over multiple state-of-the-art methods.


Author(s):  
Anca Daniela Ionita

This chapter presents the fundamental ideas related to migrating legacy applications to service-oriented systems, and provides an overview of the available approaches that are presented in this book. The goal is to provide a “big picture” while also analyzing each chapter and indicating the way it covers several essential concerns, such as state-of-the-art, methods, standards, tools, business perspective, practical experiments, strategies, and roadmaps.


2020 ◽  
Vol 10 (1) ◽  
pp. 391
Author(s):  
Wenjie Cai ◽  
Zheng Xiong ◽  
Xianfang Sun ◽  
Paul L. Rosin ◽  
Longcun Jin ◽  
...  

Image captioning is the task of generating textual descriptions of images. In order to obtain a better image representation, attention mechanisms have been widely adopted in image captioning. However, in existing models with detection-based attention, the rectangular attention regions are not fine-grained, as they contain irrelevant regions (e.g., background or overlapped regions) around the object, making the model generate inaccurate captions. To address this issue, we propose panoptic segmentation-based attention that performs attention at a mask-level (i.e., the shape of the main part of an instance). Our approach extracts feature vectors from the corresponding segmentation regions, which is more fine-grained than current attention mechanisms. Moreover, in order to process features of different classes independently, we propose a dual-attention module which is generic and can be applied to other frameworks. Experimental results showed that our model could recognize the overlapped objects and understand the scene better. Our approach achieved competitive performance against state-of-the-art methods. We made our code available.


2021 ◽  
Vol 7 (1) ◽  
pp. 6
Author(s):  
Guanqiu Qi ◽  
Gang Hu ◽  
Xiaofei Wang ◽  
Neal Mazur ◽  
Zhiqin Zhu ◽  
...  

Person re-identification (Re-ID) is challenging due to host of factors: the variety of human positions, difficulties in aligning bounding boxes, and complex backgrounds, among other factors. This paper proposes a new framework called EXAM (EXtreme And Moderate feature embeddings) for Re-ID tasks. This is done using discriminative feature learning, requiring attention-based guidance during training. Here “Extreme” refers to salient human features and “Moderate” refers to common human features. In this framework, these types of embeddings are calculated by global max-pooling and average-pooling operations respectively; and then, jointly supervised by multiple triplet and cross-entropy loss functions. The processes of deducing attention from learned embeddings and discriminative feature learning are incorporated, and benefit from each other in this end-to-end framework. From the comparative experiments and ablation studies, it is shown that the proposed EXAM is effective, and its learned feature representation reaches state-of-the-art performance.


2018 ◽  
Vol 7 (4) ◽  
pp. 603-622 ◽  
Author(s):  
Leonardo Gutiérrez-Gómez ◽  
Jean-Charles Delvenne

Abstract Several social, medical, engineering and biological challenges rely on discovering the functionality of networks from their structure and node metadata, when it is available. For example, in chemoinformatics one might want to detect whether a molecule is toxic based on structure and atomic types, or discover the research field of a scientific collaboration network. Existing techniques rely on counting or measuring structural patterns that are known to show large variations from network to network, such as the number of triangles, or the assortativity of node metadata. We introduce the concept of multi-hop assortativity, that captures the similarity of the nodes situated at the extremities of a randomly selected path of a given length. We show that multi-hop assortativity unifies various existing concepts and offers a versatile family of ‘fingerprints’ to characterize networks. These fingerprints allow in turn to recover the functionalities of a network, with the help of the machine learning toolbox. Our method is evaluated empirically on established social and chemoinformatic network benchmarks. Results reveal that our assortativity based features are competitive providing highly accurate results often outperforming state of the art methods for the network classification task.


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 325
Author(s):  
Zhihao Wu ◽  
Baopeng Zhang ◽  
Tianchen Zhou ◽  
Yan Li ◽  
Jianping Fan

In this paper, we developed a practical approach for automatic detection of discrimination actions from social images. Firstly, an image set is established, in which various discrimination actions and relations are manually labeled. To the best of our knowledge, this is the first work to create a dataset for discrimination action recognition and relationship identification. Secondly, a practical approach is developed to achieve automatic detection and identification of discrimination actions and relationships from social images. Thirdly, the task of relationship identification is seamlessly integrated with the task of discrimination action recognition into one single network called the Co-operative Visual Translation Embedding++ network (CVTransE++). We also compared our proposed method with numerous state-of-the-art methods, and our experimental results demonstrated that our proposed methods can significantly outperform state-of-the-art approaches.


2021 ◽  
Vol 11 (8) ◽  
pp. 3636
Author(s):  
Faria Zarin Subah ◽  
Kaushik Deb ◽  
Pranab Kumar Dhar ◽  
Takeshi Koshiba

Autism spectrum disorder (ASD) is a complex and degenerative neuro-developmental disorder. Most of the existing methods utilize functional magnetic resonance imaging (fMRI) to detect ASD with a very limited dataset which provides high accuracy but results in poor generalization. To overcome this limitation and to enhance the performance of the automated autism diagnosis model, in this paper, we propose an ASD detection model using functional connectivity features of resting-state fMRI data. Our proposed model utilizes two commonly used brain atlases, Craddock 200 (CC200) and Automated Anatomical Labelling (AAL), and two rarely used atlases Bootstrap Analysis of Stable Clusters (BASC) and Power. A deep neural network (DNN) classifier is used to perform the classification task. Simulation results indicate that the proposed model outperforms state-of-the-art methods in terms of accuracy. The mean accuracy of the proposed model was 88%, whereas the mean accuracy of the state-of-the-art methods ranged from 67% to 85%. The sensitivity, F1-score, and area under receiver operating characteristic curve (AUC) score of the proposed model were 90%, 87%, and 96%, respectively. Comparative analysis on various scoring strategies show the superiority of BASC atlas over other aforementioned atlases in classifying ASD and control.


Sign in / Sign up

Export Citation Format

Share Document