scholarly journals A Multi-task Learning Approach for Image Captioning

Author(s):  
Wei Zhao ◽  
Benyou Wang ◽  
Jianbo Ye ◽  
Min Yang ◽  
Zhou Zhao ◽  
...  

In this paper, we propose a Multi-task Learning Approach for Image Captioning (MLAIC ), motivated by the fact that humans have no difficulty performing such task because they possess capabilities of multiple domains. Specifically, MLAIC consists of three key components: (i) A multi-object classification model that learns rich category-aware image representations using a CNN image encoder; (ii) A syntax generation model that learns better syntax-aware LSTM based decoder; (iii) An image captioning model that generates image descriptions in text, sharing its CNN encoder and LSTM decoder with the object classification task and the syntax generation task, respectively. In particular, the image captioning model can benefit from the additional object categorization and syntax knowledge. To verify the effectiveness of our approach, we conduct extensive experiments on MS-COCO dataset. The experimental results demonstrate that our model achieves impressive results compared to other strong competitors.

Author(s):  
Wenyu Du ◽  
Baocheng Li ◽  
Min Yang ◽  
Qiang Qu ◽  
Ying Shen

In this paper, we propose a Multi-Task learning approach for Answer Selection (MTAS), motivated by the fact that humans have no difficulty performing such task because they possess capabilities of multiple domains (tasks). Specifically, MTAS consists of two key components: (i) A category classification model that learns rich category-aware document representation; (ii) An answer selection model that provides the matching scores of question-answer pairs. These two tasks work on a shared document encoding layer, and they cooperate to learn a high-quality answer selection system. In addition, a multi-head attention mechanism is proposed to learn important information from different representation subspaces at different positions. We manually annotate the first Chinese question answering dataset in law domain (denoted as LawQA) to evaluate the effectiveness of our model. The experimental results show that our model MTAS consistently outperforms the compared methods.1


Author(s):  
Yao Lu ◽  
Linqing Liu ◽  
Zhile Jiang ◽  
Min Yang ◽  
Randy Goebel

We propose a Multi-task learning approach for Abstractive Text Summarization (MATS), motivated by the fact that humans have no difficulty performing such task because they have the capabilities of multiple domains. Specifically, MATS consists of three components: (i) a text categorization model that learns rich category-specific text representations using a bi-LSTM encoder; (ii) a syntax labeling model that learns to improve the syntax-aware LSTM decoder; and (iii) an abstractive text summarization model that shares its encoder and decoder with the text categorization and the syntax labeling tasks, respectively. In particular, the abstractive text summarization model enjoys significant benefit from the additional text categorization and syntax knowledge. Our experimental results show that MATS outperforms the competitors.1


2021 ◽  
Vol 11 (9) ◽  
pp. 3952
Author(s):  
Shimin Tang ◽  
Zhiqiang Chen

With the ubiquitous use of mobile imaging devices, the collection of perishable disaster-scene data has become unprecedentedly easy. However, computing methods are unable to understand these images with significant complexity and uncertainties. In this paper, the authors investigate the problem of disaster-scene understanding through a deep-learning approach. Two attributes of images are concerned, including hazard types and damage levels. Three deep-learning models are trained, and their performance is assessed. Specifically, the best model for hazard-type prediction has an overall accuracy (OA) of 90.1%, and the best damage-level classification model has an explainable OA of 62.6%, upon which both models adopt the Faster R-CNN architecture with a ResNet50 network as a feature extractor. It is concluded that hazard types are more identifiable than damage levels in disaster-scene images. Insights are revealed, including that damage-level recognition suffers more from inter- and intra-class variations, and the treatment of hazard-agnostic damage leveling further contributes to the underlying uncertainties.


2021 ◽  
Author(s):  
Yang Zhang ◽  
Ruohan Zong ◽  
Lanyu Shang ◽  
Md Tahmid Rashid ◽  
Dong Wang

Author(s):  
Toshiba Kamruzzaman ◽  
Soomanib Kamruzzaman ◽  
Abir Zaman

2020 ◽  
Vol 34 (05) ◽  
pp. 9571-9578 ◽  
Author(s):  
Wei Zhang ◽  
Yue Ying ◽  
Pan Lu ◽  
Hongyuan Zha

Personalized image caption, a natural extension of the standard image caption task, requires to generate brief image descriptions tailored for users' writing style and traits, and is more practical to meet users' real demands. Only a few recent studies shed light on this crucial task and learn static user representations to capture their long-term literal-preference. However, it is insufficient to achieve satisfactory performance due to the intrinsic existence of not only long-term user literal-preference, but also short-term literal-preference which is associated with users' recent states. To bridge this gap, we develop a novel multimodal hierarchical transformer network (MHTN) for personalized image caption in this paper. It learns short-term user literal-preference based on users' recent captions through a short-term user encoder at the low level. And at the high level, the multimodal encoder integrates target image representations with short-term literal-preference, as well as long-term literal-preference learned from user IDs. These two encoders enjoy the advantages of the powerful transformer networks. Extensive experiments on two real datasets show the effectiveness of considering two types of user literal-preference simultaneously and better performance over the state-of-the-art models.


2022 ◽  
Author(s):  
Maede Maftouni ◽  
Bo Shen ◽  
Andrew Chung Chee Law ◽  
Niloofar Ayoobi Yazdi ◽  
Zhenyu Kong

<p>The global extent of COVID-19 mutations and the consequent depletion of hospital resources highlighted the necessity of effective computer-assisted medical diagnosis. COVID-19 detection mediated by deep learning models can help diagnose this highly contagious disease and lower infectivity and mortality rates. Computed tomography (CT) is the preferred imaging modality for building automatic COVID-19 screening and diagnosis models. It is well-known that the training set size significantly impacts the performance and generalization of deep learning models. However, accessing a large dataset of CT scan images from an emerging disease like COVID-19 is challenging. Therefore, data efficiency becomes a significant factor in choosing a learning model. To this end, we present a multi-task learning approach, namely, a mask-guided attention (MGA) classifier, to improve the generalization and data efficiency of COVID-19 classification on lung CT scan images.</p><p>The novelty of this method is compensating for the scarcity of data by employing more supervision with lesion masks, increasing the sensitivity of the model to COVID-19 manifestations, and helping both generalization and classification performance. Our proposed model achieves better overall performance than the single-task baseline and state-of-the-art models, as measured by various popular metrics. In our experiment with different percentages of data from our curated dataset, the classification performance gain from this multi-task learning approach is more significant for the smaller training sizes. Furthermore, experimental results demonstrate that our method enhances the focus on the lesions, as witnessed by both</p><p>attention and attribution maps, resulting in a more interpretable model.</p>


Sign in / Sign up

Export Citation Format

Share Document