Hierarchical Attention Network for Image Captioning

Author(s):  
Weixuan Wang ◽  
Zhihong Chen ◽  
Haifeng Hu

Recently, attention mechanisms have been successfully applied to image captioning, but existing attention methods are established only on low-level spatial features or high-level text features, which limits the richness of captions. In this paper, we propose a Hierarchical Attention Network (HAN) that enables attention to be calculated on a pyramidal hierarchy of features synchronously. The pyramidal hierarchy consists of features on diverse semantic levels, which allows different words to be predicted according to different features. On the other hand, because these features come from different modalities, a Multivariate Residual Module (MRM) is proposed to learn joint representations from them. The MRM is able to model projections and extract relevant relations among different features. Furthermore, we introduce a context gate to balance the contributions of the different features. Compared with existing methods, our approach applies hierarchical features and exploits several multimodal integration strategies, which significantly improves performance. The HAN is verified on the benchmark MSCOCO dataset, and the experimental results indicate that our model outperforms state-of-the-art methods, achieving a BLEU-1 score of 80.9 and a CIDEr score of 121.7 on the Karpathy test split.
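The context gate described in this abstract can be pictured as a learned sigmoid gate that balances projected low-level visual and high-level semantic features before decoding. The following minimal PyTorch sketch is illustrative only; the layer names, dimensions, and exact gating form are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Gated combination of low-level visual and high-level semantic features."""
    def __init__(self, visual_dim, semantic_dim, hidden_dim):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, hidden_dim)    # project low-level features
        self.proj_s = nn.Linear(semantic_dim, hidden_dim)  # project high-level features
        self.gate = nn.Linear(hidden_dim * 2, hidden_dim)  # per-dimension gate

    def forward(self, visual_feat, semantic_feat):
        v = self.proj_v(visual_feat)
        s = self.proj_s(semantic_feat)
        g = torch.sigmoid(self.gate(torch.cat([v, s], dim=-1)))
        return g * v + (1.0 - g) * s  # gated fusion fed to the caption decoder

# toy usage with assumed feature sizes
gate = ContextGate(visual_dim=2048, semantic_dim=512, hidden_dim=512)
fused = gate(torch.randn(4, 2048), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```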

2020 ◽  
Vol 34 (4) ◽  
pp. 571-584
Author(s):  
Rajarshi Biswas ◽  
Michael Barz ◽  
Daniel Sonntag

Abstract: Image captioning is a challenging multimodal task. Significant improvements have been obtained with deep learning, yet captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim to improve the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and low-level features obtained from object-specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized by re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
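The joint embedding space the abstract describes can be sketched as additive attention over the union of CNN grid features and embedded Mask R-CNN region features. The sketch below is a hypothetical illustration under assumed shapes and layer names; it is not the authors' code.

```python
import torch
import torch.nn as nn

class JointSpaceAttention(nn.Module):
    """Attention over the union of grid features and region (bounding-box) features."""
    def __init__(self, grid_dim, region_dim, hidden_dim):
        super().__init__()
        self.embed_grid = nn.Linear(grid_dim, hidden_dim)
        self.embed_region = nn.Linear(region_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.att = nn.Linear(hidden_dim, 1)

    def forward(self, grid_feats, region_feats, decoder_state):
        # grid_feats: (B, Ng, grid_dim); region_feats: (B, Nr, region_dim)
        # decoder_state: (B, hidden_dim) current LSTM state of the caption decoder
        joint = torch.cat([self.embed_grid(grid_feats),
                           self.embed_region(region_feats)], dim=1)  # (B, Ng+Nr, H)
        scores = self.att(torch.tanh(joint + self.query(decoder_state).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)       # attention weights over all regions,
        return (alpha * joint).sum(dim=1), alpha   # also usable as an explanatory signal
```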


Author(s):  
Chengxi Li ◽  
Brent Harrison

In this paper, we build a multi-style generative model for stylish image captioning that uses multi-modality image features, ResNeXt features, and text features generated by DenseCap. We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them into captions. We demonstrate the effectiveness of our model in generating human-like captions by examining its performance on two datasets: the PERSONALITY-CAPTIONS dataset and the FlickrStyle10K dataset. We compare against a variety of state-of-the-art baselines on automatic NLP metrics such as BLEU, ROUGE-L, CIDEr, and SPICE (code will be available at https://github.com/cici-ai-club/3M). A qualitative study has also been conducted to verify that our 3M model can be used for generating different stylized captions.
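The multi-modality encoding can be pictured as projecting ResNeXt region features and DenseCap phrase tokens into a common space and concatenating them into a joint memory for the decoder. The sketch below is a hypothetical simplification; the dimensions, tokenisation of the DenseCap output, and module names are assumptions rather than the 3M implementation.

```python
import torch
import torch.nn as nn

class MultiModalityEncoder(nn.Module):
    """Encode visual (ResNeXt) and textual (DenseCap) features into one memory."""
    def __init__(self, visual_dim=2048, text_vocab=10000, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)    # ResNeXt region features
        self.text_embed = nn.Embedding(text_vocab, embed_dim)  # DenseCap phrase tokens
        self.text_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, visual_feats, densecap_tokens):
        # visual_feats: (B, Nv, visual_dim); densecap_tokens: (B, Nt) token ids
        v = torch.relu(self.visual_proj(visual_feats))          # (B, Nv, E)
        t, _ = self.text_rnn(self.text_embed(densecap_tokens))  # (B, Nt, E)
        return torch.cat([v, t], dim=1)  # joint memory attended by the caption decoder
```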


Author(s):  
Davide Picca ◽  
Dominique Jaccard ◽  
Gérald Eberlé

In the last decades, Natural Language Processing (NLP) has achieved a high level of success. Interactions between NLP and Serious Games have begun, and some Serious Games already include NLP techniques. The objectives of this paper are twofold: on the one hand, providing a simple framework to enable analysis of potential uses of NLP in Serious Games and, on the other hand, applying the NLP framework to existing Serious Games and giving an overview of the use of NLP in pedagogical Serious Games. In this paper we present 11 Serious Games exploiting NLP techniques. We present them systematically, according to the following structure: first, we highlight possible uses of NLP techniques in Serious Games; second, we describe the type of NLP implemented in each specific Serious Game; and third, we provide a link to possible purposes of use for the different actors interacting in the Serious Game.


Author(s):  
Zhihao Fan ◽  
Zhongyu Wei ◽  
Siyuan Wang ◽  
Ruize Wang ◽  
Zejun Li ◽  
...  

Existing research on image captioning usually represents an image using a scene graph with low-level facts (objects and relations) and fails to capture high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN is configured to take both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN is configured to take both captions and theme concepts as input for text representation reconstruction. Both settings aim to generate target captions with the same transformer-based decoder. During training, we further align the representations of theme concepts learned from images and their corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.
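The "theme nodes" idea can be sketched as a set of learnable memory vectors appended to the visual token sequence before a standard Transformer encoder. The PyTorch sketch below is an assumption-laden illustration (number of theme nodes, dimensions, and module names are invented), not the released TTN code.

```python
import torch
import torch.nn as nn

class TransformerWithThemeNodes(nn.Module):
    """Append learnable theme-concept vectors to visual tokens and encode jointly."""
    def __init__(self, feat_dim=512, n_theme=20, n_heads=8, n_layers=3):
        super().__init__()
        self.theme_nodes = nn.Parameter(torch.randn(n_theme, feat_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, visual_tokens):
        # visual_tokens: (B, N, feat_dim) scene-graph-based features
        b = visual_tokens.size(0)
        themes = self.theme_nodes.unsqueeze(0).expand(b, -1, -1)  # shared memory vectors
        out = self.encoder(torch.cat([visual_tokens, themes], dim=1))
        return out  # contextualised visual tokens followed by updated theme representations
```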


Author(s):  
Guan'an Wang ◽  
Yang Yang ◽  
Jian Cheng ◽  
Jinqiao Wang ◽  
Zengguang Hou

Recent deep Re-ID models mainly focus on learning high-level semantic features while failing to explicitly explore color information, which is one of the most important cues for person Re-ID. In this paper, we propose a novel Color-Sensitive Re-ID to take full advantage of color information. On the one hand, we train our model with real and fake images. By using the extra fake images, more color information can be exploited and overfitting during training can be avoided. On the other hand, we also train our model with images of the same person in different colors. By doing so, features are forced to focus on color differences in regions. To generate fake images with specified colors, we propose a novel Color Translation GAN (CTGAN) to learn mappings between different clothing colors and preserve identity consistency within the same clothing color. Extensive evaluations on two benchmark datasets show that our approach significantly outperforms state-of-the-art Re-ID models.
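The training idea of mixing real images with colour-translated fakes (produced offline by a CTGAN-style generator, which is not reproduced here) can be sketched as a single identity-classification step over the combined batch. Everything below, including the backbone choice and the number of identities, is an assumption for illustration only.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Assumed Re-ID backbone: ResNet-50 with an identity-classification head.
backbone = models.resnet50()
backbone.fc = nn.Linear(backbone.fc.in_features, 751)  # 751 identities is a common Re-ID setting
criterion = nn.CrossEntropyLoss()

def training_step(real_imgs, fake_imgs, ids):
    # fake_imgs: the same persons as real_imgs but with translated clothing colours,
    # so each fake image shares the identity label of its real counterpart.
    images = torch.cat([real_imgs, fake_imgs], dim=0)  # (2B, 3, H, W)
    labels = torch.cat([ids, ids], dim=0)              # (2B,)
    logits = backbone(images)
    return criterion(logits, labels)
```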


Author(s):  
Nannan Li ◽  
Zhenzhong Chen

In this paper, a novel image captioning approach is proposed to describe the content of images. Inspired by the visual processing of our cognitive system, we propose a visual-semantic LSTM model that locates the attended objects with their low-level features in the visual cell and then successively extracts high-level semantic features in the semantic cell. In addition, a state perturbation term is introduced into the word sampling strategy of the REINFORCE-based method to explore proper vocabularies during training. Experimental results on MS COCO and Flickr30K validate the effectiveness of our approach compared to state-of-the-art methods.
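The state-perturbation idea can be illustrated as adding Gaussian noise to the decoder state before sampling the next word in REINFORCE training, which encourages exploration of the vocabulary. The noise scale, the injection point, and all names in the sketch below are assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PerturbedSampler(nn.Module):
    """Sample words from a perturbed decoder state during REINFORCE training."""
    def __init__(self, hidden_dim, vocab_size, sigma=0.1):
        super().__init__()
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.sigma = sigma  # assumed perturbation scale

    def sample_word(self, hidden_state):
        # hidden_state: (B, hidden_dim) decoder state at the current time step
        perturbed = hidden_state + self.sigma * torch.randn_like(hidden_state)
        dist = Categorical(logits=self.out(perturbed))
        word = dist.sample()              # sampled token ids, shape (B,)
        return word, dist.log_prob(word)  # log-probability used in the REINFORCE loss
```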


2021 ◽  
Vol 11 (7) ◽  
pp. 3009
Author(s):  
Sungjin Park ◽  
Taesun Whang ◽  
Yeochan Yoon ◽  
Heuiseok Lim

Visual dialog is a challenging vision-language task in which a series of questions visually grounded by a given image are answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question, dialog history, and image) is required. Specifically, it is necessary for an agent to (1) determine the semantic intent of question and (2) align question-relevant textual and visual contents among heterogeneous modality inputs. In this paper, we propose Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs based on attention mechanisms. MVAN effectively captures the question-relevant information from the dialog history with two complementary modules (i.e., Topic Aggregation and Context Matching), and builds multimodal representations through sequential alignment processes (i.e., Modality Alignment). Experimental results on VisDial v1.0 dataset show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods under both single model and ensemble settings.
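The way MVAN extracts question-relevant information from the dialog history can be sketched as the question representation attending over per-round history representations. The module name, use of multi-head attention, and dimensions below are assumptions; this is not the released MVAN code.

```python
import torch
import torch.nn as nn

class TopicAggregation(nn.Module):
    """Let question tokens attend over dialog-history round representations."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, question, history):
        # question: (B, Lq, dim) question token features
        # history:  (B, T, dim)  one representation per dialog round
        attended, weights = self.attn(query=question, key=history, value=history)
        return attended, weights  # question features enriched with relevant history rounds
```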


10.2196/28754 ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. e28754
Author(s):  
Lu Ren ◽  
Hongfei Lin ◽  
Bo Xu ◽  
Shaowu Zhang ◽  
Liang Yang ◽  
...  

Background As a common mental disease, depression seriously affects people’s physical and mental health. According to statistics from the World Health Organization, depression is one of the main reasons for suicide and self-harm events in the world. Therefore, strengthening depression detection can effectively reduce the occurrence of suicide or self-harm events and thus save more people and families. With the development of computer technology, some researchers are trying to apply natural language processing techniques to automatically detect people who are depressed. Many existing feature engineering methods for depression detection are based on emotional characteristics, but these methods do not consider high-level emotional semantic information. Current deep learning methods for depression detection cannot accurately extract effective emotional semantic information. Objective In this paper, we propose an emotion-based attention network, including a semantic understanding network and an emotion understanding network, which can capture high-level emotional semantic information effectively to improve the depression detection task. Methods The semantic understanding network module is used to capture contextual semantic information. The emotion understanding network module is used to capture emotional semantic information. There are two units in the emotion understanding network module: a positive emotion understanding unit and a negative emotion understanding unit, which capture positive and negative emotional information, respectively. We further propose a dynamic fusion strategy in the emotion understanding network module to fuse the positive and negative emotional information. Results We evaluated our method on the Reddit data set. The experimental results showed that the proposed emotion-based attention network model achieved an accuracy, precision, recall, and F-measure of 91.30%, 91.91%, 96.15%, and 93.98%, respectively, results comparable with state-of-the-art methods. Conclusions The experimental results showed that our model is competitive with state-of-the-art models. The semantic understanding network module, the emotion understanding network module, and the dynamic fusion strategy are effective modules for depression detection. In addition, the experimental results verified that the emotional semantic information is effective for depression detection.
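The dynamic fusion strategy can be pictured as a learned gate that decides, per example, how much of the positive and negative emotion representations to keep. The gating form and dimensions in the sketch below are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicEmotionFusion(nn.Module):
    """Gated fusion of positive and negative emotion representations."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(dim * 2, dim)

    def forward(self, pos_repr, neg_repr):
        # pos_repr, neg_repr: (B, dim) outputs of the two emotion understanding units
        g = torch.sigmoid(self.gate(torch.cat([pos_repr, neg_repr], dim=-1)))
        return g * pos_repr + (1.0 - g) * neg_repr  # fused emotional representation
```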


2010 ◽  
Vol 1 (2) ◽  
pp. 34-36
Author(s):  
Mehedi Imam

In Bangladesh, the demand for judicial independence in practice has been a much-debated issue. That demand has been fulfilled, but people's expectations are not limited to having an independent judiciary; they also expect an impartial system and a cadre of people who will administer justice rationally, free from fear or force. The independence of the judiciary and impartial judicial practice are related concepts; one cannot be sustained without the other, and the need for practicing impartiality is well recognized. But the art of practicing impartiality does not develop overnight, as it is related to the development of one's attitude. It takes considerable time, resulting from understanding, appreciating, and acknowledging moral values, ethics, and professional responsibility. The judiciary mostly includes Judges and Advocates, who are expected to demonstrate a high level of moral values and impartiality towards people seeking justice and the rule of law. It is true that bench officers and clerks are also part of the process of ensuring the rule of law, with the same level of participation by law enforcement agencies such as the police. However, this paper includes only those who join the judiciary as a Judge/Magistrate or an Advocate, exploring the level and extent of ethical knowledge they receive as key role players in the system. DOI: http://dx.doi.org/10.3329/bioethics.v1i2.9628 Bangladesh Journal of Bioethics 2010; 1(2): 34-36

