Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking

2020 ◽  
Vol 34 (4) ◽  
pp. 571-584
Author(s):  
Rajarshi Biswas ◽  
Michael Barz ◽  
Daniel Sonntag

Image captioning is a challenging multimodal task. Deep learning has brought significant improvements, yet captions generated by humans are still considered better, which makes the task an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and low-level features obtained from the object-specific salient regions of the input image, embedding the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking of caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
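As a concrete illustration of the attention scheme described above (not the authors' implementation; the class name, dimensions, and additive scoring form are assumptions), soft attention over a joint space of top-down grid features and bottom-up Mask R-CNN region features might look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Additive soft attention over the union of grid and region features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, grid_feats, region_feats, h):
        # grid_feats: (B, N_g, D) CNN grid features (top-down)
        # region_feats: (B, N_r, D) embedded Mask R-CNN boxes (bottom-up)
        # h: (B, H) current decoder state
        feats = torch.cat([grid_feats, region_feats], dim=1)  # joint space
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.state_proj(h).unsqueeze(1)
        )).squeeze(-1)
        alpha = F.softmax(e, dim=1)                 # weights over all regions
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)
        return context, alpha
```

Because alpha assigns weights to object-specific regions, it can be visualized directly, which is the kind of explanatory signal a re-ranking step over beam candidates can exploit.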

Author(s):  
Xiang Kong ◽  
Qizhe Xie ◽  
Zihang Dai ◽  
Eduard Hovy

Mixture of Softmaxes (MoS) has been shown to be effective at addressing the expressiveness limitation of Softmax-based models. Despite this known advantage, MoS is held back in practice by its large memory and computational cost, due to the need to compute multiple Softmaxes. In this work, we set out to unleash the power of MoS in practical applications by investigating improved word coding schemes, which can effectively reduce the vocabulary size and hence relieve the memory and computation burden. We show that both BPE and our proposed Hybrid-LightRNN lead to improved encoding mechanisms that can halve the time and memory consumption of MoS without performance losses. With MoS, we achieve an improvement of 1.5 BLEU on the IWSLT 2014 German-to-English corpus and an improvement of 0.76 CIDEr on image captioning. Moreover, on the larger WMT 2014 machine translation dataset, our MoS-boosted Transformer yields a 29.6 BLEU score for English-to-German and a 42.1 BLEU score for English-to-French, outperforming the single-Softmax Transformer by 0.9 and 0.4 BLEU respectively and achieving the state-of-the-art result on the WMT 2014 English-to-German task.
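The MoS output layer itself is compact. A minimal PyTorch sketch follows (layer sizes, the tanh nonlinearity, and the number of components are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """p(y|x) = sum_k pi_k(h) * softmax(W h_k), with K latent projections."""
    def __init__(self, hidden_dim, vocab_size, n_components=4):
        super().__init__()
        self.n = n_components
        self.prior = nn.Linear(hidden_dim, n_components)            # mixture weights
        self.latent = nn.Linear(hidden_dim, n_components * hidden_dim)
        self.decoder = nn.Linear(hidden_dim, vocab_size)            # shared output layer

    def forward(self, h):
        # h: (B, H) -> mixture distribution over the vocabulary: (B, V)
        pi = F.softmax(self.prior(h), dim=-1)                       # (B, K)
        hk = torch.tanh(self.latent(h)).view(-1, self.n, h.size(-1))
        pk = F.softmax(self.decoder(hk), dim=-1)                    # (B, K, V)
        return (pi.unsqueeze(-1) * pk).sum(dim=1)                   # mix K softmaxes
```

The memory cost the abstract refers to comes from materializing K full (B, V) softmaxes, which is why shrinking V with BPE or Hybrid-LightRNN reduces time and memory almost proportionally.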


Author(s):  
Mirko Luca Lobina ◽  
Luigi Atzori ◽  
Davide Mula

Many audio watermarking techniques presented in recent years make use of masking and psychoacoustic models derived from signal processing. This basic idea is appealing because it guarantees a high level of robustness and bandwidth for the watermark as well as fidelity of the watermarked signal. This chapter first describes the relationship between digital rights management, intellectual property, and the use of watermarking techniques. Then, the combined use of watermarking and masking models is detailed, providing schemes, examples, and references. Finally, the authors present two strategies that apply a masking model to a classic watermarking technique. The joint use of classic frameworks and masking models seems to be one of the trends for the future of watermarking research. Several tests comparing the proposed strategies with the state of the art are also offered to give an idea of how to assess the effectiveness of a watermarking technique.
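As a rough sketch of the masking-plus-watermarking idea (purely illustrative and not from the chapter; a real system would use a proper psychoacoustic model, for which per-frame signal energy is only a crude stand-in here):

```python
import numpy as np

def embed_watermark(signal, wm_bits, alpha=0.05, frame=1024, rng=None):
    """Spread-spectrum sketch: embed one PN-modulated bit per frame,
    scaled by a masking-style estimate so the mark stays inaudible."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = signal.astype(float).copy()
    pn = rng.choice([-1.0, 1.0], size=frame)            # pseudo-noise carrier
    for i, bit in enumerate(wm_bits):
        s, e = i * frame, (i + 1) * frame
        if e > len(out):
            break
        mask = alpha * np.sqrt(np.mean(out[s:e] ** 2))  # crude masking estimate
        out[s:e] += mask * (1.0 if bit else -1.0) * pn  # louder frames hide more
    return out
```

The design choice the chapter highlights is exactly this scaling step: the masking model decides how much watermark energy each region of the signal can absorb without audible distortion.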


Author(s):  
Manar Abduljabbar Mizher ◽  
Mei Choo Ang ◽  
Ahmad Abdel Jabbar Mazhar

Key frame extraction is an essential technique in the computer vision field. The extracted key frames should summarize the salient events with excellent feasibility, great efficiency, and a high level of robustness. This is not an easy problem to solve because it depends on many visual features. This paper addresses the problem by investigating the relationship between the detection of these features and the accuracy of key frame extraction techniques using TRIZ. An improved algorithm for key frame extraction is then proposed based on accumulative optical flow with a self-adaptive threshold (AOF_ST), as recommended in the TRIZ inventive principles. Several video shots, including original and forged videos with complex conditions, are used to verify the experimental results. Comparing our results with those of state-of-the-art algorithms shows that the proposed extraction algorithm accurately summarizes the videos and generates a meaningful, compact set of key frames. On top of that, our proposed algorithm achieves compression rates of 124.4 and 31.4 in the best and worst cases on the KTH dataset, while the state-of-the-art algorithms achieve 8.90 in the best case.
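A minimal sketch of accumulative-optical-flow key frame selection with a self-adaptive threshold, in the spirit of AOF_ST (this is not the paper's exact algorithm; the mean-plus-k-standard-deviations threshold and the Farneback parameters are assumptions):

```python
import cv2
import numpy as np

def extract_key_frames(video_path, k=1.5):
    """Accumulate per-frame optical flow magnitude; emit a key frame
    whenever the accumulator exceeds an adaptive threshold, then reset."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    acc, mags, keys, idx = 0.0, [], [0], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = float(np.linalg.norm(flow))           # total motion this frame
        mags.append(mag)
        acc += mag
        thresh = np.mean(mags) + k * np.std(mags)   # self-adaptive threshold
        if acc > thresh:                            # enough motion accumulated
            keys.append(idx)
            acc = 0.0
        prev = gray
    cap.release()
    return keys
```

The compression rate reported above is then simply the total frame count divided by the number of extracted key frames.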


2021 ◽  
pp. 42-55
Author(s):  
Shitiz Gupta ◽  
...  

Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet, human-generated captions are still considered better, which makes it a challenging application for interactive machine learning. In this paper, we aim to compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models, which are fed into an Encoder-Decoder network based on stacked LSTMs with soft attention, along with embedded text, to generate high-accuracy captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.
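To make the pipeline concrete, here is a hedged sketch of the transfer-learning half: a frozen pre-trained CNN as feature extractor feeding a stacked-LSTM decoder (ResNet-50 and all layer sizes are illustrative choices, and the soft-attention layer is omitted for brevity):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Pre-trained CNN reused as a frozen feature extractor (transfer learning).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()           # expose the 2048-d pooled feature vector
cnn.eval()

class CaptionDecoder(nn.Module):
    """Illustrative two-layer (stacked) LSTM decoder over embedded text."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)   # image conditions the state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0).repeat(2, 1, 1)
        c0 = torch.zeros_like(h0)
        y, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(y)       # per-step logits over the vocabulary
```

Swapping the backbone (e.g., for another torchvision model) while keeping the decoder fixed is exactly the kind of comparison the abstract describes.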


2020 ◽  
Vol 34 (05) ◽  
pp. 7594-7601
Author(s):  
Pierre Colombo ◽  
Emile Chapuis ◽  
Matteo Manica ◽  
Emmanuel Vignon ◽  
Giovanna Varni ◽  
...  

The task of predicting dialog acts (DA) from conversational dialog is a key component in the development of conversational agents. Accurately predicting DAs requires precise modeling of both the conversation and the global tag dependencies. We leverage seq2seq approaches widely adopted in Neural Machine Translation (NMT) to improve the modeling of tag sequentiality. Seq2seq models are known to learn complex global dependencies, while currently proposed approaches using linear conditional random fields (CRF) only model local tag dependencies. In this work, we introduce a seq2seq model tailored for DA classification using a hierarchical encoder, a novel guided attention mechanism, and beam search applied to both training and inference. Compared to the state of the art, our model does not require handcrafted features and is trained end-to-end. Furthermore, the proposed approach achieves an unmatched accuracy score of 85% on SwDA and a state-of-the-art accuracy score of 91.6% on MRDA.
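The beam search component is generic and easy to state. A minimal sketch follows (the step_fn interface and plain log-probability scoring are assumptions for illustration, not the paper's training-time variant):

```python
def beam_search(step_fn, bos, eos, beam_size=5, max_len=30):
    """Generic beam search over tag sequences. step_fn(prefix) must return
    a list of (tag_id, log_prob) continuations; bos/eos are start/end ids."""
    beams = [([bos], 0.0)]
    done = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tag, lp in step_fn(prefix):
                candidates.append((prefix + [tag], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # finished hypotheses are set aside; the rest stay on the beam
            (done if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    done.extend(beams)
    return max(done, key=lambda c: c[1])[0]   # highest-scoring tag sequence
```

Keeping several hypotheses alive is what lets the model exploit the global tag dependencies that a purely greedy (or local CRF) decoder would miss.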


Author(s):  
Zhihao Fan ◽  
Zhongyu Wei ◽  
Siyuan Wang ◽  
Ruize Wang ◽  
Zejun Li ◽  
...  

Existing research for image captioning usually represents an image using a scene graph with low-level facts (objects and relations) and fails to capture the high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN is configured to take both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN is configured to take both captions and theme concepts as input for text representation reconstruction. Both settings aim to generate target captions with the same transformer-based decoder. During training, we further align representations of theme concepts learned from images and corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.
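A minimal sketch of the theme-node idea on the vision side (not the TCIC code; the layer sizes and the number of theme vectors are illustrative):

```python
import torch
import torch.nn as nn

class ThemeNodeEncoder(nn.Module):
    """Learned memory vectors ('theme nodes') appended to the visual token
    sequence before a standard Transformer encoder."""
    def __init__(self, d_model=512, n_themes=20, n_layers=3, n_heads=8):
        super().__init__()
        self.themes = nn.Parameter(torch.randn(n_themes, d_model))  # memory vectors
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, scene_graph_feats):
        # scene_graph_feats: (B, N, d_model) object/relation embeddings
        b = scene_graph_feats.size(0)
        nodes = self.themes.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([scene_graph_feats, nodes], dim=1)  # tokens + theme nodes
        return self.encoder(x)  # self-attention mixes themes into the tokens
```

Because the same parameter matrix can be appended to caption tokens on the language side, aligning the two resulting theme representations gives the cross-modality training signal the abstract describes.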


2020 ◽  
Vol 10 (17) ◽  
pp. 5978
Author(s):  
Viktar Atliha ◽  
Dmitrij Šešok

Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying image description by humans. In recent years, this research field has developed rapidly and a number of impressive results have been achieved. The typical models are based on neural networks, including convolutional ones for encoding images and recurrent ones for decoding them into text. Moreover, attention mechanisms and transformers are actively used to boost performance. However, even the best models are limited in quality by a lack of data. Generating a variety of descriptions of objects in different situations requires a large training set. The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. We expanded the training dataset using text augmentation methods. These methods include augmentation with synonyms as a baseline and the state-of-the-art language model Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than models trained on the dataset without augmentation.
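BERT-based caption augmentation can be sketched with the standard HuggingFace fill-mask pipeline (the single-word masking strategy below is an illustrative simplification, not necessarily the paper's scheme):

```python
from transformers import pipeline

# Standard fill-mask pipeline; model choice is illustrative.
fill = pipeline("fill-mask", model="bert-base-uncased")

def augment_caption(caption, word):
    """Replace one word with BERT's top in-context predictions,
    yielding paraphrased caption variants for training."""
    masked = caption.replace(word, fill.tokenizer.mask_token, 1)
    return [r["sequence"] for r in fill(masked, top_k=3)]

# e.g. augment_caption("a man riding a horse on a beach", "riding")
# might yield variants such as "a man on a horse on a beach".
```

Because BERT conditions on the whole sentence, its substitutions tend to stay grammatical and context-appropriate, which is the advantage over plain synonym replacement.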


Author(s):  
Weixuan Wang ◽  
Zhihong Chen ◽  
Haifeng Hu

Recently, attention mechanisms have been successfully applied in image captioning, but the existing attention methods are established only on low-level spatial features or high-level text features, which limits the richness of captions. In this paper, we propose a Hierarchical Attention Network (HAN) that enables attention to be calculated on a pyramidal hierarchy of features synchronously. The pyramidal hierarchy consists of features on diverse semantic levels, which allows predicting different words according to different features. On the other hand, due to the different modalities of the features, a Multivariate Residual Module (MRM) is proposed to learn joint representations from the features. The MRM is able to model projections and extract relevant relations among different features. Furthermore, we introduce a context gate to balance the contribution of different features. Compared with the existing methods, our approach applies hierarchical features and exploits several multimodal integration strategies, which can significantly improve the performance. The HAN is verified on the benchmark MSCOCO dataset, and the experimental results indicate that our model outperforms the state-of-the-art methods, achieving a BLEU-1 score of 80.9 and a CIDEr score of 121.7 on the Karpathy test split.
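The context gate mentioned above can be sketched in a few lines (dimensions and the two-level setup are illustrative; the full pyramid and the MRM are omitted):

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Learned gate balancing two attended contexts, e.g. a low-level
    spatial context and a high-level semantic context."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(2 * dim + hidden, dim)

    def forward(self, ctx_low, ctx_high, h):
        # ctx_low, ctx_high: (B, D) attended contexts; h: (B, H) decoder state
        g = torch.sigmoid(self.gate(torch.cat([ctx_low, ctx_high, h], dim=-1)))
        return g * ctx_low + (1 - g) * ctx_high  # per-dimension learned balance
```

Letting the gate depend on the decoder state means the model can lean on spatial features for concrete object words and on semantic features for relational or abstract words.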

