Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking

2020 ◽  
Vol 34 (4) ◽  
pp. 571-584
Author(s):  
Rajarshi Biswas ◽  
Michael Barz ◽  
Daniel Sonntag

Image captioning is a challenging multimodal task. Deep learning has brought significant improvements, yet captions generated by humans are still considered better, which makes the task an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and low-level features obtained from the object-specific salient regions of the input image, embedding the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking of caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
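As a concrete illustration of the attention scheme described above (not the authors' implementation; the class name, dimensions, and additive scoring form are assumptions), soft attention over a joint space of top-down grid features and bottom-up Mask R-CNN region features might look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Additive soft attention over the union of grid and region features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, grid_feats, region_feats, h):
        # grid_feats: (B, N_g, D) CNN grid features (top-down)
        # region_feats: (B, N_r, D) embedded Mask R-CNN boxes (bottom-up)
        # h: (B, H) current decoder state
        feats = torch.cat([grid_feats, region_feats], dim=1)  # joint space
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.state_proj(h).unsqueeze(1)
        )).squeeze(-1)
        alpha = F.softmax(e, dim=1)                 # weights over all regions
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)
        return context, alpha
```

Because alpha assigns weights to object-specific regions, it can be visualized directly, which is the kind of explanatory signal a re-ranking step over beam candidates can exploit.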

Author(s):  
Xiang Kong ◽  
Qizhe Xie ◽  
Zihang Dai ◽  
Eduard Hovy

Mixture of Softmaxes (MoS) has been shown to be effective at addressing the expressiveness limitation of Softmax-based models. Despite this known advantage, MoS is held back in practice by its large memory and computational cost, due to the need to compute multiple Softmaxes. In this work, we set out to unleash the power of MoS in practical applications by investigating improved word coding schemes, which can effectively reduce the vocabulary size and hence relieve the memory and computation burden. We show that both BPE and our proposed Hybrid-LightRNN lead to improved encoding mechanisms that can halve the time and memory consumption of MoS without performance losses. With MoS, we achieve an improvement of 1.5 BLEU on the IWSLT 2014 German-to-English corpus and an improvement of 0.76 CIDEr on image captioning. Moreover, on the larger WMT 2014 machine translation dataset, our MoS-boosted Transformer yields a 29.6 BLEU score for English-to-German and a 42.1 BLEU score for English-to-French, outperforming the single-Softmax Transformer by 0.9 and 0.4 BLEU respectively and achieving the state-of-the-art result on the WMT 2014 English-to-German task.
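The MoS output layer itself is compact. A minimal PyTorch sketch follows (layer sizes, the tanh nonlinearity, and the number of components are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """p(y|x) = sum_k pi_k(h) * softmax(W h_k), with K latent projections."""
    def __init__(self, hidden_dim, vocab_size, n_components=4):
        super().__init__()
        self.n = n_components
        self.prior = nn.Linear(hidden_dim, n_components)            # mixture weights
        self.latent = nn.Linear(hidden_dim, n_components * hidden_dim)
        self.decoder = nn.Linear(hidden_dim, vocab_size)            # shared output layer

    def forward(self, h):
        # h: (B, H) -> mixture distribution over the vocabulary: (B, V)
        pi = F.softmax(self.prior(h), dim=-1)                       # (B, K)
        hk = torch.tanh(self.latent(h)).view(-1, self.n, h.size(-1))
        pk = F.softmax(self.decoder(hk), dim=-1)                    # (B, K, V)
        return (pi.unsqueeze(-1) * pk).sum(dim=1)                   # mix K softmaxes
```

The memory cost the abstract refers to comes from materializing K full (B, V) softmaxes, which is why shrinking V with BPE or Hybrid-LightRNN reduces time and memory almost proportionally.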


Author(s):  
Mirko Luca Lobina ◽  
Luigi Atzori ◽  
Davide Mula

Many audio watermarking techniques presented in recent years make use of masking and psychoacoustic models derived from signal processing. This basic idea is appealing because it guarantees a high level of robustness and bandwidth for the watermark as well as fidelity of the watermarked signal. This chapter first describes the relationship between digital rights management, intellectual property, and the use of watermarking techniques. Then, the combined use of watermarking and masking models is detailed, providing schemes, examples, and references. Finally, the authors present two strategies that apply a masking model to a classic watermarking technique. The joint use of classic frameworks and masking models seems to be one of the trends for the future of watermarking research. Several tests comparing the proposed strategies with the state of the art are also offered to give an idea of how to assess the effectiveness of a watermarking technique.
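As a rough sketch of the masking-plus-watermarking idea (purely illustrative and not from the chapter; a real system would use a proper psychoacoustic model, for which per-frame signal energy is only a crude stand-in here):

```python
import numpy as np

def embed_watermark(signal, wm_bits, alpha=0.05, frame=1024, rng=None):
    """Spread-spectrum sketch: embed one PN-modulated bit per frame,
    scaled by a masking-style estimate so the mark stays inaudible."""
    rng = np.random.default_rng(0) if rng is None else rng
    out = signal.astype(float).copy()
    pn = rng.choice([-1.0, 1.0], size=frame)            # pseudo-noise carrier
    for i, bit in enumerate(wm_bits):
        s, e = i * frame, (i + 1) * frame
        if e > len(out):
            break
        mask = alpha * np.sqrt(np.mean(out[s:e] ** 2))  # crude masking estimate
        out[s:e] += mask * (1.0 if bit else -1.0) * pn  # louder frames hide more
    return out
```

The design choice the chapter highlights is exactly this scaling step: the masking model decides how much watermark energy each region of the signal can absorb without audible distortion.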


Author(s):  
Manar Abduljabbar Mizher ◽  
Mei Choo Ang ◽  
Ahmad Abdel Jabbar Mazhar

Key frame extraction is an essential technique in the computer vision field. The extracted key frames should summarize the salient events with excellent feasibility, great efficiency, and a high level of robustness. This is not an easy problem to solve because it depends on many visual features. This paper addresses the problem by investigating the relationship between the detection of these features and the accuracy of key frame extraction techniques using TRIZ. An improved algorithm for key frame extraction is then proposed based on accumulative optical flow with a self-adaptive threshold (AOF_ST), as recommended in the TRIZ inventive principles. Several video shots, including original and forged videos with complex conditions, are used to verify the experimental results. Comparing our results with those of state-of-the-art algorithms shows that the proposed extraction algorithm accurately summarizes the videos and generates a meaningful, compact set of key frames. On top of that, our proposed algorithm achieves compression rates of 124.4 and 31.4 in the best and worst cases on the KTH dataset, while the state-of-the-art algorithms achieve 8.90 in the best case.
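A minimal sketch of accumulative-optical-flow key frame selection with a self-adaptive threshold, in the spirit of AOF_ST (this is not the paper's exact algorithm; the mean-plus-k-standard-deviations threshold and the Farneback parameters are assumptions):

```python
import cv2
import numpy as np

def extract_key_frames(video_path, k=1.5):
    """Accumulate per-frame optical flow magnitude; emit a key frame
    whenever the accumulator exceeds an adaptive threshold, then reset."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    acc, mags, keys, idx = 0.0, [], [0], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = float(np.linalg.norm(flow))           # total motion this frame
        mags.append(mag)
        acc += mag
        thresh = np.mean(mags) + k * np.std(mags)   # self-adaptive threshold
        if acc > thresh:                            # enough motion accumulated
            keys.append(idx)
            acc = 0.0
        prev = gray
    cap.release()
    return keys
```

The compression rate reported above is then simply the total frame count divided by the number of extracted key frames.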


2021 ◽  
pp. 42-55
Author(s):  
Shitiz Gupta ◽  
...  

Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet, human-generated captions are still considered better, which makes it a challenging application for interactive machine learning. In this paper, we aim to compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models, which are fed into an Encoder-Decoder network based on stacked LSTMs with soft attention, along with embedded text, to generate high-accuracy captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.
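To make the pipeline concrete, here is a hedged sketch of the transfer-learning half: a frozen pre-trained CNN as feature extractor feeding a stacked-LSTM decoder (ResNet-50 and all layer sizes are illustrative choices, and the soft-attention layer is omitted for brevity):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Pre-trained CNN reused as a frozen feature extractor (transfer learning).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()           # expose the 2048-d pooled feature vector
cnn.eval()

class CaptionDecoder(nn.Module):
    """Illustrative two-layer (stacked) LSTM decoder over embedded text."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)   # image conditions the state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0).repeat(2, 1, 1)
        c0 = torch.zeros_like(h0)
        y, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(y)       # per-step logits over the vocabulary
```

Swapping the backbone (e.g., for another torchvision model) while keeping the decoder fixed is exactly the kind of comparison the abstract describes.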


2020 ◽  
Vol 34 (05) ◽  
pp. 7594-7601
Author(s):  
Pierre Colombo ◽  
Emile Chapuis ◽  
Matteo Manica ◽  
Emmanuel Vignon ◽  
Giovanna Varni ◽  
...  

The task of predicting dialog acts (DA) from conversational dialog is a key component in the development of conversational agents. Accurately predicting DAs requires precise modeling of both the conversation and the global tag dependencies. We leverage seq2seq approaches widely adopted in Neural Machine Translation (NMT) to improve the modeling of tag sequentiality. Seq2seq models are known to learn complex global dependencies, while currently proposed approaches using linear conditional random fields (CRF) only model local tag dependencies. In this work, we introduce a seq2seq model tailored for DA classification using a hierarchical encoder, a novel guided attention mechanism, and beam search applied to both training and inference. Compared to the state of the art, our model does not require handcrafted features and is trained end-to-end. Furthermore, the proposed approach achieves an unmatched accuracy score of 85% on SwDA and a state-of-the-art accuracy score of 91.6% on MRDA.
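The beam search component is generic and easy to state. A minimal sketch follows (the step_fn interface and plain log-probability scoring are assumptions for illustration, not the paper's training-time variant):

```python
def beam_search(step_fn, bos, eos, beam_size=5, max_len=30):
    """Generic beam search over tag sequences. step_fn(prefix) must return
    a list of (tag_id, log_prob) continuations; bos/eos are start/end ids."""
    beams = [([bos], 0.0)]
    done = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tag, lp in step_fn(prefix):
                candidates.append((prefix + [tag], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # finished hypotheses are set aside; the rest stay on the beam
            (done if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    done.extend(beams)
    return max(done, key=lambda c: c[1])[0]   # highest-scoring tag sequence
```

Keeping several hypotheses alive is what lets the model exploit the global tag dependencies that a purely greedy (or local CRF) decoder would miss.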


Author(s):  
Zhihao Fan ◽  
Zhongyu Wei ◽  
Siyuan Wang ◽  
Ruize Wang ◽  
Zejun Li ◽  
...  

Existing research for image captioning usually represents an image using a scene graph with low-level facts (objects and relations) and fails to capture the high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN is configured to take both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN is configured to take both captions and theme concepts as input for text representation reconstruction. Both settings aim to generate target captions with the same transformer-based decoder. During training, we further align representations of theme concepts learned from images and corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.
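A minimal sketch of the theme-node idea on the vision side (not the TCIC code; the layer sizes and the number of theme vectors are illustrative):

```python
import torch
import torch.nn as nn

class ThemeNodeEncoder(nn.Module):
    """Learned memory vectors ('theme nodes') appended to the visual token
    sequence before a standard Transformer encoder."""
    def __init__(self, d_model=512, n_themes=20, n_layers=3, n_heads=8):
        super().__init__()
        self.themes = nn.Parameter(torch.randn(n_themes, d_model))  # memory vectors
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, scene_graph_feats):
        # scene_graph_feats: (B, N, d_model) object/relation embeddings
        b = scene_graph_feats.size(0)
        nodes = self.themes.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([scene_graph_feats, nodes], dim=1)  # tokens + theme nodes
        return self.encoder(x)  # self-attention mixes themes into the tokens
```

Because the same parameter matrix can be appended to caption tokens on the language side, aligning the two resulting theme representations gives the cross-modality training signal the abstract describes.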


2020 ◽  
Vol 10 (17) ◽  
pp. 5978
Author(s):  
Viktar Atliha ◽  
Dmitrij Šešok

Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying image description by humans. In recent years, this research field has developed rapidly and a number of impressive results have been achieved. The typical models are based on neural networks, including convolutional ones for encoding images and recurrent ones for decoding them into text. Moreover, attention mechanisms and transformers are actively used to boost performance. However, even the best models are limited in quality by a lack of data. Generating a variety of descriptions of objects in different situations requires a large training set. The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. We expanded the training dataset using text augmentation methods. These methods include augmentation with synonyms as a baseline and the state-of-the-art language model Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than models trained on the dataset without augmentation.
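BERT-based caption augmentation can be sketched with the standard HuggingFace fill-mask pipeline (the single-word masking strategy below is an illustrative simplification, not necessarily the paper's scheme):

```python
from transformers import pipeline

# Standard fill-mask pipeline; model choice is illustrative.
fill = pipeline("fill-mask", model="bert-base-uncased")

def augment_caption(caption, word):
    """Replace one word with BERT's top in-context predictions,
    yielding paraphrased caption variants for training."""
    masked = caption.replace(word, fill.tokenizer.mask_token, 1)
    return [r["sequence"] for r in fill(masked, top_k=3)]

# e.g. augment_caption("a man riding a horse on a beach", "riding")
# might yield variants such as "a man on a horse on a beach".
```

Because BERT conditions on the whole sentence, its substitutions tend to stay grammatical and context-appropriate, which is the advantage over plain synonym replacement.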


Author(s):  
Weixuan Wang ◽  
Zhihong Chen ◽  
Haifeng Hu

Recently, attention mechanisms have been successfully applied in image captioning, but the existing attention methods are established only on low-level spatial features or high-level text features, which limits the richness of captions. In this paper, we propose a Hierarchical Attention Network (HAN) that enables attention to be calculated on a pyramidal hierarchy of features synchronously. The pyramidal hierarchy consists of features on diverse semantic levels, which allows predicting different words according to different features. On the other hand, due to the different modalities of the features, a Multivariate Residual Module (MRM) is proposed to learn joint representations from the features. The MRM is able to model projections and extract relevant relations among different features. Furthermore, we introduce a context gate to balance the contribution of different features. Compared with the existing methods, our approach applies hierarchical features and exploits several multimodal integration strategies, which can significantly improve the performance. The HAN is verified on the benchmark MSCOCO dataset, and the experimental results indicate that our model outperforms the state-of-the-art methods, achieving a BLEU-1 score of 80.9 and a CIDEr score of 121.7 on the Karpathy test split.
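The context gate mentioned above can be sketched in a few lines (dimensions and the two-level setup are illustrative; the full pyramid and the MRM are omitted):

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Learned gate balancing two attended contexts, e.g. a low-level
    spatial context and a high-level semantic context."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(2 * dim + hidden, dim)

    def forward(self, ctx_low, ctx_high, h):
        # ctx_low, ctx_high: (B, D) attended contexts; h: (B, H) decoder state
        g = torch.sigmoid(self.gate(torch.cat([ctx_low, ctx_high, h], dim=-1)))
        return g * ctx_low + (1 - g) * ctx_high  # per-dimension learned balance
```

Letting the gate depend on the decoder state means the model can lean on spatial features for concrete object words and on semantic features for relational or abstract words.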

