Image Caption Generation and Comprehensive Comparison of Image Encoders

2021, pp. 42-55
Author(s): Shitiz Gupta et al.

Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet, human-generated captions are still considered better, which makes it a challenging application for interactive machine learning. In this paper, we aim to compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models, which are fed into an Encoder-Decoder network based on Stacked LSTMs with soft attention, along with embedded text, to generate high-accuracy captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.
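
As a rough illustration of the pipeline described above, the following sketch (not the authors' code; module names, feature dimensions, and the two-layer stacking are assumptions) shows soft attention over pre-extracted CNN feature vectors feeding a decoder built from two stacked LSTM cells:

```python
# Minimal sketch of soft attention over transfer-learning image features
# driving a two-layer (stacked) LSTM caption decoder. Purely illustrative.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weight per image region
        return (alpha * feats).sum(dim=1)        # weighted context vector

class StackedLSTMDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = SoftAttention(feat_dim, hidden_dim, 256)
        self.lstm1 = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.lstm2 = nn.LSTMCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, state1, state2):
        context = self.attn(feats, state2[0])    # attend using the top-layer hidden state
        h1, c1 = self.lstm1(torch.cat([self.embed(word_ids), context], dim=1), state1)
        h2, c2 = self.lstm2(h1, state2)
        return self.out(h2), (h1, c1), (h2, c2)  # word scores + updated LSTM states
```

At each decoding step the attention weights re-weight the image regions, so the context vector changes as the caption unfolds.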

2018, Vol 8 (10), pp. 1850
Author(s): Zhibin Guan, Kang Liu, Yan Ma, Xu Qian, Tongkai Ji

Image caption generation is an attractive research area that focuses on generating natural language sentences to describe the visual content of a given image. It is an interdisciplinary subject combining computer vision (CV) and natural language processing (NLP). Existing image captioning methods are mainly focused on generating the final image caption directly, which may lose significant identification information about objects contained in the raw image. Therefore, we propose a new middle-level attribute-based language retouching (MLALR) method to solve this problem. Our proposed MLALR method uses the middle-level attributes predicted from the object regions to retouch the intermediate image description, which is generated by our language generation model. The advantage of our MLALR method is that it can correct descriptive errors in the intermediate image description and make the final image caption more accurate. Moreover, evaluation on the benchmark datasets MSCOCO, Flickr8K, and Flickr30K validated the impressive performance of our MLALR method on the evaluation metrics BLEU, METEOR, ROUGE-L, CIDEr, and SPICE.
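
A toy sketch (not the paper's MLALR implementation; the attribute dictionary and the confidence threshold are illustrative assumptions) of how attribute-based retouching of an intermediate description could look:

```python
# Toy illustration: generic words in an intermediate caption are replaced by
# higher-confidence middle-level attributes predicted from object regions.
def retouch_caption(intermediate_caption, region_attributes, threshold=0.7):
    """region_attributes maps a generic word to (specific_attribute, confidence)."""
    retouched = []
    for tok in intermediate_caption.split():
        attr, conf = region_attributes.get(tok, (tok, 0.0))
        retouched.append(attr if conf >= threshold else tok)
    return " ".join(retouched)

# Example: the generic word "animal" is corrected to the predicted attribute "zebra".
print(retouch_caption("an animal standing in a field", {"animal": ("zebra", 0.92)}))
```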


Author(s): Chengxi Li, Brent Harrison

In this paper, we build a multi-style generative model for stylish image captioning which uses multi-modality image features, ResNeXt features, and text features generated by DenseCap. We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them into captions. We demonstrate the effectiveness of our model in generating human-like captions by examining its performance on two datasets, the PERSONALITY-CAPTIONS dataset and the FlickrStyle10K dataset. We compare against a variety of state-of-the-art baselines on various automatic NLP metrics such as BLEU, ROUGE-L, CIDEr, and SPICE (code will be available at https://github.com/cici-ai-club/3M). A qualitative study has also been conducted to verify that our 3M model can be used to generate different stylized captions.
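
A minimal sketch (not the authors' 3M code; dimensions, module names, and the bag-of-words treatment of the DenseCap text are assumptions) of fusing multi-modality inputs before decoding:

```python
# Illustrative fusion of pooled ResNeXt image features with an embedding of
# DenseCap-style region captions, producing one vector for a caption decoder.
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=10000, txt_dim=256, out_dim=512):
        super().__init__()
        self.txt_embed = nn.EmbeddingBag(vocab_size, txt_dim)   # bag of region-caption tokens
        self.fuse = nn.Linear(img_dim + txt_dim, out_dim)

    def forward(self, img_feats, region_caption_ids):
        # img_feats: (batch, img_dim); region_caption_ids: (batch, num_tokens)
        txt = self.txt_embed(region_caption_ids)
        return torch.relu(self.fuse(torch.cat([img_feats, txt], dim=1)))
```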


Artificial intelligence has opened the door to new opportunities for research and development, and recent advances in machine learning and deep learning have paved the way for dealing with complex problems more easily. Nowadays, nearly every aspect of human life can be framed as a problem statement that can be implemented and is useful in one way or another. One such aspect is the human ability to understand and describe our surroundings and to act on them accordingly. This ability can also be built into machines or bots to make human-machine interaction easier. Generating captions for images is the same task: describing an image based on what one sees. It can be considered a combination of computer vision and natural language processing. In this paper we survey various methods and techniques that are useful for understanding how this task can be done. The survey mainly focuses on neural network techniques, because they give state-of-the-art results.


Author(s): Chang Liu, Fuchun Sun, Changhu Wang, Feng Wang, Alan Yuille

In this work we formulate the problem of image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Unlike most existing work, where the whole image is represented by a convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects, which serves as the source sequence of the RNN model. In this way, the sequential representation of an image can be naturally translated into a sequence of words as the target sequence of the RNN model. To represent the image sequentially, we extract the object features in the image and arrange them in an order using convolutional neural networks. To further leverage the visual information from the encoded objects, a sequential attention layer is introduced to selectively attend to the objects that are relevant to generating the corresponding words in the sentences. Extensive experiments are conducted to validate the proposed approach on the popular MS COCO benchmark dataset, and the proposed model surpasses the state-of-the-art methods in all metrics following the dataset splits of previous work. The proposed approach is also evaluated by the evaluation server of the MS COCO captioning challenge, and achieves very competitive results, e.g., a CIDEr of 1.029 (c5) and 1.064 (c40).
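
A minimal sketch (not the paper's implementation; the shapes, dot-product attention, and greedy decoding loop are illustrative assumptions) of treating detected objects as the source sequence of a sequence-to-sequence RNN with a sequential attention layer:

```python
# Illustrative: object feature vectors form the source sequence of an LSTM encoder,
# and a decoder attends over the encoded objects while emitting words.
import torch
import torch.nn as nn

obj_dim, hid, vocab = 2048, 512, 10000
encoder = nn.LSTM(obj_dim, hid, batch_first=True)       # reads the object sequence
decoder = nn.LSTMCell(hid + 300, hid)                    # word embedding + attended context
embed = nn.Embedding(vocab, 300)
out = nn.Linear(hid, vocab)

objects = torch.randn(1, 5, obj_dim)                     # 5 detected objects, ordered by the CNN
enc_states, _ = encoder(objects)                         # (1, 5, hid)

word = torch.tensor([1])                                 # assumed <start> token id
h = c = torch.zeros(1, hid)
for _ in range(10):                                      # greedy decoding sketch
    scores = (enc_states @ h.unsqueeze(2)).squeeze(2)    # sequential attention scores
    alpha = torch.softmax(scores, dim=1)
    context = (alpha.unsqueeze(2) * enc_states).sum(1)   # attended object context
    h, c = decoder(torch.cat([embed(word), context], 1), (h, c))
    word = out(h).argmax(1)                              # next word (greedy choice)
```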


Author(s): Chaitrali Prasanna Chaudhari, Satish Devane

“Image captioning is the process of generating a textual description of an image.” It deploys both computer vision and natural language processing for caption generation. However, the majority of image captioning systems offer unclear depictions of objects such as “man”, “woman”, “group of people”, “building”, etc. Hence, this paper intends to develop an intelligent image captioning model. The adopted model comprises a few steps: word generation, sentence formation, and caption generation. Initially, the input image is subjected to a deep learning classifier, a Convolutional Neural Network (CNN). Since the classifier is already trained on the relevant words related to all images, it can easily classify the words associated with a given image. Further, a set of sentences is formed from the generated words using a Long Short-Term Memory (LSTM) model. The likelihood of the formed sentences is computed using the Maximum Likelihood (ML) function, and the sentences with higher probability are taken and used to generate the caption describing the visual content of the scene. As a major novelty, this paper aims to enhance the performance of the CNN by optimally tuning its weights and activation function. It introduces a new enhanced optimization algorithm, Rider with Randomized Bypass and Over-taker update (RR-BOU), for this optimal selection. The proposed RR-BOU is an enhanced version of the Rider Optimization Algorithm (ROA). Finally, the performance of the proposed captioning model is compared against other conventional models through statistical analysis.
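
A toy sketch (not the RR-BOU pipeline; the candidate sentences and per-word probabilities are made-up numbers) of the likelihood-based sentence selection step described above:

```python
# Illustrative: candidate sentences are scored by the sum of per-word log-probabilities
# assigned by the language model, and the most likely one becomes the caption.
import math

def sentence_log_likelihood(word_probs):
    """word_probs: per-word probabilities assigned by the language model."""
    return sum(math.log(p) for p in word_probs)

candidates = {
    "a man riding a horse": [0.9, 0.8, 0.7, 0.9, 0.85],
    "a person on an animal": [0.6, 0.5, 0.7, 0.6, 0.4],
}
best = max(candidates, key=lambda s: sentence_log_likelihood(candidates[s]))
print(best)   # the higher-probability sentence is kept as the image caption
```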


2020, Vol 34 (4), pp. 571-584
Author(s): Rajarshi Biswas, Michael Barz, Daniel Sonntag

Image captioning is a challenging multimodal task. Significant improvements have been obtained with deep learning. Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object-specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state-of-the-art in image captioning.
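
A minimal sketch (not the authors' implementation; the overlap score and its weight are illustrative assumptions) of re-ranking beam search candidates with an explanatory feature such as agreement with detected object labels:

```python
# Illustrative: each beam candidate carries its decoder log-probability, and an
# additional score (overlap with detected object labels) re-orders the beam.
def rerank(candidates, detected_objects, weight=0.5):
    """candidates: list of (caption, log_prob); detected_objects: set of labels."""
    def object_overlap(caption):
        return sum(1 for w in caption.split() if w in detected_objects)
    return sorted(candidates,
                  key=lambda c: c[1] + weight * object_overlap(c[0]),
                  reverse=True)

beam = [("a man sitting on a bench", -4.1), ("a person near a dog", -3.9)]
print(rerank(beam, {"man", "bench"})[0][0])   # the object-consistent caption wins
```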


Author(s): Megha J Panicker, Vikas Upadhayay, Gunjan Sethi, Vrinda Mathur

In the modern era, image captioning has become one of the most widely required tools. Moreover, there are built-in applications that generate and provide a caption for a given image, and all of these rely on deep neural network models. The process of generating a description of an image is called image captioning. It requires recognizing the important objects, their attributes, and the relationships among the objects in an image, and it generates syntactically and semantically correct sentences. In this paper, we present a deep learning model to describe images and generate captions using computer vision and machine translation. This paper aims to detect the different objects found in an image, recognize the relationships between those objects, and generate captions. The dataset used is Flickr8k, the programming language used is Python3, and an ML technique called Transfer Learning is implemented with the help of the Xception model to demonstrate the proposed experiment. This paper also elaborates on the functions and structure of the various neural networks involved. Generating image captions is an important aspect of computer vision and natural language processing. Image caption generators can find applications in image segmentation, as used by Facebook and Google Photos, and their use can be extended to video frames. They can automate the job of a person who has to interpret images, and they have immense scope in helping visually impaired people.
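
A minimal sketch of the transfer-learning step described above, assuming TensorFlow/Keras and a local copy of Flickr8k (the file path is hypothetical): a pre-trained Xception network, with its classification head removed, turns each image into a feature vector that the caption decoder can consume.

```python
# Extract image feature vectors with a pre-trained Xception model (transfer learning).
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

encoder = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    img = load_img(image_path, target_size=(299, 299))        # Xception's input size
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return encoder.predict(x)                                  # shape (1, 2048)

# features = extract_features("Flickr8k_Dataset/example.jpg")  # hypothetical path
```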


2021, Vol 2021, pp. 1-10
Author(s): Tianlong Gu, Hongliang Chen, Chenzhong Bin, Liang Chang, Wei Chen

Deep learning systems have been phenomenally successful in the fields of computer vision, speech recognition, and natural language processing. Recently, researchers have adopted deep learning techniques to tackle collaborative filtering with implicit feedback. However, the existing methods generally profile both users and items directly, while neglecting the similarities between users’ and items’ neighborhoods. To this end, we propose the neighborhood attentional memory networks (NAMN), a deep learning recommendation model applying two dedicated memory networks to capture users’ neighborhood relations and items’ neighborhood relations, respectively. Specifically, we first design the user neighborhood component and the item neighborhood component based on memory networks and attention mechanisms. Then, by the associative addressing scheme with the user and item memories in the neighborhood components, we capture the complex user-item neighborhood relations. Stacking multiple memory modules together yields deeper architectures that explore higher-order complex user-item neighborhood relations. Finally, the output module jointly exploits the user and item neighborhood information with the user and item memories to obtain the ranking score. Extensive experiments on three real-world datasets demonstrate significant improvements of the proposed NAMN method over the state-of-the-art methods.
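
A minimal sketch (not the NAMN code; dimensions and the dot-product addressing are illustrative assumptions) of the associative addressing idea, where a user embedding queries a memory of neighbor embeddings:

```python
# Illustrative associative addressing: attention weights address the memory slots
# and the weighted read-out summarizes the user's neighborhood.
import torch

def address_memory(query, memory):
    # query: (dim,); memory: (num_neighbors, dim)
    weights = torch.softmax(memory @ query, dim=0)   # addressing weights over slots
    return weights @ memory                          # neighborhood read-out vector

user = torch.randn(64)
neighbor_memory = torch.randn(10, 64)                # embeddings of 10 neighboring users
neighborhood_vec = address_memory(user, neighbor_memory)
```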


2020, Vol 2020, pp. 1-13
Author(s): Haoran Wang, Yue Zhang, Xiaosheng Yu

In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field of artificial intelligence and has become an interesting and arduous task. Image captioning, automatically generating natural language descriptions according to the content observed in an image, is an important part of scene understanding, which combines the knowledge of computer vision and natural language processing. The applications of image captioning are extensive and significant, for example, in human-computer interaction. This paper summarizes the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation tasks. Furthermore, the advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are provided. Finally, this paper highlights some open challenges in the image captioning task.


Author(s): Kemal Oflazer

Morphology is the study of the structure of words and of how words are formed by combining smaller units of linguistic information called morphemes. Any natural language processing application needs to computationally process the words in a language before any more complex processing is done. This is especially a must for morphologically complex languages. After a compact overview of the basic concepts in morphology, this chapter presents the state-of-the-art computational approaches to morphology, concentrating on two-level morphology and cascaded rules, and describing how morphographemics and morphotactics are handled in a finite-state setting. The chapter then summarizes recent approaches applying machine learning techniques to morphological processing.
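
A toy sketch (purely illustrative, not a full two-level morphology system; the states and morpheme labels are assumptions) of finite-state morphotactics, i.e., checking that morphemes combine in a legal order:

```python
# Tiny finite-state acceptor for morpheme ordering: root -> optional plural -> end.
TRANSITIONS = {
    ("START", "root"): "ROOT",
    ("ROOT", "plural"): "INFLECTED",
}
FINAL_STATES = {"ROOT", "INFLECTED"}

def accepts(morphemes):
    state = "START"
    for m in morphemes:
        state = TRANSITIONS.get((state, m))
        if state is None:
            return False
    return state in FINAL_STATES

print(accepts(["root", "plural"]))   # True: root+plural is a legal word form
print(accepts(["plural"]))           # False: a plural marker cannot stand alone
```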

