Image Caption Generation and Comprehensive Comparison of Image Encoders

2021, pp. 42-55
Author(s): Shitiz Gupta et al.

Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet, human-generated captions are still considered better, which makes it a challenging application for interactive machine learning. In this paper, we aim to compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models, which are fed into an Encoder-Decoder network based on Stacked LSTMs with soft attention, along with embedded text, to generate high-accuracy captions. We compare these models on several benchmark datasets using evaluation metrics such as BLEU and METEOR.
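
As a rough illustration of the pipeline described above, the following sketch (not the authors' code; module names, feature dimensions, and the two-layer stacking are assumptions) shows soft attention over pre-extracted CNN feature vectors feeding a decoder built from two stacked LSTM cells:

```python
# Minimal sketch of soft attention over transfer-learning image features
# driving a two-layer (stacked) LSTM caption decoder. Purely illustrative.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weight per image region
        return (alpha * feats).sum(dim=1)        # weighted context vector

class StackedLSTMDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = SoftAttention(feat_dim, hidden_dim, 256)
        self.lstm1 = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.lstm2 = nn.LSTMCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, state1, state2):
        context = self.attn(feats, state2[0])    # attend using the top-layer hidden state
        h1, c1 = self.lstm1(torch.cat([self.embed(word_ids), context], dim=1), state1)
        h2, c2 = self.lstm2(h1, state2)
        return self.out(h2), (h1, c1), (h2, c2)  # word scores + updated LSTM states
```

At each decoding step the attention weights re-weight the image regions, so the context vector changes as the caption unfolds.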

2018, Vol 8 (10), pp. 1850
Author(s): Zhibin Guan, Kang Liu, Yan Ma, Xu Qian, Tongkai Ji

Image caption generation is an attractive research area that focuses on generating natural language sentences to describe the visual content of a given image. It is an interdisciplinary subject combining computer vision (CV) and natural language processing (NLP). Existing image captioning methods are mainly focused on generating the final image caption directly, which may lose significant identification information about objects contained in the raw image. Therefore, we propose a new middle-level attribute-based language retouching (MLALR) method to solve this problem. Our proposed MLALR method uses the middle-level attributes predicted from the object regions to retouch the intermediate image description, which is generated by our language generation model. The advantage of our MLALR method is that it can correct descriptive errors in the intermediate image description and make the final image caption more accurate. Moreover, evaluation on the benchmark datasets MSCOCO, Flickr8K, and Flickr30K validated the impressive performance of our MLALR method on the evaluation metrics BLEU, METEOR, ROUGE-L, CIDEr, and SPICE.
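
A toy sketch (not the paper's MLALR implementation; the attribute dictionary and the confidence threshold are illustrative assumptions) of how attribute-based retouching of an intermediate description could look:

```python
# Toy illustration: generic words in an intermediate caption are replaced by
# higher-confidence middle-level attributes predicted from object regions.
def retouch_caption(intermediate_caption, region_attributes, threshold=0.7):
    """region_attributes maps a generic word to (specific_attribute, confidence)."""
    retouched = []
    for tok in intermediate_caption.split():
        attr, conf = region_attributes.get(tok, (tok, 0.0))
        retouched.append(attr if conf >= threshold else tok)
    return " ".join(retouched)

# Example: the generic word "animal" is corrected to the predicted attribute "zebra".
print(retouch_caption("an animal standing in a field", {"animal": ("zebra", 0.92)}))
```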


Author(s): Chengxi Li, Brent Harrison

In this paper, we build a multi-style generative model for stylish image captioning which uses multi-modality image features, ResNeXt features, and text features generated by DenseCap. We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them into captions. We demonstrate the effectiveness of our model in generating human-like captions by examining its performance on two datasets, the PERSONALITY-CAPTIONS dataset and the FlickrStyle10K dataset. We compare against a variety of state-of-the-art baselines on various automatic NLP metrics such as BLEU, ROUGE-L, CIDEr, and SPICE (code will be available at https://github.com/cici-ai-club/3M). A qualitative study has also been conducted to verify that our 3M model can be used to generate different stylized captions.
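
A minimal sketch (not the authors' 3M code; dimensions, module names, and the bag-of-words treatment of the DenseCap text are assumptions) of fusing multi-modality inputs before decoding:

```python
# Illustrative fusion of pooled ResNeXt image features with an embedding of
# DenseCap-style region captions, producing one vector for a caption decoder.
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=10000, txt_dim=256, out_dim=512):
        super().__init__()
        self.txt_embed = nn.EmbeddingBag(vocab_size, txt_dim)   # bag of region-caption tokens
        self.fuse = nn.Linear(img_dim + txt_dim, out_dim)

    def forward(self, img_feats, region_caption_ids):
        # img_feats: (batch, img_dim); region_caption_ids: (batch, num_tokens)
        txt = self.txt_embed(region_caption_ids)
        return torch.relu(self.fuse(torch.cat([img_feats, txt], dim=1)))
```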


Artificial intelligence has opened the door to new opportunities for research and development, and recent advances in machine learning and deep learning have paved the way for dealing with complex problems more easily. Nowadays, nearly every aspect of human life can be framed as a problem statement that can be implemented and is useful in one way or another. One such aspect is the human ability to understand and describe our surroundings and to act on them accordingly. This ability can also be built into machines or bots to make human-machine interaction easier. Generating captions for images is the same task: describing an image based on what one sees. It can be considered a combination of computer vision and natural language processing. In this paper we survey various methods and techniques that are useful for understanding how this task can be done. The survey mainly focuses on neural network techniques, because they give state-of-the-art results.


Author(s): Chang Liu, Fuchun Sun, Changhu Wang, Feng Wang, Alan Yuille

In this work we formulate the problem of image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Unlike most existing work, where the whole image is represented by a convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects, which serves as the source sequence of the RNN model. In this way, the sequential representation of an image can be naturally translated into a sequence of words as the target sequence of the RNN model. To represent the image sequentially, we extract the object features in the image and arrange them in an order using convolutional neural networks. To further leverage the visual information from the encoded objects, a sequential attention layer is introduced to selectively attend to the objects that are relevant to generating the corresponding words in the sentences. Extensive experiments are conducted to validate the proposed approach on the popular MS COCO benchmark dataset, and the proposed model surpasses the state-of-the-art methods in all metrics following the dataset splits of previous work. The proposed approach is also evaluated by the evaluation server of the MS COCO captioning challenge, and achieves very competitive results, e.g., a CIDEr of 1.029 (c5) and 1.064 (c40).
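
A minimal sketch (not the paper's implementation; the shapes, dot-product attention, and greedy decoding loop are illustrative assumptions) of treating detected objects as the source sequence of a sequence-to-sequence RNN with a sequential attention layer:

```python
# Illustrative: object feature vectors form the source sequence of an LSTM encoder,
# and a decoder attends over the encoded objects while emitting words.
import torch
import torch.nn as nn

obj_dim, hid, vocab = 2048, 512, 10000
encoder = nn.LSTM(obj_dim, hid, batch_first=True)       # reads the object sequence
decoder = nn.LSTMCell(hid + 300, hid)                    # word embedding + attended context
embed = nn.Embedding(vocab, 300)
out = nn.Linear(hid, vocab)

objects = torch.randn(1, 5, obj_dim)                     # 5 detected objects, ordered by the CNN
enc_states, _ = encoder(objects)                         # (1, 5, hid)

word = torch.tensor([1])                                 # assumed <start> token id
h = c = torch.zeros(1, hid)
for _ in range(10):                                      # greedy decoding sketch
    scores = (enc_states @ h.unsqueeze(2)).squeeze(2)    # sequential attention scores
    alpha = torch.softmax(scores, dim=1)
    context = (alpha.unsqueeze(2) * enc_states).sum(1)   # attended object context
    h, c = decoder(torch.cat([embed(word), context], 1), (h, c))
    word = out(h).argmax(1)                              # next word (greedy choice)
```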


Author(s): Chaitrali Prasanna Chaudhari, Satish Devane

“Image captioning is the process of generating a textual description of an image.” It deploys both computer vision and natural language processing for caption generation. However, the majority of image captioning systems offer unclear depictions of objects such as “man”, “woman”, “group of people”, “building”, etc. Hence, this paper intends to develop an intelligent image captioning model. The adopted model comprises a few steps: word generation, sentence formation, and caption generation. Initially, the input image is subjected to a deep learning classifier, a Convolutional Neural Network (CNN). Since the classifier is already trained on the relevant words related to all images, it can easily classify the words associated with a given image. Further, a set of sentences is formed from the generated words using a Long Short-Term Memory (LSTM) model. The likelihood of the formed sentences is computed using the Maximum Likelihood (ML) function, and the sentences with higher probability are taken and used to generate the caption describing the visual content of the scene. As a major novelty, this paper aims to enhance the performance of the CNN by optimally tuning its weights and activation function. It introduces a new enhanced optimization algorithm, Rider with Randomized Bypass and Over-taker update (RR-BOU), for this optimal selection. The proposed RR-BOU is an enhanced version of the Rider Optimization Algorithm (ROA). Finally, the performance of the proposed captioning model is compared against other conventional models through statistical analysis.
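
A toy sketch (not the RR-BOU pipeline; the candidate sentences and per-word probabilities are made-up numbers) of the likelihood-based sentence selection step described above:

```python
# Illustrative: candidate sentences are scored by the sum of per-word log-probabilities
# assigned by the language model, and the most likely one becomes the caption.
import math

def sentence_log_likelihood(word_probs):
    """word_probs: per-word probabilities assigned by the language model."""
    return sum(math.log(p) for p in word_probs)

candidates = {
    "a man riding a horse": [0.9, 0.8, 0.7, 0.9, 0.85],
    "a person on an animal": [0.6, 0.5, 0.7, 0.6, 0.4],
}
best = max(candidates, key=lambda s: sentence_log_likelihood(candidates[s]))
print(best)   # the higher-probability sentence is kept as the image caption
```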


2020, Vol 34 (4), pp. 571-584
Author(s): Rajarshi Biswas, Michael Barz, Daniel Sonntag

Image captioning is a challenging multimodal task. Significant improvements have been obtained with deep learning. Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object-specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state-of-the-art in image captioning.
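
A minimal sketch (not the authors' implementation; the overlap score and its weight are illustrative assumptions) of re-ranking beam search candidates with an explanatory feature such as agreement with detected object labels:

```python
# Illustrative: each beam candidate carries its decoder log-probability, and an
# additional score (overlap with detected object labels) re-orders the beam.
def rerank(candidates, detected_objects, weight=0.5):
    """candidates: list of (caption, log_prob); detected_objects: set of labels."""
    def object_overlap(caption):
        return sum(1 for w in caption.split() if w in detected_objects)
    return sorted(candidates,
                  key=lambda c: c[1] + weight * object_overlap(c[0]),
                  reverse=True)

beam = [("a man sitting on a bench", -4.1), ("a person near a dog", -3.9)]
print(rerank(beam, {"man", "bench"})[0][0])   # the object-consistent caption wins
```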


Author(s): Megha J Panicker, Vikas Upadhayay, Gunjan Sethi, Vrinda Mathur

In the modern era, image captioning has become one of the most widely required tools. Moreover, there are built-in applications that generate and provide a caption for a given image, and all of these rely on deep neural network models. The process of generating a description of an image is called image captioning. It requires recognizing the important objects, their attributes, and the relationships among the objects in an image, and it generates syntactically and semantically correct sentences. In this paper, we present a deep learning model to describe images and generate captions using computer vision and machine translation. This paper aims to detect the different objects found in an image, recognize the relationships between those objects, and generate captions. The dataset used is Flickr8k, the programming language used is Python3, and an ML technique called Transfer Learning is implemented with the help of the Xception model to demonstrate the proposed experiment. This paper also elaborates on the functions and structure of the various neural networks involved. Generating image captions is an important aspect of computer vision and natural language processing. Image caption generators can find applications in image segmentation, as used by Facebook and Google Photos, and their use can be extended to video frames. They can automate the job of a person who has to interpret images, and they have immense scope in helping visually impaired people.
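
A minimal sketch of the transfer-learning step described above, assuming TensorFlow/Keras and a local copy of Flickr8k (the file path is hypothetical): a pre-trained Xception network, with its classification head removed, turns each image into a feature vector that the caption decoder can consume.

```python
# Extract image feature vectors with a pre-trained Xception model (transfer learning).
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

encoder = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    img = load_img(image_path, target_size=(299, 299))        # Xception's input size
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return encoder.predict(x)                                  # shape (1, 2048)

# features = extract_features("Flickr8k_Dataset/example.jpg")  # hypothetical path
```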


2021, Vol 2021, pp. 1-10
Author(s): Tianlong Gu, Hongliang Chen, Chenzhong Bin, Liang Chang, Wei Chen

Deep learning systems have been phenomenally successful in the fields of computer vision, speech recognition, and natural language processing. Recently, researchers have adopted deep learning techniques to tackle collaborative filtering with implicit feedback. However, the existing methods generally profile both users and items directly, while neglecting the similarities between users’ and items’ neighborhoods. To this end, we propose the neighborhood attentional memory networks (NAMN), a deep learning recommendation model applying two dedicated memory networks to capture users’ neighborhood relations and items’ neighborhood relations, respectively. Specifically, we first design the user neighborhood component and the item neighborhood component based on memory networks and attention mechanisms. Then, by the associative addressing scheme with the user and item memories in the neighborhood components, we capture the complex user-item neighborhood relations. Stacking multiple memory modules together yields deeper architectures that explore higher-order complex user-item neighborhood relations. Finally, the output module jointly exploits the user and item neighborhood information with the user and item memories to obtain the ranking score. Extensive experiments on three real-world datasets demonstrate significant improvements of the proposed NAMN method over the state-of-the-art methods.
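
A minimal sketch (not the NAMN code; dimensions and the dot-product addressing are illustrative assumptions) of the associative addressing idea, where a user embedding queries a memory of neighbor embeddings:

```python
# Illustrative associative addressing: attention weights address the memory slots
# and the weighted read-out summarizes the user's neighborhood.
import torch

def address_memory(query, memory):
    # query: (dim,); memory: (num_neighbors, dim)
    weights = torch.softmax(memory @ query, dim=0)   # addressing weights over slots
    return weights @ memory                          # neighborhood read-out vector

user = torch.randn(64)
neighbor_memory = torch.randn(10, 64)                # embeddings of 10 neighboring users
neighborhood_vec = address_memory(user, neighbor_memory)
```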


2020, Vol 2020, pp. 1-13
Author(s): Haoran Wang, Yue Zhang, Xiaosheng Yu

In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field of artificial intelligence and has become an interesting and arduous task. Image captioning, automatically generating natural language descriptions according to the content observed in an image, is an important part of scene understanding, which combines the knowledge of computer vision and natural language processing. The applications of image captioning are extensive and significant, for example, in human-computer interaction. This paper summarizes the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation tasks. Furthermore, the advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are provided. Finally, this paper highlights some open challenges in the image captioning task.


Author(s): Kemal Oflazer

Morphology is the study of the structure of words and of how words are formed by combining smaller units of linguistic information called morphemes. Any natural language processing application needs to computationally process the words in a language before any more complex processing is done. This is especially a must for morphologically complex languages. After a compact overview of the basic concepts in morphology, this chapter presents the state-of-the-art computational approaches to morphology, concentrating on two-level morphology and cascaded rules, and describing how morphographemics and morphotactics are handled in a finite-state setting. The chapter then summarizes recent approaches applying machine learning techniques to morphological processing.
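
A toy sketch (purely illustrative, not a full two-level morphology system; the states and morpheme labels are assumptions) of finite-state morphotactics, i.e., checking that morphemes combine in a legal order:

```python
# Tiny finite-state acceptor for morpheme ordering: root -> optional plural -> end.
TRANSITIONS = {
    ("START", "root"): "ROOT",
    ("ROOT", "plural"): "INFLECTED",
}
FINAL_STATES = {"ROOT", "INFLECTED"}

def accepts(morphemes):
    state = "START"
    for m in morphemes:
        state = TRANSITIONS.get((state, m))
        if state is None:
            return False
    return state in FINAL_STATES

print(accepts(["root", "plural"]))   # True: root+plural is a legal word form
print(accepts(["plural"]))           # False: a plural marker cannot stand alone
```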

