Overview on Image Captioning Techniques

Image captioning is the process of assigning a meaningful title to a given image with the help of Natural Language Processing (NLP) and Computer Vision techniques. Captioning an image first requires identifying the objects, their attributes, and the relationships among them in the image, and second, generating a relevant description of the image; the task therefore requires both NLP and Computer Vision techniques. The complexity of finding the relationships between objects' attributes and their features makes it challenging. It is also difficult for a machine to emulate the human brain; however, research has shown prominent achievements in this field and has made such problems easier to solve. The foremost aim of this survey is to describe several methods for achieving this. The core contribution of this paper is to categorise the existing approaches to image captioning, discuss and classify their subcategories, and outline some of their strengths and limitations. The survey gives a theoretical analysis of image captioning methods, covering both earlier and more recent approaches, and serves as a source of information for researchers seeking an overview of the different approaches developed so far in the field of image captioning.

Key words: Computer Vision, Deep Learning, Neural Network, NLP, Image Captioning, Multimodal Learning.

Author(s): Santosh Kumar Mishra, Rijul Dhir, Sriparna Saha, Pushpak Bhattacharyya

Image captioning is the process of generating a textual description of an image that aims to describe its salient parts. It is an important problem, as it involves computer vision and natural language processing, where computer vision is used for understanding images and natural language processing for language modeling. A great deal of work has been done on image captioning for the English language. In this article, we develop a model for image captioning in the Hindi language. Hindi is the official language of India and the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in the Hindi language. A dataset is manually created by translating the well-known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in the Hindi language; these attention mechanisms have never before been used for Hindi. The results of the proposed model are compared with several baselines in terms of BLEU scores, and they show that our model performs better than the others. Manual evaluation of the obtained captions in terms of adequacy and fluency also confirms the effectiveness of our proposed approach. Availability of resources: the code for the article is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language ; the dataset will be made available at http://www.iitp.ac.in/∼ai-nlp-ml/resources.html .
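As an illustration of the kind of attention such architectures build on, the following is a minimal PyTorch sketch of additive (Bahdanau-style) attention over CNN region features; the class and dimension names are illustrative and not taken from the authors' released code.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention over CNN feature vectors (one per image region)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)      # projects region features
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.v = nn.Linear(attn_dim, 1)                  # scores each region

    def forward(self, features, hidden):
        # features: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(features) +
                                   self.w_hidden(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)           # (batch, regions, 1)
        context = (weights * features).sum(dim=1)        # weighted sum of regions
        return context, weights
```

At each decoding step, the context vector is concatenated with the previous word embedding and fed to the recurrent decoder, so the model attends to different image regions for different words.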


Author(s): Megha J Panicker, Vikas Upadhayay, Gunjan Sethi, Vrinda Mathur

In the modern era, image captioning has become one of the most widely required tools, and there are built-in applications that generate captions for given images, all powered by deep neural network models. The process of generating a description of an image is called image captioning. It requires recognizing the important objects, their attributes, and the relationships among the objects in an image, and it generates syntactically and semantically correct sentences. In this paper, we present a deep learning model that describes images and generates captions using computer vision and machine translation. The paper aims to detect the different objects found in an image, recognize the relationships between those objects, and generate captions. The dataset used is Flickr8k, the programming language used is Python3, and a machine learning technique called transfer learning is implemented with the help of the Xception model to demonstrate the proposed experiment. The paper also elaborates on the functions and structure of the various neural networks involved. Generating image captions is an important aspect of computer vision and natural language processing. Image caption generators find applications in image segmentation, as used by Facebook and Google Photos, and their use can be extended to video frames. They can automate the job of a person who has to interpret images, and they have immense scope in helping visually impaired people.
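To illustrate the transfer-learning step described above, here is a minimal sketch of using a pretrained Xception network as a frozen feature extractor with tf.keras; the helper name image_features is ours, and the caption decoder that would consume these features is omitted.

```python
import numpy as np
import tensorflow as tf

# Pretrained Xception as a fixed feature extractor (transfer learning):
# include_top=False drops the ImageNet classifier head; pooling='avg'
# yields a single 2048-d feature vector per image for the caption decoder.
extractor = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, pooling="avg")
extractor.trainable = False

def image_features(path):
    # Xception expects 299x299 inputs scaled by its own preprocess_input.
    img = tf.keras.utils.load_img(path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.xception.preprocess_input(x)
    return extractor.predict(x)   # shape (1, 2048)
```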


2020, Vol 3 (1), pp. 138-146
Author(s): Subash Pandey, Rabin Kumar Dhamala, Bikram Karki, Saroj Dahal, Rama Bastola

Automatically generating a natural language description of an image is a major challenge in the field of artificial intelligence. Generating a description of an image brings together two fields: natural language processing and computer vision. There are two types of approaches, top-down and bottom-up. In this paper we take the top-down approach, which starts from the image and converts it into words. The image is passed to a Convolutional Neural Network (CNN) encoder, and its output is fed to a Recurrent Neural Network (RNN) decoder that generates meaningful captions. We generated image descriptions by passing real-time images from a smartphone camera as well as test images from the dataset. To evaluate the model's performance, we used the BLEU (Bilingual Evaluation Understudy) score and matched the predicted words against the original captions.
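As a concrete illustration of the evaluation step, the following sketch computes smoothed BLEU-1 through BLEU-4 for a single predicted caption using NLTK; the example tokens are invented for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# BLEU compares predicted n-grams against reference captions; smoothing
# avoids zero scores when a higher-order n-gram has no match.
references = [["a", "dog", "runs", "on", "the", "grass"]]   # tokenized ground truth
candidate = ["a", "dog", "is", "running", "on", "grass"]    # tokenized prediction

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # BLEU-n weights the first n n-gram orders uniformly and ignores the rest.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = sentence_bleu(references, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```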


Electronics, 2021, Vol 10 (20), pp. 2470
Author(s): Dulari Bhatt, Chirag Patel, Hardik Talsania, Jigar Patel, Rasmika Vaghela, ...

Computer vision is becoming an increasingly trendy term in the area of image processing. With the emergence of computer vision applications, there is significant demand for recognizing objects automatically. Deep CNNs (convolutional neural networks) have benefited the computer vision community by producing excellent results in video processing, object recognition, image classification and segmentation, natural language processing, speech recognition, and many other fields. Furthermore, the availability of large amounts of data and readily available hardware has opened new avenues for CNN research. Several inspirational concepts for the progress of CNNs have been investigated, including alternative activation functions, regularization, parameter optimization, and architectural advances; innovations in architecture in particular have brought a tremendous enhancement in the capacity of deep CNNs. Significant emphasis has been placed on leveraging channel and spatial information, depth of architecture, and information processing via multiple paths. This survey paper focuses on the primary taxonomy and newly released deep CNN architectures, dividing numerous recent developments in CNN architectures into eight groups: spatial exploitation, multi-path, depth, breadth, dimension, channel boosting, feature-map exploitation, and attention-based CNNs. The main contribution of this manuscript is a comparison of the various architectural evolutions of CNNs in terms of their architectural changes, strengths, and weaknesses. It also includes an explanation of the CNN's components, the strengths and weaknesses of various CNN variants, research gaps and open challenges, CNN applications, and future research directions.
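As a representative example of the feature-map-exploitation and attention-based categories in this taxonomy, the following is a minimal PyTorch sketch of a squeeze-and-excitation style channel-attention block; it is a generic illustration, not a block taken from any specific architecture surveyed.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweights feature-map channels by global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # "squeeze": global average pool
        self.fc = nn.Sequential(                         # "excitation": bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        # x: (batch, channels, height, width)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # per-channel rescaling
```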


2022
Author(s): Ms. Aayushi Bansal, Dr. Rewa Sharma, Dr. Mamta Kathuria

Recent advancements in deep learning architectures have increased their utility in real-life applications. Deep learning models require a large amount of data for training, yet in many application domains, such as marketing, computer vision, and medical science, only a limited set of data is available, as collecting new data is either infeasible or resource-intensive. These models need large datasets to avoid overfitting. One of the data-space solutions to the problem of limited data is data augmentation. This study focuses on the various data augmentation techniques that can be used to further improve the accuracy of a neural network. Augmenting available data saves the cost and time required to collect new data for training deep neural networks; it also regularizes the model and improves its capability to generalize. The need for large datasets in different fields such as computer vision, natural language processing, security, and healthcare is also covered in this survey. The goal of this paper is to provide a comprehensive survey of recent advancements in data augmentation techniques and their application in various domains.
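As an illustration, a typical image-augmentation pipeline can be expressed in a few lines with torchvision; the specific transforms and parameters below are illustrative choices, not a prescription from the survey.

```python
from torchvision import transforms

# Each training epoch sees a randomly perturbed variant of every image,
# enlarging the effective dataset without collecting new data.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop and rescale
    transforms.RandomHorizontalFlip(p=0.5),               # mirror half the time
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),               # photometric jitter
    transforms.RandomRotation(degrees=15),                # small random rotations
    transforms.ToTensor(),
])
```

The pipeline is applied per sample inside a Dataset or DataLoader, so the same stored image yields a different tensor each time it is drawn.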


Electronics, 2019, Vol 8 (12), pp. 1417
Author(s): Shan He, Yuanyao Lu

Image captioning is a comprehensive task in computer vision (CV) and natural language processing (NLP). It performs a conversion from image to text: the algorithm automatically generates descriptive text corresponding to an input image. In this paper, we present an end-to-end model that uses a deep convolutional neural network (CNN) as the encoder and a recurrent neural network (RNN) as the decoder. To obtain better feature extraction for image captioning, we propose a highly modularized multi-branch CNN, which increases accuracy while leaving the number of hyper-parameters unchanged. This strategy yields a simply designed network consisting of parallel sub-modules of the same structure. While traditional CNNs go deeper and wider to increase accuracy, our proposed method is more effective with a simple design, which is easier to optimize for practical applications. Experiments are conducted on the Flickr8k, Flickr30k, and MSCOCO datasets. The results demonstrate that our method achieves state-of-the-art performance in terms of caption quality.
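The paper's exact block design is not reproduced here, but the idea of parallel sub-modules of identical structure can be sketched in PyTorch along ResNeXt-like lines; the branch count and channel sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiBranchBlock(nn.Module):
    """Parallel sub-modules of identical structure whose outputs are summed,
    widening the network without introducing new hyper-parameter types."""
    def __init__(self, channels, branches=8, bottleneck=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, bottleneck, kernel_size=1),   # reduce
                nn.ReLU(inplace=True),
                nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(bottleneck, channels, kernel_size=1))   # restore
            for _ in range(branches)])

    def forward(self, x):
        # Residual aggregation of all identical branches.
        return x + sum(branch(x) for branch in self.branches)
```

Because every branch shares the same topology, adding branches changes only one number (the cardinality), which is what keeps the hyper-parameter count fixed.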


2020
Author(s): Benyamin Ghojogh, Ali Ghodsi

This is a tutorial and survey paper on the attention mechanism, transformers, BERT, and GPT. We first explain the attention mechanism, sequence-to-sequence models with and without attention, self-attention, and attention in different areas such as natural language processing and computer vision. Then we explain transformers, which do not use any recurrence. We describe all the parts of the encoder and decoder in the transformer, including positional encoding, multi-head self-attention and cross-attention, and masked multi-head attention. Thereafter, we introduce the Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) as stacks of transformer encoders and decoders, respectively, and explain their characteristics and how they work.
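As a minimal illustration of the mechanism at the heart of the transformer, here is scaled dot-product attention in PyTorch; the function is a generic sketch, not code from the tutorial.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Core transformer operation: softmax(Q K^T / sqrt(d)) V."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # pairwise similarities
    if mask is not None:
        # e.g. a causal mask in masked multi-head attention: zero entries
        # are blocked so a position cannot attend to future positions.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Multi-head attention simply runs several such maps in parallel on linearly projected Q, K, V and concatenates the results; cross-attention uses the decoder's queries against the encoder's keys and values.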


Author(s): Haiyan Li, Dezhi Han

Visual Question Answering (VQA) is a multimodal research area related to Computer Vision (CV) and Natural Language Processing (NLP). How to better extract useful information from images and questions and give an accurate answer is the core of the VQA task. This paper presents a VQA model based on multimodal encoders and decoders with gate attention (MEDGA). Each encoder and decoder block in MEDGA applies not only self-attention and cross-modal attention but also gate attention, so that the model can better focus on inter-modal and intra-modal interactions simultaneously within the visual and language modalities. MEDGA further filters out noisy information irrelevant to the result via gate attention and finally outputs attention results that are closely related to the visual and language features, which makes the answer prediction more accurate. Experimental evaluations on the VQA 2.0 dataset, together with ablation experiments under different conditions, prove the effectiveness of MEDGA. In addition, MEDGA's accuracy on the test-std set reaches 70.11%, exceeding many existing methods.
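MEDGA itself is not reproduced here, but the general idea of gating an attention output to filter out irrelevant information can be sketched as follows in PyTorch; the module name, the use of nn.MultiheadAttention, and the gating formula are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-modal attention whose output passes through a learned sigmoid
    gate, so feature dimensions judged irrelevant can be suppressed."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        # dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, query, context):
        # query: (batch, len_q, dim), e.g. question tokens
        # context: (batch, len_k, dim), e.g. visual region features
        attended, _ = self.attn(query, context, context)
        g = torch.sigmoid(self.gate(torch.cat([query, attended], dim=-1)))
        return g * attended + (1 - g) * query   # gated fusion of the two streams
```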


A natural language is a language known to humans. In computer science, making computers understand natural languages and automatically generate captions for a given image is among the most challenging tasks; while a lot of work has been done, a complete solution to this problem has so far proved daunting. Image captioning is a crucial job involving linguistic image understanding and the ability to generate sentences with proper and accurate structure, and it requires expertise in both image processing and natural language processing. The authors propose a system that uses a multilayer Convolutional Neural Network (CNN) to generate language describing images and a Long Short-Term Memory (LSTM) network to concisely frame relevant phrases from the derived keywords. In this article we aim to provide a brief overview of the current methods and algorithms for image captioning using deep learning, and we also cover the datasets and evaluation metrics widely used for the task.
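To make the CNN-plus-LSTM pipeline concrete, here is a sketch of greedy caption decoding in PyTorch; the decoder interface (init_hidden and the (token, hidden) call signature) is a hypothetical one assumed for illustration, not a specific implementation from the surveyed work.

```python
import torch

def greedy_caption(decoder, features, word2idx, idx2word, max_len=20):
    """Greedy decoding: seed the LSTM decoder with the CNN image features,
    then repeatedly pick the most probable next word until <end> appears."""
    token = torch.tensor([[word2idx["<start>"]]])   # (1, 1) start token
    hidden = decoder.init_hidden(features)          # assumed: LSTM state from image
    caption = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)     # assumed decoder interface
        token = logits.argmax(dim=-1)               # greedy choice of next word
        word = idx2word[token.item()]
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)
```

Beam search is the usual refinement of this loop: instead of keeping one hypothesis, it keeps the k most probable partial captions at each step.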


Author(s): Santosh Kumar Mishra, Gaurav Rai, Sriparna Saha, Pushpak Bhattacharyya

Image captioning refers to the process of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence, computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism together with Bahdanau attention and a gated recurrent unit for developing an image captioning model for Hindi. Color images usually consist of three channels: red, green, and blue. The channel attention mechanism focuses on an image's important channels while performing the convolution, assigning higher importance to specific channels over others; it has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi is the fourth most spoken language globally, widely spoken in India and South Asia, and is India's official language. A dataset for image captioning in Hindi is manually created by translating the well-known MSCOCO dataset from English to Hindi. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results illustrate that it outperforms the other baselines, attaining improvements of 0.59%, 2.51%, 4.38%, and 3.30% in terms of BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, with respect to the state-of-the-art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the proposed method's efficacy.
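Since the proposed model builds on ECA-Net, the following is a minimal PyTorch sketch of an efficient-channel-attention module in that style: global average pooling followed by a 1-D convolution across channels, with no bottleneck dimensionality reduction. The kernel size is fixed here for simplicity, whereas ECA-Net chooses it adaptively from the channel count.

```python
import torch
import torch.nn as nn

class ECAModule(nn.Module):
    """Efficient channel attention (ECA-Net style): each channel's weight
    is computed from its neighbors via a cheap 1-D convolution, avoiding
    the dimensionality-reducing MLP used by squeeze-and-excitation."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # x: (batch, channels, height, width)
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, 1, c)                  # channel descriptor
        y = torch.sigmoid(self.conv(y)).view(b, c, 1, 1)
        return x * y                                    # channel-wise reweighting
```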

