Image Captioning using Facial Expression and Attention

2020 · Vol 68 · pp. 661-689
Author(s): Omid Mohamad Nezami, Mark Dras, Stephen Wan, Cecile Paris

Benefiting from advances in machine vision and natural language processing techniques, current image captioning systems are able to generate detailed visual descriptions. For the most part, these descriptions represent an objective characterisation of the image, although some models do incorporate subjective aspects related to the observer’s view of the image, such as sentiment; current models, however, usually do not consider the emotional content of images during the caption generation process. This paper addresses this issue by proposing novel image captioning models which use facial expression features to generate image captions. The models generate image captions using long short-term memory networks applying facial features in addition to other visual features at different time steps. We compare a comprehensive collection of image captioning models with and without facial features using all standard evaluation metrics. The evaluation metrics indicate that applying facial features with an attention mechanism achieves the best performance, showing more expressive and more correlated image captions, on an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. An analysis of the generated captions finds that, perhaps unexpectedly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions.
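The abstract above does not include code; as a rough illustration of the idea, the sketch below (in PyTorch, with illustrative dimensions and names, not the authors' implementation) shows how facial-expression features could be attended to alongside visual region features at each LSTM decoding step.

```python
# A minimal sketch (not the authors' code): facial-expression features are
# projected into the same space as visual region features and attended over
# at every LSTM decoding step. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FaceAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 visual_dim=2048, face_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.face_proj = nn.Linear(face_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)          # additive attention score
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, face_feats, captions):
        # regions: (B, R, visual_dim), face_feats: (B, face_dim), captions: (B, T)
        feats = torch.cat([self.visual_proj(regions),
                           self.face_proj(face_feats).unsqueeze(1)], dim=1)  # (B, R+1, H)
        B, T = captions.shape
        h = feats.new_zeros(B, feats.size(-1))
        c = torch.zeros_like(h)
        logits = []
        for t in range(T):
            # Attend over the visual regions plus the facial-expression "region".
            scores = self.attn(torch.cat([feats, h.unsqueeze(1).expand_as(feats)], -1))
            context = (scores.softmax(dim=1) * feats).sum(dim=1)             # (B, H)
            h, c = self.lstm(torch.cat([self.embed(captions[:, t]), context], -1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                                    # (B, T, vocab)
```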

Author(s): Chaitrali Prasanna Chaudhari, Satish Devane

“Image Captioning is the process of generating a textual description of an image”. It deploys both computer vision and natural language processing for caption generation. However, the majority of image captioning systems offer unclear depictions of objects such as “man”, “woman”, “group of people”, “building”, etc. Hence, this paper develops an intelligence-based image captioning model. The adopted model comprises a few steps: word generation, sentence formation, and caption generation. Initially, the input image is given to a deep learning classifier, a Convolutional Neural Network (CNN). Since the classifier is already trained on the relevant words associated with all images, it can easily classify the words associated with a given image. Further, a set of sentences is formed from the generated words using a Long Short-Term Memory (LSTM) model. The likelihood of the formed sentences is computed using the Maximum Likelihood (ML) function, and the sentences with higher probability are kept and used to generate the visual description of the scene as the image caption. As a major novelty, this paper aims to enhance the performance of the CNN by optimally tuning its weights and activation function. For this optimal selection, it introduces a new enhanced optimization algorithm, Rider with Randomized Bypass and Over-taker update (RR-BOU), an enhanced version of the Rider Optimization Algorithm (ROA). Finally, the performance of the proposed captioning model is compared with other conventional models through statistical analysis.
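As an illustration of the sentence-selection step described above (candidate sentences scored by likelihood, with the most probable kept), here is a minimal sketch; `lstm_lm` and `vocab` are assumed placeholder objects, and the RR-BOU optimisation of the CNN is not shown.

```python
# A minimal sketch, under assumed interfaces, of ranking candidate sentences
# built from CNN-predicted words by their LSTM language-model likelihood.
import math
import torch
import torch.nn.functional as F

def sentence_log_likelihood(lstm_lm, token_ids):
    """Sum of log P(w_t | w_<t) under an LSTM language model (placeholder API)."""
    inputs = token_ids[:-1].unsqueeze(0)            # (1, T-1)
    targets = token_ids[1:]                         # (T-1,)
    logits, _ = lstm_lm(inputs)                     # assumed to return (logits, state)
    log_probs = F.log_softmax(logits.squeeze(0), dim=-1)
    return log_probs[torch.arange(len(targets)), targets].sum().item()

def pick_best_caption(lstm_lm, candidates, vocab):
    """Return the candidate sentence with the highest log-likelihood."""
    best, best_score = None, -math.inf
    for sent in candidates:
        ids = torch.tensor([vocab[w] for w in sent.split()])
        score = sentence_log_likelihood(lstm_lm, ids)
        if score > best_score:
            best, best_score = sent, score
    return best
```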


2020 · Vol 17 (1) · pp. 473-478
Author(s): Mayank, Naveen Kumar Gondhi

Image Captioning is the combination of Computer Vision and Natural Language Processing (NLP) in which simple sentences are automatically generated to describe the content of an image. This paper presents a comparative analysis of different models used for the generation of descriptive English captions for a given image. Feature extraction from the images is done using Convolutional Neural Networks (CNN). These features are then passed to Recurrent Neural Networks (RNN) or Long Short-Term Memory (LSTM) networks to generate captions in the English language. The evaluation metrics used to assess the performance of the models are BLEU score, CIDEr and METEOR.
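Of the metrics listed above, BLEU is the simplest to reproduce; the snippet below uses NLTK's reference implementation on a toy caption pair (CIDEr and METEOR require additional packages and are omitted here).

```python
# Computing BLEU-1 to BLEU-4 for a single generated caption against one
# reference caption, using NLTK. The sentences are toy examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "rides", "a", "bicycle", "down", "the", "street"]]
candidate = ["a", "man", "is", "riding", "a", "bike", "on", "the", "street"]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))          # uniform n-gram weights
    print(f"BLEU-{n}:", sentence_bleu(reference, candidate,
                                      weights=weights, smoothing_function=smooth))
```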


Author(s): Sujeet Kumar Shukla, Saurabh Dubey, Aniket Kumar Pandey, Vineet Mishra, Mayank Awasthi, ...

In this paper, we focus on one of the visual recognition facets of computer vision, i.e. image captioning. The goal of this model is to generate captions for an image. Using deep learning techniques, image captioning aims to generate captions for an image automatically. Initially, a Convolutional Neural Network (InceptionV3) is used to detect the objects in the image. Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) with an attention mechanism are used to generate a syntactically and semantically correct caption for the image based on the detected objects. In our project, we work with a traffic sign dataset that has been captioned using the process described above. This model is extremely useful for visually impaired people who need to cross roads safely.
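A minimal sketch of the InceptionV3 encoding step described above, using a recent torchvision release; the LSTM-with-attention decoder that consumes these features is not shown, and the preprocessing values are the standard ImageNet ones rather than anything specified in the paper.

```python
# Extracting a 2048-d image feature vector with a pre-trained InceptionV3
# (torchvision >= 0.13 weights API) by replacing its classification head.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

encoder = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
encoder.fc = nn.Identity()      # drop the classifier, keep the pooled features
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),                       # InceptionV3 input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def encode_image(path):
    """Return a (1, 2048) feature vector for one image, to feed the decoder."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return encoder(image)
```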


Author(s): Dr. A. M. Chandrashekhar

Describing the content of an image has been a fundamental problem of machine learning that connects computer vision and natural language processing. In recent years, the task of object recognition has advanced at an exceptional rate, which in turn has made image captioning much better and easier. In this paper, we discuss the use of image captioning with deep learning for the visually impaired. We use Convolutional Neural Networks along with Long Short-Term Memory to train and generate captions for images, together with a text-to-speech engine that makes the experience of visually impaired users browsing the internet much smoother. We discuss how the model was implemented, its different components and modules, along with a result analysis conducted on a set of outputs peer-reviewed by our colleagues, friends and professors.
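The paper does not name its text-to-speech engine; as an illustrative sketch only, the snippet below reads a generated caption aloud with the off-the-shelf pyttsx3 library.

```python
# Reading a generated caption aloud with pyttsx3 (an offline TTS engine);
# this is an illustrative choice, not necessarily the engine used in the paper.
import pyttsx3

def speak_caption(caption: str) -> None:
    """Read a generated caption aloud for a visually impaired user."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)      # slightly slower speech for clarity
    engine.say(caption)
    engine.runAndWait()

speak_caption("a man riding a bicycle down a busy street")
```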


2019 · Vol 9 (9) · pp. 1871
Author(s): Chanrith Poleak, Jangwoo Kwon

Automatically generating a novel description of an image is a challenging and important problem that brings together advanced research in both computer vision and natural language processing. In recent years, image captioning has significantly improved its performance by using long short-term memory (LSTM) as a decoder for the language model. However, despite this improvement, LSTM itself has its own shortcomings as a model because the structure is complicated and its nature is inherently sequential. This paper proposes a model using a simple convolutional network for both encoder and decoder functions of image captioning, instead of the current state-of-the-art approach. Our experiment with this model on a Microsoft Common Objects in Context (MSCOCO) captioning dataset yielded results that are competitive with the state-of-the-art image captioning model across different evaluation metrics, while having a much simpler model and enabling parallel graphics processing unit (GPU) computation during training, resulting in a faster training time.
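To make the convolutional-decoder idea concrete, here is a rough sketch (not the paper's implementation) of a caption decoder built from causal 1-D convolutions: every position sees only earlier words, so all time steps can be computed in parallel on the GPU. Dimensions and layer counts are illustrative.

```python
# A fully convolutional caption decoder: word embeddings conditioned on image
# features pass through left-padded (causal) Conv1d layers with residual
# connections, so training parallelises over the whole caption at once.
import torch
import torch.nn as nn

class ConvCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, channels=256,
                 kernel_size=3, num_layers=4, image_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.pad = kernel_size - 1                      # left padding => causality
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size) for _ in range(num_layers))
        self.out = nn.Linear(channels, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, image_dim), captions: (B, T)
        x = self.embed(captions) + self.image_proj(image_feats).unsqueeze(1)
        x = x.transpose(1, 2)                           # (B, C, T) for Conv1d
        for conv in self.convs:
            residual = x
            # pad on the left only, so position t never sees positions > t
            x = torch.relu(conv(nn.functional.pad(x, (self.pad, 0)))) + residual
        return self.out(x.transpose(1, 2))              # (B, T, vocab)
```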


Author(s): Anish Banda

Abstract: In the proposed model, we examine a deep neural network-based image caption generation technique. Given an image as input, the model produces output in three forms: a sentence in three different languages describing the image, an mp3 audio file, and an image file. The model uses techniques from both computer vision and natural language processing. We aim to combine a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to build a model that generates captions. The target image is compared with the training images from a large training dataset; this comparison is performed by the convolutional neural network. The model then generates a reasonable description using the trained data. To extract features from images we need an encoder, and we use a CNN as the encoder. To decode the generated description of the image we use an LSTM. To evaluate the accuracy of the generated captions we use the BLEU metric, which grades the quality of the generated content. Performance is measured using standard evaluation metrics. Keywords: CNN, RNN, LSTM, BLEU score, encoder, decoder, captions, image description.
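A minimal sketch of the greedy decoding loop implied by the CNN-encoder/LSTM-decoder description above; `lstm_cell`, `embed`, `out_proj`, `init_state`, `start_id` and `end_id` are placeholder names, not from the paper.

```python
# Greedy caption generation: starting from the image-derived LSTM state, emit
# the most probable word at each step until the end token or a length limit.
import torch

@torch.no_grad()
def greedy_caption(lstm_cell, embed, out_proj, init_state,
                   start_id, end_id, max_len=20):
    """Generate a caption (list of token ids) from an encoded image."""
    h, c = init_state                     # hidden state derived from CNN features
    token = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        h, c = lstm_cell(embed(token), (h, c))   # one decoding step
        token = out_proj(h).argmax(dim=-1)       # pick the most probable word
        if token.item() == end_id:
            break
        caption.append(token.item())
    return caption
```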


A natural language is a language known to humans. In computer science, making computers understand natural languages and automatically generate a caption from a given image is a highly challenging task. While a lot of work has been done, a complete solution to this problem has proven elusive so far. Image captioning is a crucial job involving linguistic image understanding and the ability to generate sentence descriptions with proper and accurate structure. It requires expertise in both image processing and natural language processing. The authors propose a system that uses a multilayer Convolutional Neural Network (CNN) to generate language describing the images and a Long Short-Term Memory (LSTM) network to concisely frame relevant phrases using the derived keywords. In this article, we aim to provide a brief overview of current methods and algorithms for image captioning using deep learning. We also cover the datasets and evaluation criteria widely used for this task.
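As a generic illustration of how the CNN-plus-LSTM captioning systems surveyed above are usually trained, the sketch below shows one teacher-forced training step with token-level cross-entropy; `model` is a placeholder for any encoder-decoder captioner, not a specific system from the survey.

```python
# One teacher-forced training step: the decoder predicts each next token of
# the reference caption and is optimised with cross-entropy, ignoring padding.
import torch
import torch.nn as nn

def train_step(model, optimizer, images, captions, pad_id):
    """Run one optimisation step for an encoder-decoder captioning model."""
    optimizer.zero_grad()
    logits = model(images, captions[:, :-1])              # predict next tokens
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),               # (B*(T-1), vocab)
        captions[:, 1:].reshape(-1),                       # shifted targets
        ignore_index=pad_id)                               # skip padding tokens
    loss.backward()
    optimizer.step()
    return loss.item()
```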


2020 · Vol 44 (4) · pp. 618-626
Author(s): A.S. Kharchevnikova, A.V. Savchenko

The paper considers the problem of extracting user preferences from a photo gallery. We propose a novel approach based on image captioning, i.e., the automatic generation of textual descriptions of photos, and their classification. Known image captioning methods based on convolutional and recurrent (Long Short-Term Memory) neural networks are analyzed. We train several models that combine the visual features of a photograph with the outputs of a Long Short-Term Memory block using Google's Conceptual Captions dataset. We examine the application of natural language processing algorithms to transform the obtained textual annotations into user preferences. Experimental studies are carried out on Microsoft COCO Captions, Flickr8k and a specially collected dataset reflecting the user's interests. It is demonstrated that the best quality of preference prediction is achieved using keyword search methods and text summarization from the Watson API, which are 8% more accurate compared to traditional latent Dirichlet allocation. Moreover, descriptions generated by the trained neural models are classified 1-7% more accurately than those from known image captioning models.
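A much-simplified sketch of the keyword-based preference extraction idea reported as most accurate above: generated captions are matched against per-category keyword lists and the categories with the most hits are returned. The category names and keywords are illustrative, not taken from the paper.

```python
# Rank interest categories by counting keyword hits across all captions
# generated for a user's photo gallery.
from collections import Counter

CATEGORIES = {
    "sports":  {"ball", "player", "bike", "skiing", "surfing"},
    "food":    {"pizza", "plate", "cake", "sandwich", "coffee"},
    "travel":  {"beach", "mountain", "airplane", "train", "city"},
    "animals": {"dog", "cat", "horse", "bird", "elephant"},
}

def predict_preferences(captions, top_k=2):
    """Return the top-k interest categories for a gallery of captions."""
    counts = Counter()
    for caption in captions:
        words = set(caption.lower().split())
        for category, keywords in CATEGORIES.items():
            counts[category] += len(words & keywords)
    return [category for category, _ in counts.most_common(top_k)]

print(predict_preferences([
    "a dog catching a frisbee on the beach",
    "a man riding a bike down a mountain trail",
]))
```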


2021 · Vol 21 (2) · pp. 1-25
Author(s): Pin Ni, Yuming Li, Gangmin Li, Victor Chang

Cyber-Physical Systems (CPS), as multi-dimensional complex systems that connect the physical world and the cyber world, have a strong demand for processing large amounts of heterogeneous data. These tasks include Natural Language Inference (NLI) tasks based on text from different sources. However, current research on natural language processing in CPS has not yet explored this area. Therefore, this study proposes a Siamese Network structure that combines stacked residual bidirectional Long Short-Term Memory with an attention mechanism and a Capsule Network for the NLI module in CPS, which is used to infer the relationship between text/language data from different sources. This model is mainly used to implement NLI tasks and is evaluated in detail on three main NLI benchmarks as the basic semantic understanding module in CPS. Comparative experiments show that the proposed method achieves competitive performance, has a certain generalization ability, and balances performance against the number of trained parameters.
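A reduced sketch of the Siamese idea described above: premise and hypothesis are encoded by the same stacked bidirectional LSTM and the pair representation is classified into NLI labels. The attention and capsule components of the paper are omitted, and all dimensions are illustrative.

```python
# A Siamese stacked bi-LSTM NLI classifier: both inputs share one encoder and
# are compared with standard matching features before classification.
import torch
import torch.nn as nn

class SiameseBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(hidden_dim * 2 * 4, num_labels)

    def encode(self, tokens):
        # Shared weights: premise and hypothesis both pass through this encoder.
        outputs, _ = self.encoder(self.embed(tokens))     # (B, T, 2H)
        return outputs.max(dim=1).values                  # max-pool over time

    def forward(self, premise, hypothesis):
        p, h = self.encode(premise), self.encode(hypothesis)
        # Matching features: concatenation, absolute difference, element-wise product.
        pair = torch.cat([p, h, torch.abs(p - h), p * h], dim=-1)
        return self.classifier(pair)   # entailment / neutral / contradiction logits
```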


2021 · Vol 11 (1)
Author(s): Rakesh David, Rhys-Joshua D. Menezes, Jan De Klerk, Ian R. Castleden, Cornelia M. Hooper, ...

Abstract: The increased diversity and scale of published biological data has led to a growing appreciation of the applications of machine learning and statistical methodologies for gaining new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods: the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were used to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the manually curated SUBA dataset. The system combines pre-processing of full-text articles into a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4%, respectively (n = 30). Comparable scores were obtained using the CropPAL database, an independent testing dataset storing protein subcellular localisation in crop species, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big-data text analytics for generating new hypotheses.
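As a small illustration of how the precision, recall, accuracy and F1 figures quoted above are computed, the snippet below uses scikit-learn on toy labels (the real labels come from the SUBA relation-extraction task, not shown here).

```python
# Computing the standard classification metrics reported above on toy data.
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # gold relation labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]   # classifier predictions (toy data)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```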

