scholarly journals Visual preferences prediction for a photo gallery based on image captioning methods

2020 ◽  
Vol 44 (4) ◽  
pp. 618-626
Author(s):  
A.S. Kharchevnikova ◽  
A.V. Savchenko

The paper considers a problem of extracting user preferences based on their photo gallery. We propose a novel approach based on image captioning, i.e., automatic generation of textual descriptions of photos, and their classification. Known image captioning methods based on convolutional and recurrent (Long short-term memory) neural networks are analyzed. We train several models that combine the visual features of a photograph and the outputs of an Long short-term memory block by using Google's Conceptual Captions dataset. We examine application of natural language processing algorithms to transform obtained textual annotations into user preferences. Experimental studies are carried out using Microsoft COCO Captions, Flickr8k and a specially collected dataset reflecting the user’s interests. It is demonstrated that the best quality of preference prediction is achieved using keyword search methods and text summarization from Watson API, which are 8 % more accurate compared to traditional latent Dirichlet allocation. Moreover, descriptions generated by trained neural models are classified 1 – 7 % more accurately when compared to known image captioning models.

A language known to humans is a natural language. In computer science it is the most challenging task to make the computers understand the natural languages and generating caption automatically from the given image. While a lot of work has been done, the total solution to this problem has been demonstrated daunting so far. Image captioning is a crucial job involving linguistic image understanding and the ability to generate interpretation of sentences with proper and accurate structure. It requires expertise in Image processing and natural language processing. The publishers suggest in this practice a system using the multilayer Convolutional Neural Network (CNN) to generate language describing the images and Long Short Term Memory (LSTM) to concisely frame relevant phrases using the driven keywords. We aim in this article to provide a brief overview of current methods and algorithms of image captioning using deep learning. We also address datasets and measurement criteria widely used for the same.


2018 ◽  
Vol 10 (11) ◽  
pp. 113 ◽  
Author(s):  
Yue Li ◽  
Xutao Wang ◽  
Pengjian Xu

Text classification is of importance in natural language processing, as the massive text information containing huge amounts of value needs to be classified into different categories for further use. In order to better classify text, our paper tries to build a deep learning model which achieves better classification results in Chinese text than those of other researchers’ models. After comparing different methods, long short-term memory (LSTM) and convolutional neural network (CNN) methods were selected as deep learning methods to classify Chinese text. LSTM is a special kind of recurrent neural network (RNN), which is capable of processing serialized information through its recurrent structure. By contrast, CNN has shown its ability to extract features from visual imagery. Therefore, two layers of LSTM and one layer of CNN were integrated to our new model: the BLSTM-C model (BLSTM stands for bi-directional long short-term memory while C stands for CNN.) LSTM was responsible for obtaining a sequence output based on past and future contexts, which was then input to the convolutional layer for extracting features. In our experiments, the proposed BLSTM-C model was evaluated in several ways. In the results, the model exhibited remarkable performance in text classification, especially in Chinese texts.


Symmetry ◽  
2019 ◽  
Vol 11 (10) ◽  
pp. 1290 ◽  
Author(s):  
Rahman ◽  
Siddiqui

Abstractive text summarization that generates a summary by paraphrasing a long text remains an open significant problem for natural language processing. In this paper, we present an abstractive text summarization model, multi-layered attentional peephole convolutional LSTM (long short-term memory) (MAPCoL) that automatically generates a summary from a long text. We optimize parameters of MAPCoL using central composite design (CCD) in combination with the response surface methodology (RSM), which gives the highest accuracy in terms of summary generation. We record the accuracy of our model (MAPCoL) on a CNN/DailyMail dataset. We perform a comparative analysis of the accuracy of MAPCoL with that of the state-of-the-art models in different experimental settings. The MAPCoL also outperforms the traditional LSTM-based models in respect of semantic coherence in the output summary.


2020 ◽  
Vol 9 (4) ◽  
pp. 194
Author(s):  
Wenqi Cui ◽  
Xin He ◽  
Meng Yao ◽  
Ziwei Wang ◽  
Jie Li ◽  
...  

When a landslide happens, it is important to recognize the hazard-affected bodies surrounding the landslide for the risk assessment and emergency rescue. In order to realize the recognition, the spatial relationship between landslides and other geographic objects such as residence, roads and schools needs to be defined. Comparing with semantic segmentation and instance segmentation that can only recognize the geographic objects separately, image captioning can provide richer semantic information including the spatial relationship among these objects. However, the traditional image captioning methods based on RNNs have two main shortcomings: the errors in the prediction process are often accumulated and the location of attention is not always accurate which would lead to misjudgment of risk. To handle these problems, a landslide image interpretation network based on a semantic gate and a bi-temporal long-short term memory network (SG-BiTLSTM) is proposed in this paper. In the SG-BiTLSTM architecture, a U-Net is employed as an encoder to extract features of the images and generate the mask maps of the landslides and other geographic objects. The decoder of this structure consists of two interactive long-short term memory networks (LSTMs) to describe the spatial relationship among these geographic objects so that to further determine the role of the classified geographic objects for identifying the hazard-affected bodies. The purpose of this research is to judge the hazard-affected bodies of the landslide (i.e., buildings and roads) through the SG-BiTLSTM network to provide geographic information support for emergency service. The remote sensing data was taken by Worldview satellite after the Wenchuan earthquake happened in 2008. The experimental results demonstrate that SG-BiTLSTM network shows remarkable improvements on the recognition of landslide and hazard-affected bodies, compared with the traditional LSTM (the Baseline Model), the BLEU1 of the SG-BiTLSTM is improved by 5.89%, the matching rate between the mask maps and the focus matrix of the attention is improved by 42.81%. In conclusion, the SG-BiTLSTM network can recognize landslides and the hazard-affected bodies simultaneously to provide basic geographic information service for emergency decision-making.


Author(s):  
Tao Gui ◽  
Qi Zhang ◽  
Lujun Zhao ◽  
Yaosong Lin ◽  
Minlong Peng ◽  
...  

In recent years, long short-term memory (LSTM) has been successfully used to model sequential data of variable length. However, LSTM can still experience difficulty in capturing long-term dependencies. In this work, we tried to alleviate this problem by introducing a dynamic skip connection, which can learn to directly connect two dependent words. Since there is no dependency information in the training data, we propose a novel reinforcement learning-based method to model the dependency relationship and connect dependent words. The proposed model computes the recurrent transition functions based on the skip connections, which provides a dynamic skipping advantage over RNNs that always tackle entire sentences sequentially. Our experimental results on three natural language processing tasks demonstrate that the proposed method can achieve better performance than existing methods. In the number prediction experiment, the proposed model outperformed LSTM with respect to accuracy by nearly 20%.


Author(s):  
Satish Tirumalapudi

Abstract: Chat bots are software applications that help users to communicate with the machine and get the required result, this is where Natural Language Processing (NLP) comes into the picture. Natural language processing is based on deep learning that enables computers to acquire meaning from inputs given by the users. Natural language processing techniques can make possible the use of natural language to express ideas, thus drastically increasing accessibility. NLP engines rely on the elements of intent, utterance, entity, context, and session. Here in this project, we will be using Deep learning techniques which will be trained on the dataset which contains categories, patterns, and responses. Long Short-Term Memory (LSTM) is a Recurrent Neural Network that is capable of learning order dependence in sequence prediction problems. One of the most popular RNN approaches is LSTM to identify and control a dynamic system. We use an RNN to classify the category user’s message belongs to and then will give a response from the list of responses. Keywords: NLP – Natural Language Processing, LSTM – Long Short Term Memory, RNN – Recurrent Neural Networks.


Author(s):  
Yudi Widhiyasana ◽  
Transmissia Semiawan ◽  
Ilham Gibran Achmad Mudzakir ◽  
Muhammad Randi Noor

Klasifikasi teks saat ini telah menjadi sebuah bidang yang banyak diteliti, khususnya terkait Natural Language Processing (NLP). Terdapat banyak metode yang dapat dimanfaatkan untuk melakukan klasifikasi teks, salah satunya adalah metode deep learning. RNN, CNN, dan LSTM merupakan beberapa metode deep learning yang umum digunakan untuk mengklasifikasikan teks. Makalah ini bertujuan menganalisis penerapan kombinasi dua buah metode deep learning, yaitu CNN dan LSTM (C-LSTM). Kombinasi kedua metode tersebut dimanfaatkan untuk melakukan klasifikasi teks berita bahasa Indonesia. Data yang digunakan adalah teks berita bahasa Indonesia yang dikumpulkan dari portal-portal berita berbahasa Indonesia. Data yang dikumpulkan dikelompokkan menjadi tiga kategori berita berdasarkan lingkupnya, yaitu “Nasional”, “Internasional”, dan “Regional”. Dalam makalah ini dilakukan eksperimen pada tiga buah variabel penelitian, yaitu jumlah dokumen, ukuran batch, dan nilai learning rate dari C-LSTM yang dibangun. Hasil eksperimen menunjukkan bahwa nilai F1-score yang diperoleh dari hasil klasifikasi menggunakan metode C-LSTM adalah sebesar 93,27%. Nilai F1-score yang dihasilkan oleh metode C-LSTM lebih besar dibandingkan dengan CNN, dengan nilai 89,85%, dan LSTM, dengan nilai 90,87%. Dengan demikian, dapat disimpulkan bahwa kombinasi dua metode deep learning, yaitu CNN dan LSTM (C-LSTM),memiliki kinerja yang lebih baik dibandingkan dengan CNN dan LSTM.


2021 ◽  
Vol 12 (1) ◽  
pp. 209
Author(s):  
Yeong-Hwa Chang ◽  
Yen-Jen Chen ◽  
Ren-Hung Huang ◽  
Yi-Ting Yu

Automatically describing the content of an image is an interesting and challenging task in artificial intelligence. In this paper, an enhanced image captioning model—including object detection, color analysis, and image captioning—is proposed to automatically generate the textual descriptions of images. In an encoder–decoder model for image captioning, VGG16 is used as an encoder and an LSTM (long short-term memory) network with attention is used as a decoder. In addition, Mask R-CNN with OpenCV is used for object detection and color analysis. The integration of the image caption and color recognition is then performed to provide better descriptive details of images. Moreover, the generated textual sentence is converted into speech. The validation results illustrate that the proposed method can provide more accurate description of images.


Sign in / Sign up

Export Citation Format

Share Document