From image to language and back again

2018 ◽  
Vol 24 (3) ◽  
pp. 325-362
Author(s):  
A. BELZ ◽  
T.L. BERG ◽  
L. YU

Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).

2018 ◽  
Vol 8 (10) ◽  
pp. 1850 ◽  
Author(s):  
Zhibin Guan ◽  
Kang Liu ◽  
Yan Ma ◽  
Xu Qian ◽  
Tongkai Ji

Image caption generation is an attractive research area that focuses on generating natural language sentences to describe the visual content of a given image. It is an interdisciplinary subject combining computer vision (CV) and natural language processing (NLP). Existing image captioning methods mainly focus on generating the final image caption directly, which may lose significant identification information about the objects contained in the raw image. Therefore, we propose a new middle-level attribute-based language retouching (MLALR) method to solve this problem. Our proposed MLALR method uses middle-level attributes predicted from the object regions to retouch the intermediate image description, which is generated by our language generation model. The advantage of MLALR is that it can correct descriptive errors in the intermediate image description and make the final image caption more accurate. Moreover, evaluations on the benchmark datasets MSCOCO, Flickr8K, and Flickr30K, using the BLEU, METEOR, ROUGE-L, CIDEr, and SPICE metrics, validated the impressive performance of our MLALR method.
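
As a rough illustration of the retouching idea, the toy Python sketch below corrects object words in an intermediate caption using attributes predicted per region. The functions, attribute sets, and example caption are invented placeholders for exposition, not the authors' MLALR implementation.

```python
# Minimal, illustrative sketch of attribute-based caption retouching in the
# spirit of MLALR; the models and attribute sets here are stand-ins.

def predict_region_attributes(image_regions):
    # Placeholder: a real system would run an attribute classifier per region.
    return {"dog": {"brown", "furry"}, "frisbee": {"red"}}

def retouch_caption(intermediate_caption, region_attributes):
    """Correct object words in the caption using predicted attributes."""
    retouched = []
    for token in intermediate_caption.split():
        attrs = region_attributes.get(token)
        if attrs:
            # Prepend an attribute (here: alphabetically first) so the final
            # caption carries object-level identification information.
            retouched.append(f"{sorted(attrs)[0]} {token}")
        else:
            retouched.append(token)
    return " ".join(retouched)

caption = "a dog catches a frisbee"
attrs = predict_region_attributes(image_regions=None)
print(retouch_caption(caption, attrs))
# -> "a brown dog catches a red frisbee"
```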


2021 ◽  
pp. 105971232098304
Author(s):  
R Alexander Bentley ◽  
Joshua Borycz ◽  
Simon Carrignon ◽  
Damian J Ruck ◽  
Michael J O’Brien

The explosion of online knowledge has, paradoxically, made knowledge difficult to find. A web or journal search might retrieve thousands of articles, ranked in a manner that is biased by, for example, popularity or eigenvalue centrality rather than by informed relevance to the complex query. With hundreds of thousands of articles published each year, the dense, tangled thicket of knowledge grows ever more entwined. Although natural language processing and new methods of generating knowledge graphs can extract increasingly high-level interpretations from research articles, the results are inevitably biased toward recent, popular, and/or prestigious sources, a consequence of the inherent nature of human social-learning processes. To preserve and even rediscover lost scientific ideas, we draw on the theory that scientific progress is punctuated by inspired, revolutionary ideas at the origins of new paradigms. Using a brief case example, we suggest how phylogenetic inference might be used to rediscover potentially useful lost discoveries, as a way in which machines could help drive revolutionary science.


Author(s):  
Xiaohan Guan ◽  
Jianhui Han ◽  
Zhi Liu ◽  
Mengmeng Zhang

Many natural language processing tasks, such as information retrieval, intelligent question answering, and machine translation, require the calculation of sentence similarity. Traditional calculation methods have not handled semantic understanding well. First, Siamese-based model structures lack interaction between the two sentences; second, matching-based models lose position information and use only partial matching factors. In this paper, a combination of words and their dependencies is proposed to calculate sentence similarity; this combination can extract both word features and dependency features. To extract more matching features, a bi-directional multi-interaction matching sequence model is proposed using word2vec and dependency2vec. This model obtains matching features by convolving and pooling the word-granularity (word vector, dependency vector) interaction sequences in two directions, as sketched below, and then aggregates the bi-directional matching features. The paper evaluates the model on two tasks: paraphrase identification and natural language inference. The experimental results show that combining words with their dependencies enhances the ability to extract matching features between two sentences, and that models using dependency information achieve higher accuracy than those without it.
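
A minimal sketch of the interaction-matching step, under assumed small dimensions: token embeddings from the two sentences form a similarity grid, which is convolved and max-pooled in both matching directions. The cosine-similarity interaction and layer sizes are assumptions for clarity, not the paper's exact configuration.

```python
# Bi-directional interaction matching, schematically.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 50                          # embedding size (word2vec / dependency2vec)
s1 = torch.randn(7, d)          # sentence 1: 7 tokens
s2 = torch.randn(9, d)          # sentence 2: 9 tokens

# Interaction grid: cosine similarity between every token pair.
interaction = torch.nn.functional.cosine_similarity(
    s1.unsqueeze(1), s2.unsqueeze(0), dim=-1)      # (7, 9)

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # 8 matching-feature maps
grid = interaction.unsqueeze(0).unsqueeze(0)       # (1, 1, 7, 9)

# "Two directions": match sentence 1 against 2, and 2 against 1 (transpose).
feats_12 = conv(grid).amax(dim=(2, 3))             # max-pool over the grid
feats_21 = conv(grid.transpose(2, 3)).amax(dim=(2, 3))

# Aggregate the bi-directional matching features.
matching_features = torch.cat([feats_12, feats_21], dim=-1)  # (1, 16)
print(matching_features.shape)
```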


2021 ◽  
pp. 42-55
Author(s):  
Shitiz Gupta ◽  
...  

Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet, human-generated captions are still considered better, which makes it a challenging application for interactive machine learning. In this paper, we aim to compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models which are fed into an Encoder-Decoder network based on Stacked LSTMs with soft attention, along with embedded text to generate high accuracy captions. We have compared these models on several benchmark datasets based on different evaluation metrics like BLEU and METEOR.
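
A minimal sketch, assuming a ResNet-50 backbone and illustrative sizes, of the general pipeline the paper compares: a pretrained CNN yields an image feature vector that initializes a stacked-LSTM decoder. The soft-attention component is omitted here for brevity, and the model choice is an assumption, not the paper's exact setup.

```python
# Pretrained-CNN features feeding a stacked-LSTM caption decoder.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()          # keep the 2048-d pooled feature vector
backbone.eval()

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    feats = backbone(image)          # (1, 2048)

vocab, embed_dim, hidden = 10000, 256, 512
embedding = nn.Embedding(vocab, embed_dim)
decoder = nn.LSTM(embed_dim, hidden, num_layers=2, batch_first=True)  # stacked
init = nn.Linear(2048, hidden)

# Initialize both LSTM layers from the image feature, then score next words.
h0 = init(feats).unsqueeze(0).repeat(2, 1, 1)      # (layers, batch, hidden)
c0 = torch.zeros_like(h0)
tokens = embedding(torch.tensor([[1, 42, 7]]))     # <start> + partial caption
out, _ = decoder(tokens, (h0, c0))
logits = nn.Linear(hidden, vocab)(out)             # next-word scores
print(logits.shape)                                # (1, 3, 10000)
```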


2020 ◽  
Vol 2020 ◽  
pp. 1-13 ◽  
Author(s):  
Haoran Wang ◽  
Yue Zhang ◽  
Xiaosheng Yu

In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field and has become an interesting and arduous task. Image captioning, automatically generating natural language descriptions according to the content observed in an image, is an important part of scene understanding and combines knowledge from computer vision and natural language processing. Its applications are extensive and significant, for example in human-computer interaction. This paper summarizes the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation tasks. Furthermore, the advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are presented. Finally, the paper highlights some open challenges in the image captioning task.
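
As a concrete reference point for the surveyed attention mechanism, the snippet below shows the standard soft-attention computation: the decoder state scores each spatial image feature, and a softmax-weighted sum forms the context vector. The dimensions are illustrative assumptions.

```python
# Soft attention over spatial CNN features, schematically.
import torch
import torch.nn.functional as F

regions = torch.randn(49, 512)   # e.g., a 7x7 CNN feature map, 512-d each
state = torch.randn(512)         # current decoder hidden state

scores = regions @ state                 # relevance of each region: (49,)
alpha = F.softmax(scores, dim=0)         # attention weights, sum to 1
context = alpha @ regions                # (512,) weighted sum of regions
print(alpha.sum().item(), context.shape)
```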


Author(s):  
Harshit Dua

Nowadays, there is massive research on automatic image caption generation; the task is challenging and relies on natural language processing. For instance, it could help visually impaired people access the content of images on the web, and it could provide more precise and compact descriptions of images and videos in scenarios such as picture sharing on social networks or video surveillance systems. The architecture comprises a convolutional neural network (CNN) followed by a recurrent neural network (RNN). By learning from image-caption pairs, the method can produce captions that are generally semantically clear and grammatically correct. People usually describe a scene in natural language, which is concise and compact, whereas computer vision systems work from the image itself, a two-dimensional representation. The aim is to map the salient places and objects in the image into sentences, as in the toy decoding loop sketched below.
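
A toy greedy decoding loop in the spirit of the CNN-followed-by-RNN design described above; the vocabulary, models, and sizes are placeholders showing the control flow, not a trained captioner.

```python
# Greedy caption decoding: the RNN state starts from the image encoding and
# emits one word per step until <end> or a length limit.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = ["<start>", "<end>", "a", "dog", "runs"]
embed = nn.Embedding(len(vocab), 32)
rnn = nn.GRUCell(32, 64)
out_layer = nn.Linear(64, len(vocab))

h = torch.randn(1, 64)            # stand-in for the CNN image encoding
token = torch.tensor([0])         # <start>
caption = []
for _ in range(10):
    h = rnn(embed(token), h)
    token = out_layer(h).argmax(dim=-1)   # greedy: pick the top word
    word = vocab[token.item()]
    if word == "<end>":
        break
    caption.append(word)
print(" ".join(caption))
```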


Artificial intelligence has opened doors to new opportunities for research and development, and recent progress in machine learning and deep learning has paved the way to tackle complex problems more easily. Nowadays, almost every aspect of human life can be framed as a problem statement that can be implemented and is useful in one way or another. One such aspect is the human ability to understand and describe one's surroundings and make decisions accordingly. This ability can also be given to machines or bots to make human-machine interaction easier. Generating captions for images is the same task: describing an image based on what is seen. It can be considered a combination of computer vision and natural language processing. In this paper we survey various methods and techniques that are useful in understanding how this task can be done. The survey mainly focuses on neural network techniques, because they give state-of-the-art results.


2019 ◽  
Vol 123 ◽  
pp. 89-95 ◽  
Author(s):  
Songtao Ding ◽  
Shiru Qu ◽  
Yuling Xi ◽  
Arun Kumar Sangaiah ◽  
Shaohua Wan

Author(s):  
Vaishali Fulmal, et al.

Question-answering systems are advanced systems that provide answers to questions asked by users. Automatic question answering is a typical problem in natural language processing; it aims at designing systems that can answer a question automatically, the same way a human would. Community question answering (CQA) services have become popular over the past few years. They allow members of a community to post as well as answer questions, and they help users get information from a comprehensive set of well-answered questions. In the proposed system, a deep learning-based model is used to answer users' questions automatically. First, the questions from the dataset are embedded. A deep neural network is trained to find the similarity between questions, and the best answer for each question is the one with the highest similarity score. The purpose of the proposed system is to design a model that retrieves the answer to a question automatically; it uses a hierarchical clustering algorithm to cluster the questions.
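
A minimal sketch of the retrieval idea, with a toy hashed bag-of-words embedding standing in for the paper's trained deep model: embed the questions, score a new question against the archive by similarity, and return the answer of the closest match.

```python
# Question retrieval by embedding similarity; the embedding is a stand-in.
import numpy as np

archive = {
    "how do i reset my password": "Use the 'Forgot password' link.",
    "what is the refund policy": "Refunds are issued within 14 days.",
}

def embed(text, dim=64):
    vec = np.zeros(dim)
    for word in text.split():          # hashed bag-of-words stand-in
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def best_answer(query):
    # Dot product of unit vectors = cosine similarity.
    scores = {q: embed(query) @ embed(q) for q in archive}
    top = max(scores, key=scores.get)  # highest similarity score wins
    return archive[top], scores[top]

print(best_answer("how can i reset the password"))
```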


2021 ◽  
Vol 8 (1) ◽  
pp. 33-62
Author(s):  
Yifan Xu ◽  
Huapeng Wei ◽  
Minxuan Lin ◽  
Yingying Deng ◽  
Kekai Sheng ◽  
...  

Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
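
For readers coming from the RNN side, the snippet below shows the core self-attention computation that lets transformers process all tokens (or image patches) in parallel while mixing global information; the shapes are illustrative.

```python
# Scaled dot-product self-attention over a token sequence.
import torch
import torch.nn.functional as F

n, d = 16, 64                          # 16 tokens (or image patches), 64-d
x = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)   # every token attends to all
out = attn @ v                                 # (16, 64) global mixing
print(out.shape)
```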

