Attention Feature Network Extraction Combined with the Generation Algorithm of Multimedia Image Description

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Beibei Sun

Existing image description generation methods often fail to make full use of shallow-layer image features and cannot sufficiently capture the associations between targets in an image. To address this, this paper proposes an attention-based image description generation method. The proportions of image features drawn from different network depths are assigned autonomously according to the state of the language model, so that the generated description attends to image features at every level. In this way, the quality of the generated descriptions is improved. Experiments on the test dataset indicate that the proposed algorithm is more accurate than top-down multimedia image description algorithms that use a single attention mechanism.
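The core mechanism described above — letting the language model assign the proportions of image features taken from different network depths — can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's implementation; the module, the dimensions, and the use of one feature vector per depth are assumptions made for clarity.

    import torch
    import torch.nn as nn

    class DepthAttentionFusion(nn.Module):
        """Weights image features from several network depths using the
        language model's hidden state, so shallow and deep features both
        contribute to the generated description (illustrative sketch)."""
        def __init__(self, feat_dim, hidden_dim):
            super().__init__()
            self.score = nn.Linear(feat_dim + hidden_dim, 1)

        def forward(self, level_feats, hidden):
            # level_feats: (batch, num_levels, feat_dim) -- one vector per depth
            # hidden:      (batch, hidden_dim)           -- language-model state
            h = hidden.unsqueeze(1).expand(-1, level_feats.size(1), -1)
            scores = self.score(torch.cat([level_feats, h], dim=-1))  # (B, L, 1)
            weights = torch.softmax(scores, dim=1)      # proportion per depth
            return (weights * level_feats).sum(dim=1)   # fused feature vector

    # Usage: fuse three depth levels of 512-d features with a 256-d LSTM state.
    fusion = DepthAttentionFusion(512, 256)
    fused = fusion(torch.randn(4, 3, 512), torch.randn(4, 256))
    print(fused.shape)  # torch.Size([4, 512])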

Author(s):  
Huimin Lu ◽  
Rui Yang ◽  
Zhenrong Deng ◽  
Yonglin Zhang ◽  
Guangwei Gao ◽  
...  

Chinese image description generation tasks usually face challenges such as single-feature extraction, a lack of global information, and a lack of detailed description of the image content. To address these limitations, we propose a fuzzy attention-based DenseNet-BiLSTM Chinese image captioning method in this article. In the proposed method, we first improve the densely connected network to extract image features at different scales and to enhance the model's ability to capture weak features. At the same time, a bidirectional LSTM is used as the decoder to make fuller use of context information. An improved fuzzy attention mechanism effectively improves the correspondence between image features and contextual information. We conduct experiments on the AI Challenger dataset to evaluate the performance of the model. The results show that, compared with other models, our proposed model achieves higher scores on objective quantitative evaluation metrics, including BLEU, METEOR, ROUGE-L, and CIDEr, and the generated description sentences accurately express the image content.
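A minimal sketch of the encoder-decoder skeleton described here — a DenseNet feature extractor with a bidirectional LSTM decoder — is given below, assuming PyTorch and torchvision. Plain global pooling of the image features stands in for the paper's improved fuzzy attention, which is not reproduced; all dimensions are assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import densenet121

    class DenseNetBiLSTMCaptioner(nn.Module):
        """Sketch of a DenseNet encoder with a bidirectional LSTM decoder.
        Global average pooling replaces the paper's fuzzy attention."""
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
            super().__init__()
            self.encoder = densenet121(weights=None).features  # dense blocks
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.LSTM(embed_dim + 1024, hidden_dim,
                                   batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, vocab_size)

        def forward(self, images, captions):
            feats = self.encoder(images)             # (B, 1024, H', W')
            ctx = feats.mean(dim=(2, 3))             # global context vector
            emb = self.embed(captions)               # (B, T, E)
            ctx = ctx.unsqueeze(1).expand(-1, emb.size(1), -1)
            hidden, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
            return self.out(hidden)                  # (B, T, vocab) logits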


2012 ◽  
Vol 442 ◽  
pp. 379-385
Author(s):  
Zhi Kun Chen ◽  
Ying Wang ◽  
Yu Tian Wang ◽  
Yi Li

The accurate prediction and control of the burn-through point are key to improving the quantity and quality of sinter. The position of the burn-through point can be determined by identifying features of the flame image at the tail of the sintering machine, so effective segmentation of the flame image is the key to identifying the flame's characteristics. In this paper, the flame image of the sintering machine tail section is segmented using c-means clustering based on a particle-pair optimizer. The standard c-means clustering algorithm is easily trapped in local minima and converges slowly; the proposed method overcomes these shortcomings. Experimental results also show that the method effectively removes the halo in the tail-section image, segments quickly, and produces clearly delineated regions. This approach lays a good foundation for the extraction and identification of image features in subsequent stages.
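For reference, the standard fuzzy c-means update that the paper improves upon can be written in a few lines of NumPy. The particle-pair optimizer that replaces this local-minimum-prone alternating scheme is not reproduced here; this is a baseline sketch only, with cluster count and iteration budget chosen arbitrarily.

    import numpy as np

    def fuzzy_c_means(pixels, c=3, m=2.0, iters=50, seed=0):
        """Standard fuzzy c-means on flattened pixel intensities. The paper
        replaces this alternating update, which can stall in local minima,
        with a particle-pair-optimized search."""
        rng = np.random.default_rng(seed)
        u = rng.random((c, pixels.size))
        u /= u.sum(axis=0)                          # memberships sum to 1
        for _ in range(iters):
            um = u ** m
            centers = (um @ pixels) / um.sum(axis=1)             # (c,)
            dist = np.abs(pixels[None, :] - centers[:, None]) + 1e-9
            u = 1.0 / (dist ** (2 / (m - 1)))       # u_ik proportional to d^(-2/(m-1))
            u /= u.sum(axis=0)
        return centers, u

    # Usage: segment a grayscale image into 3 intensity clusters.
    img = np.random.rand(64, 64)                    # stand-in for a flame frame
    centers, u = fuzzy_c_means(img.ravel(), c=3)
    labels = u.argmax(axis=0).reshape(img.shape)    # hard segmentation map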


2018 ◽  
Vol 24 (3) ◽  
pp. 467-489 ◽  
Author(s):  
MARC TANTI ◽  
ALBERT GATT ◽  
KENNETH P. CAMILLERI

When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN – conditioning the language model by ‘injecting’ image features – or in a layer following the RNN – conditioning the language model by ‘merging’ image features. While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper, we empirically show that the choice between the two architectures is not especially consequential for performance. The merge architecture does have practical advantages, however, as conditioning by merging allows the RNN’s hidden state vector to shrink in size by up to four times. Our results suggest that the visual and linguistic modalities for caption generation need not be jointly encoded by the RNN, as that yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.
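The contrast between the two conditioning options can be made concrete in code. The following PyTorch sketch (an illustration under assumed dimensions, not the authors' models) shows that in the inject design the image features pass through the RNN itself, while in the merge design they join only at the output layer — which is why the merge RNN's hidden state can be much smaller.

    import torch
    import torch.nn as nn

    class InjectCaptioner(nn.Module):
        """'Inject': image features enter the RNN, so its hidden state
        must encode both modalities and tends to be large."""
        def __init__(self, vocab, embed=256, img=512, hidden=512):
            super().__init__()
            self.embed = nn.Embedding(vocab, embed)
            self.rnn = nn.LSTM(embed + img, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, img_feat, caps):
            x = self.embed(caps)
            img_feat = img_feat.unsqueeze(1).expand(-1, x.size(1), -1)
            h, _ = self.rnn(torch.cat([x, img_feat], dim=-1))
            return self.out(h)

    class MergeCaptioner(nn.Module):
        """'Merge': the RNN encodes language only; image features join in
        a later layer, allowing a much smaller hidden state."""
        def __init__(self, vocab, embed=256, img=512, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab, embed)
            self.rnn = nn.LSTM(embed, hidden, batch_first=True)
            self.out = nn.Linear(hidden + img, vocab)

        def forward(self, img_feat, caps):
            h, _ = self.rnn(self.embed(caps))
            img_feat = img_feat.unsqueeze(1).expand(-1, h.size(1), -1)
            return self.out(torch.cat([h, img_feat], dim=-1))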


Author(s):  
Mehryar Emambakhsh ◽  
Hossein Ebrahimnezhad ◽  
Mohammad Sedaaghi

Integrated region-based segmentation using color components and texture features with prior shape knowledge

Segmentation is the art of partitioning an image into different regions, where each one has some degree of uniformity in its feature space. A number of methods have been proposed, and blind segmentation is one of them. It uses intrinsic image features, such as pixel intensity, color components, and texture. However, factors such as poor contrast, noise, and occlusion can weaken the procedure. To overcome them, prior knowledge of the object of interest has to be incorporated in a top-down procedure for segmentation. Consequently, in this work, a novel integrated algorithm is proposed that combines bottom-up (blind) and top-down (shape-prior) techniques. First, a color space transformation is performed. Then, an energy function (based on nonlinear diffusion of color components and directional derivatives) is defined. Next, signed-distance functions are generated from different shapes of the object of interest. Finally, a variational framework (based on the level set) is employed to minimize the energy function. The experimental results demonstrate the good performance of the proposed method compared with others and show its robustness in the presence of noise and occlusion. The proposed algorithm is applicable to outdoor and medical image segmentation as well as optical character recognition (OCR).
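One step of this pipeline — generating signed-distance functions from binary masks of the prior shapes — is simple enough to sketch; the full variational level-set minimization is not reproduced. The function below is an illustration using SciPy's Euclidean distance transform, not the authors' code, and the circle mask is a placeholder shape.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def signed_distance(mask):
        """Signed-distance function of a binary shape mask: negative inside
        the shape, positive outside, zero near the boundary. Such SDFs
        encode the prior shapes used in the level-set energy."""
        inside = distance_transform_edt(mask)
        outside = distance_transform_edt(~mask)
        return outside - inside

    # Usage: SDF of a filled circle as a toy prior shape.
    yy, xx = np.mgrid[:64, :64]
    circle = (xx - 32) ** 2 + (yy - 32) ** 2 < 15 ** 2
    sdf = signed_distance(circle)
    print(sdf.min() < 0 < sdf.max())   # True: negative inside, positive outside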


Sensors ◽  
2020 ◽  
Vol 20 (2) ◽  
pp. 342 ◽  
Author(s):  
Yassin Kortli ◽  
Maher Jridi ◽  
Ayman Al Falou ◽  
Mohamed Atri

Over the past few decades, interest in theories and algorithms for face recognition has been growing rapidly. Video surveillance, criminal identification, building access control, and unmanned and autonomous vehicles are just a few examples of concrete applications that are gaining traction in industry. Various techniques are being developed, including local, holistic, and hybrid approaches, which describe a face image using only a few of its features or the whole face. The main contribution of this survey is to review some well-known techniques for each approach and to give a taxonomy of their categories. A detailed comparison between these techniques is presented, listing the advantages and disadvantages of their schemes in terms of robustness, accuracy, complexity, and discrimination. The paper also examines the databases used for face recognition, giving an overview of the most commonly used ones, including those for supervised and unsupervised learning. Numerical results of the most interesting techniques are given, along with the experimental context and the challenges handled by these techniques. Finally, the paper offers a solid discussion of future directions in terms of techniques to be used for face recognition.
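As a concrete instance of the local approach in the survey's taxonomy, the following sketch computes a uniform local binary pattern (LBP) histogram, a classic local texture descriptor for face images, using scikit-image. The random face arrays and the Euclidean histogram distance are placeholder assumptions for illustration.

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_histogram(gray_face, P=8, R=1.0):
        """Local-approach descriptor: a normalized histogram of uniform
        local binary patterns over a grayscale face image."""
        lbp = local_binary_pattern(gray_face, P, R, method="uniform")
        n_bins = P + 2                      # uniform LBP yields P + 2 codes
        hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins),
                               density=True)
        return hist

    # Usage: compare two faces by histogram distance (smaller = more similar).
    face_a = (np.random.rand(96, 96) * 255).astype(np.uint8)  # stand-in faces
    face_b = (np.random.rand(96, 96) * 255).astype(np.uint8)
    dist = np.linalg.norm(lbp_histogram(face_a) - lbp_histogram(face_b))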


Author(s):  
Yash Mehta ◽  
Samin Fatehi ◽  
Amirmohammad Kazameini ◽  
Clemens Stachl ◽  
Erik Cambria ◽  
...  

2021 ◽  
Author(s):  
Jacob Raleigh Cheeseman ◽  
Roland Fleming ◽  
Filipp Schmidt

Many natural materials have complex, multi-scale structures. Consequently, the apparent identity of a surface can vary with the assumed spatial scale of the scene: a plowed field seen from afar can resemble corduroy seen up close. We investigated this ‘material-scale ambiguity’ using 87 photographs of diverse materials (e.g., water, sand, stone, metal, wood). Across two experiments, separate groups of participants (N = 72 adults) provided judgements of the material depicted in each image, either with or without manipulations of apparent distance (by verbal instructions, or adding objects of familiar size). Our results demonstrate that these manipulations can cause identical images to appear to belong to completely different material categories, depending on the perceived scale. Under challenging conditions, therefore, the perception of materials is susceptible to simple manipulations of apparent distance, revealing a striking example of top-down effects in the interpretation of image features.


2001 ◽  
Vol 27 (2) ◽  
pp. 249-276 ◽  
Author(s):  
Brian Roark

This paper describes the functioning of a broad-coverage probabilistic top-down parser, and its application to the problem of language modeling for speech recognition. The paper first introduces key notions in language modeling and probabilistic parsing, and briefly reviews some previous approaches to using syntactic structure for language modeling. A lexicalized probabilistic top-down parser is then presented, which performs very well, in terms of both the accuracy of returned parses and the efficiency with which they are found, relative to the best broad-coverage statistical parsers. A new language model that utilizes probabilistic top-down parsing is then outlined, and empirical results show that it improves upon previous work in test corpus perplexity. Interpolation with a trigram model yields an exceptional improvement relative to the improvement observed by other models, demonstrating the degree to which the information captured by our parsing model is orthogonal to that captured by a trigram model. A small recognition experiment also demonstrates the utility of the model.
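The interpolation step reported here can be sketched directly: the combined model scores each word as P(w | h) = λ·P_parser(w | h) + (1 − λ)·P_trigram(w | h), and test-corpus perplexity is computed from the combined probabilities. The probability functions below are hypothetical callables standing in for the two trained models; the mixing weight is an assumption.

    import math

    def interpolated_perplexity(words, p_parser, p_trigram, lam=0.5):
        """Perplexity of a test corpus under a linear interpolation of two
        language models. p_parser and p_trigram are hypothetical callables
        returning each model's conditional probability of the next word
        given its history."""
        log_prob = 0.0
        for i, w in enumerate(words):
            history = words[:i]
            p = lam * p_parser(w, history) + (1 - lam) * p_trigram(w, history)
            log_prob += math.log2(p)
        return 2 ** (-log_prob / len(words))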


2020 ◽  
Vol 10 (17) ◽  
pp. 5978
Author(s):  
Viktar Atliha ◽  
Dmitrij Šešok

Image captioning is an important task for improving human-computer interaction, as well as for a deeper understanding of the mechanisms underlying image description by humans. In recent years, this research field has developed rapidly and a number of impressive results have been achieved. Typical models are based on neural networks, including convolutional ones for encoding images and recurrent ones for decoding them into text. In addition, attention mechanisms and transformers are actively used to boost performance. However, even the best models are limited in quality by a lack of data: generating a variety of descriptions of objects in different situations requires a large training set, and the commonly used datasets, although rather large in the number of images, are quite small in the number of different captions per image. We expanded the training dataset using text augmentation methods: augmentation with synonyms as a baseline, and the state-of-the-art language model called Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than models trained on the dataset without augmentation.
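The BERT-based augmentation described here can be sketched with the Hugging Face transformers fill-mask pipeline: mask one word of a caption and keep BERT's top in-context predictions as new caption variants. This illustrates the general technique, not the authors' exact procedure; the example caption and masked position are arbitrary.

    from transformers import pipeline

    # Masked-LM augmentation: mask one word of a caption and let BERT
    # propose in-context replacements, yielding extra caption variants.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    def augment_caption(words, idx, top_k=3):
        """Return up to top_k variants of a caption with words[idx]
        replaced by BERT's highest-scoring in-context predictions."""
        masked = words[:idx] + [fill.tokenizer.mask_token] + words[idx + 1:]
        preds = fill(" ".join(masked), top_k=top_k)
        return [p["sequence"] for p in preds]

    print(augment_caption("a dog runs across the field".split(), idx=1))
    # e.g. ['a man runs across the field', 'a horse runs across the field', ...]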

