Chinese Image Captioning via Fuzzy Attention-based DenseNet-BiLSTM

Author(s):  
Huimin Lu ◽  
Rui Yang ◽  
Zhenrong Deng ◽  
Yonglin Zhang ◽  
Guangwei Gao ◽  
...  

Chinese image description generation usually faces several challenges, such as reliance on single-feature extraction, a lack of global information, and insufficiently detailed descriptions of the image content. To address these limitations, we propose a fuzzy attention-based DenseNet-BiLSTM Chinese image captioning method in this article. In the proposed method, we first improve the densely connected network to extract image features at different scales and to enhance the model's ability to capture weak features. At the same time, a bidirectional LSTM is used as the decoder to make better use of context information. The introduction of an improved fuzzy attention mechanism effectively alleviates the problem of aligning image features with contextual information. We conduct experiments on the AI Challenger dataset to evaluate the performance of the model. The results show that, compared with other models, our proposed model achieves higher scores on objective quantitative evaluation metrics, including BLEU (at multiple n-gram orders), METEOR, ROUGE-L, and CIDEr. The generated description sentences accurately express the image content.
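
The following is a minimal sketch of the overall decoder idea the abstract describes: image features attended per word and fed, together with the word embeddings, into a bidirectional LSTM. The tensor shapes, module names, and the plain softmax attention are assumptions for illustration; the paper's fuzzy attention presumably computes its weights differently.

```python
import torch
import torch.nn as nn

class FuzzyAttentionCaptioner(nn.Module):
    """Hypothetical attention-over-features BiLSTM captioner (not the authors' exact model)."""
    def __init__(self, feat_dim=1024, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att = nn.Linear(feat_dim + embed_dim, 1)      # scores each spatial location per word
        self.bilstm = nn.LSTM(embed_dim + feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, L, feat_dim) spatial features from a DenseNet-style encoder
        # captions: (B, T) token ids of the teacher-forced caption
        B, L, D = feats.shape
        T = captions.size(1)
        emb = self.embed(captions)                                  # (B, T, E)
        # pair every word with every image location and score the pairs
        f = feats.unsqueeze(1).expand(B, T, L, D)
        e = emb.unsqueeze(2).expand(B, T, L, emb.size(-1))
        alpha = torch.softmax(self.att(torch.cat([f, e], dim=-1)), dim=2)
        ctx = (alpha * f).sum(dim=2)                                # (B, T, feat_dim)
        hidden, _ = self.bilstm(torch.cat([emb, ctx], dim=-1))      # (B, T, 2*hidden)
        return self.out(hidden)                                     # (B, T, vocab)
```

For example, `FuzzyAttentionCaptioner()(torch.randn(2, 49, 1024), torch.randint(0, 10000, (2, 12)))` returns word logits of shape (2, 12, 10000).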

2020 ◽  
Vol 2020 ◽  
pp. 1-11 ◽  
Author(s):  
Wen Qian ◽  
Chao Zhou ◽  
Dengyin Zhang

In this paper, we present an extremely computation-efficient model called FAOD-Net for single-image dehazing. FAOD-Net is based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. Moreover, a pyramid pooling module is added to FAOD-Net to aggregate context information from different regions of the image, thereby improving the network's ability to capture the global information of the foggy image. To obtain the best FAOD-Net, we train our proposed model on the RESIDE training set. In addition, we carry out extensive experiments on the RESIDE test set, using full-reference and no-reference image quality metrics to measure the dehazing effect. Experimental results show that the proposed algorithm achieves satisfactory results in terms of both defogging quality and speed.
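
Below is a sketch of the two building blocks the abstract names: a depthwise separable convolution and a pyramid pooling module. Channel counts, bin sizes, and the layout are assumptions, not the actual FAOD-Net architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return F.relu(self.pointwise(self.depthwise(x)))

class PyramidPooling(nn.Module):
    """Pools the feature map at several scales and concatenates the results,
    giving the network access to coarse global context."""
    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, channels // len(bins), 1))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
                  for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)   # original channels + pooled context
```

Depthwise separable convolutions cut the parameter and FLOP count roughly by the kernel area, which is why they suit a "computation-efficient" dehazing model.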


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1184
Author(s):  
Peng Tian ◽  
Hongwei Mo ◽  
Laihao Jiang

Object detection, visual relationship detection, and image captioning, the three main visual tasks in scene understanding, are highly correlated and correspond to different semantic levels of a scene image. However, existing captioning methods simply convert the extracted image features into description text, and the obtained results are not satisfactory. In this work, we propose a Multi-level Semantic Context Information (MSCI) network with an overall symmetrical structure that leverages the mutual connections across the three semantic layers and extracts the context information between them, in order to jointly solve the three vision tasks and achieve an accurate and comprehensive description of the scene image. The model uses a feature refining structure to establish the mutual connections and iteratively update the different semantic features of the image. A context information extraction network then extracts the context information between the three semantic layers, and an attention mechanism is introduced to improve the accuracy of image captioning, while the context information between the semantic layers is also used to improve the accuracy of object detection and relationship detection. Experiments on the VRD and COCO datasets demonstrate that our proposed model can leverage the context information between semantic layers to improve the accuracy of all three visual tasks.
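
A toy sketch of the "mutually connected, iteratively refined" idea follows: three feature sets (object-, relation-, and caption-level) exchange information for a few rounds. The dimensions and the simple residual update rule are assumptions, not the MSCI network itself.

```python
import torch
import torch.nn as nn

class MutualRefinement(nn.Module):
    """Iteratively refines three semantic-level features using their concatenation."""
    def __init__(self, dim=512, steps=2):
        super().__init__()
        self.steps = steps
        # one update network per semantic level, fed with all three levels
        self.update = nn.ModuleList(nn.Linear(3 * dim, dim) for _ in range(3))

    def forward(self, obj, rel, cap):
        feats = [obj, rel, cap]                    # each: (B, dim)
        for _ in range(self.steps):
            joint = torch.cat(feats, dim=-1)       # (B, 3*dim) shared context
            feats = [f + torch.tanh(u(joint))      # residual refinement per level
                     for f, u in zip(feats, self.update)]
        return feats
```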


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Xiaodong Liu ◽  
Songyang Li ◽  
Miao Wang

The context, such as scenes and objects, plays an important role in video emotion recognition, and recognition accuracy can be further improved when context information is incorporated. Although previous research has considered context information, the emotional cues contained in different images may differ, which is often ignored. To address the problem of emotion differences between modalities and between images, this paper proposes a hierarchical attention-based multimodal fusion network for video emotion recognition, which consists of a multimodal feature extraction module and a multimodal feature fusion module. The multimodal feature extraction module has three subnetworks that extract features from facial, scene, and global images. Each subnetwork consists of two branches: the first branch extracts the features of the modality, and the other branch generates an emotion score for each image. The features and emotion scores of all images in a modality are aggregated to generate the emotion feature of that modality. The fusion module takes the multimodal features as input and generates an emotion score for each modality. Finally, the features and emotion scores of the modalities are aggregated to produce the final emotion representation of the video. Experimental results show that our proposed method is effective on the emotion recognition dataset.
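
A hypothetical sketch of the two-stage aggregation described above: images are weighted inside each modality by a learned score, and the resulting modality features are weighted again at the fusion stage. Feature dimensions, the number of classes, and the softmax weighting are assumptions.

```python
import torch
import torch.nn as nn

class ScoreWeightedPool(nn.Module):
    """Scores each item, softmax-normalizes the scores, and returns the
    weighted sum of the item features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                      # x: (B, N, dim)
        w = torch.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)              # (B, dim)

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.face_pool = ScoreWeightedPool(dim)    # image-level pooling per modality
        self.scene_pool = ScoreWeightedPool(dim)
        self.global_pool = ScoreWeightedPool(dim)
        self.modal_pool = ScoreWeightedPool(dim)   # modality-level pooling
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, face, scene, glob):          # each: (B, N, dim) per-frame features
        modal = torch.stack([self.face_pool(face),
                             self.scene_pool(scene),
                             self.global_pool(glob)], dim=1)   # (B, 3, dim)
        return self.cls(self.modal_pool(modal))                # emotion logits
```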


Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1270
Author(s):  
Kiyohiko Iwamura ◽  
Jun Younes Louhi Kasahara ◽  
Alessandro Moro ◽  
Atsushi Yamashita ◽  
Hajime Asama

Automatic image captioning has many important applications, such as describing visual content for visually impaired people or indexing images on the internet. Recently, deep learning-based image captioning models have been researched extensively. For caption generation, they learn the relation between image features and the words included in the captions. However, image features might not be relevant for certain words, such as verbs. Therefore, our earlier reported method used motion features along with image features to generate captions that include verbs. However, all of the motion features were used, and since not all of them contributed positively to the captioning process, the unnecessary motion features decreased captioning accuracy. Here, we conduct experiments with motion features to thoroughly analyze the reasons for this decline in accuracy, and we propose a novel, end-to-end trainable method for image caption generation that alleviates it. Our proposed model was evaluated using three datasets: MSR-VTT2016-Image, MSCOCO, and several copyright-free images. Results demonstrate that our proposed method improves caption generation performance.
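
One way to suppress unhelpful motion information is to attend over the motion features before fusing them with the image features; the sketch below illustrates that idea. The attention form and dimensions are assumptions, not the authors' exact mechanism.

```python
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    """Selects useful motion features via image-conditioned attention."""
    def __init__(self, dim=512):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, image_feat, motion_feats):
        # image_feat: (B, dim); motion_feats: (B, M, dim)
        q = self.query(image_feat).unsqueeze(1)                              # (B, 1, dim)
        k = self.key(motion_feats)                                           # (B, M, dim)
        att = torch.softmax((q * k).sum(-1, keepdim=True) / k.size(-1) ** 0.5, dim=1)
        motion_ctx = (att * motion_feats).sum(dim=1)                         # (B, dim)
        return torch.cat([image_feat, motion_ctx], dim=-1)                   # fused feature
```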


Information ◽  
2018 ◽  
Vol 9 (10) ◽  
pp. 252 ◽  
Author(s):  
Andrea Apicella ◽  
Anna Corazza ◽  
Francesco Isgrò ◽  
Giuseppe Vettigli

The use of ontological knowledge to improve classification results is a promising line of research. The availability of a probabilistic ontology raises the possibility of combining the probabilities coming from the ontology with the ones produced by a multi-class classifier that detects particular objects in an image. This combination not only provides the relations existing between the different segments but can also improve the classification accuracy; indeed, it is known that contextual information can often suggest the correct class. This paper proposes a model that implements this integration, and the experimental assessment shows the effectiveness of the integration, especially when the classifier's accuracy is relatively low. To assess the performance of the proposed model, we designed and implemented a simulated classifier whose performance can be set a priori with sufficient precision.
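
One simple way to realize such a combination is to multiply the classifier's per-class probabilities by the ontology-derived probabilities and renormalize, in the spirit of a Bayesian prior. This is an illustrative assumption, not necessarily the authors' exact integration rule.

```python
import numpy as np

def combine_probabilities(classifier_probs, ontology_probs, eps=1e-12):
    """classifier_probs, ontology_probs: arrays of shape (num_classes,) summing to 1."""
    fused = classifier_probs * ontology_probs + eps   # element-wise product of the two sources
    return fused / fused.sum()                        # renormalize to a distribution

# Example: a weak classifier prediction corrected by contextual knowledge.
clf = np.array([0.40, 0.35, 0.25])        # classifier is unsure between classes 0 and 1
onto = np.array([0.10, 0.70, 0.20])       # ontology favors class 1 given the context
print(combine_probabilities(clf, onto))   # class 1 now clearly dominates
```

The gain is largest exactly when the classifier distribution is flat, which matches the observation that the integration helps most when the classifier's accuracy is low.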


2012 ◽  
Vol 12 (01) ◽  
pp. 1250001 ◽  
Author(s):  
Tarek Helmy

Automatic image categorization and description are key components of many applications, e.g., multimedia database management, web content analysis, human–computer interaction, and biometrics. In general, image description is a difficult task because of the wide variety of objects potentially to be recognized and the complexity and variety of backgrounds. This paper introduces a computational model for context-based image categorization and description. First, for a given image, a classifier is trained on the associated text features using advanced concepts, so that it can assign the image to a specific category. Then, similarity matching with that category's annotated templates is performed, as is done for images in every other category. The proposed model uses novel text and image features that allow it to differentiate between geometrical images (GIs) and ordinary images. The experimental results show that the model is able to correctly categorize images, with an expected increase in similarity-matching performance as larger datasets and the neural document classifier (NDC) are used. An important feature of the proposed model is that its matching techniques, specific to a particular category, can be easily integrated and developed for other categories.
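
The two-step pipeline (text-based category prediction, then similarity matching against that category's annotated templates) could look roughly like the sketch below. The feature extractors, the category classifier, and the cosine-similarity matcher are placeholders, not the paper's actual components.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def describe(image_feat, text_feat, category_classifier, templates_by_category):
    """category_classifier: callable mapping text features to a category label.
    templates_by_category: {category: [(template_feat, description), ...]}."""
    category = category_classifier(text_feat)                # e.g., "geometrical" vs "ordinary"
    best = max(templates_by_category[category],
               key=lambda t: cosine(image_feat, t[0]))       # closest annotated template
    return category, best[1]                                 # category and matched description
```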


2021 ◽  
Author(s):  
Narmatha C ◽  
Manimegalai P ◽  
Krishnadass J ◽  
Prajoona Valsalan ◽  
Manimurugan S

This research presents an essential solution for classifying ultrasound diagnostic images of seven types of ovarian cysts: follicular cyst, hemorrhagic cyst, corpus luteum cyst, polycystic-appearing ovary, endometriosis cyst, dermoid cyst, and teratoma. With this motivation, the work proposes a novel technique using ovarian cyst ultrasound images from an ongoing database. The pipeline first removes noise in preprocessing, then extracts features, and finally classifies the images using a new Deep Q-Network with a Harris Hawks Optimization (HHO) classifier. Automatic feature extraction is implemented using the popular convolutional neural network (CNN) technique, which extracts image features that serve as the states of the reinforcement learning algorithm. A Deep Q-Network (DQN) is then trained on these features through a new deep Q-learning procedure. The swarm-based HHO method is used to optimize the hyperparameters of the DQN model, yielding HHO-DQN, a novel technique for classifying ovarian cysts. Extensive experimental evaluations on the datasets show that the proposed HHO-DQN approach outperforms existing active learning approaches for ovarian cyst classification. Compared with the ANN, CNN, and AlexNet models, the proposed model performs better in terms of precision, F-measure, recall, accuracy, and IoU, achieving 96% precision, 96.5% F-measure, 96% recall, 97% accuracy, and 0.65 IoU.
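
The outer loop of swarm-based hyperparameter tuning for a DQN could be organized as sketched below. The fitness function is a stub standing in for an actual DQN training run, and a simple perturbation replaces the Harris Hawks update rules, which are omitted here; the searched hyperparameters (learning rate and discount factor) are illustrative choices.

```python
import random

def evaluate_dqn(hparams):
    """Placeholder fitness: in practice, train a DQN with `hparams` and return
    validation accuracy. Here it is a toy surrogate so the loop runs."""
    lr, gamma = hparams
    return -abs(lr - 1e-3) - abs(gamma - 0.95)

def swarm_search(num_hawks=10, iterations=20):
    # initial population of candidate (learning_rate, gamma) pairs
    hawks = [(10 ** random.uniform(-5, -1), random.uniform(0.8, 0.999))
             for _ in range(num_hawks)]
    best = max(hawks, key=evaluate_dqn)
    for _ in range(iterations):
        # HHO would move hawks toward the current best ("prey") using its
        # exploration/exploitation phases; a random perturbation is used here instead.
        hawks = [(max(1e-6, best[0] * 10 ** random.uniform(-0.5, 0.5)),
                  min(0.999, max(0.8, best[1] + random.uniform(-0.02, 0.02))))
                 for _ in hawks]
        best = max(hawks + [best], key=evaluate_dqn)
    return best   # best (learning_rate, gamma) found

print(swarm_search())
```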


Author(s):  
Cheng-Kuan Chen ◽  
Zhufeng Pan ◽  
Ming-Yu Liu ◽  
Min Sun

Most existing works on image description focus on generating expressive descriptions. The few works dedicated to generating stylish (e.g., romantic, lyric) descriptions suffer from limited style variation and content digression. To address these limitations, we propose a controllable stylish image description generation model. It can learn to generate stylish image descriptions that are more related to the image content and can be trained on an arbitrary monolingual corpus, without collecting new pairs of images and stylish descriptions. Moreover, it enables users to generate various stylish descriptions by plugging in style-specific parameters to include new styles into the existing model. We achieve this capability via a novel layer normalization design, which we refer to as the Domain Layer Norm (DLN). Extensive experimental validation and a user study on various stylish image description generation tasks are conducted to show the competitive advantages of the proposed model.
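
A minimal sketch of the Domain Layer Norm idea is shown below: a shared normalization with per-style scale and shift parameters that can be swapped in to change the description style. The style names, dimensions, and the add_style helper are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainLayerNorm(nn.Module):
    """Layer normalization whose affine parameters are selected by style."""
    def __init__(self, dim, styles=("factual", "romantic", "humorous")):
        super().__init__()
        self.dim = dim
        self.scale = nn.ParameterDict({s: nn.Parameter(torch.ones(dim)) for s in styles})
        self.shift = nn.ParameterDict({s: nn.Parameter(torch.zeros(dim)) for s in styles})

    def forward(self, x, style):
        # normalize without affine parameters, then apply the style-specific ones
        y = F.layer_norm(x, (self.dim,))
        return self.scale[style] * y + self.shift[style]

    def add_style(self, name):
        """Plug a new style into an already trained model (hypothetical helper)."""
        self.scale[name] = nn.Parameter(torch.ones(self.dim))
        self.shift[name] = nn.Parameter(torch.zeros(self.dim))
```

Because only the small per-style scale and shift vectors differ between styles, a new style can be added by training those parameters alone on a monolingual stylish corpus, which matches the plug-in behavior the abstract describes.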


2021 ◽  
Vol 7 ◽  
pp. e786
Author(s):  
Vaibhav Bhat ◽  
Anita Yadav ◽  
Sonal Yadav ◽  
Dhivya Chandrasekaran ◽  
Vijay Mago

Emotion recognition in conversations is an important step in virtual chatbots that require opinion-based feedback, such as in social media threads, online support, and many other applications. Current models for emotion recognition in conversations face issues such as: (a) loss of contextual information between two dialogues of a conversation, (b) failure to give appropriate importance to significant tokens in each utterance, and (c) inability to pass on emotional information from previous utterances. The proposed model of Advanced Contextual Feature Extraction (AdCOFE) addresses these issues by performing unique feature extraction using knowledge graphs, sentiment lexicons, and natural-language phrases at all levels (word and position embedding) of the utterances. Experiments on emotion recognition in conversations datasets show that AdCOFE is beneficial in capturing emotions in conversations.
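
A rough sketch of enriching utterance tokens with position and lexicon information before contextual encoding follows. The sentiment-lexicon lookup, the projection, and the dimensions are assumptions for illustration rather than AdCOFE's actual feature extraction.

```python
import torch
import torch.nn as nn

class EnrichedUtteranceEncoder(nn.Module):
    """Combines word, position, and lexicon-based sentiment information per token."""
    def __init__(self, vocab_size=30000, dim=256, max_len=128):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(max_len, dim)
        self.sent_proj = nn.Linear(1, dim)   # projects a scalar lexicon sentiment score

    def forward(self, tokens, sentiment_scores):
        # tokens: (B, T) token ids; sentiment_scores: (B, T) float scores from a lexicon
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return (self.word_embed(tokens)
                + self.pos_embed(pos)
                + self.sent_proj(sentiment_scores.unsqueeze(-1)))   # (B, T, dim)
```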


2022 ◽  
Vol 2022 ◽  
pp. 1-9
Author(s):  
Junlong Feng ◽  
Jianping Zhao

Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in humanlike sentence generation. However, the explicit separation between encoder and decoder introduces a disconnection between the image and the sentence. It usually leads to a rough image description: the generated caption covers only the main instances and unexpectedly neglects additional objects and scenes, which reduces the consistency of the caption with the image. To address this issue, we propose an image captioning system with context-fused guidance in this paper. It incorporates regional and global image representations as compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, a visual concept is employed. To avoid misleading decoding, a context fusion gate is introduced to compute the textual context by selectively aggregating the information of the visual concept and the word embedding. Subsequently, the context-fused image guidance is formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture is constructed to generate captions. Moreover, to overcome exposure bias, we train the proposed model through sequential decision-making. The experiments conducted on the MS COCO dataset show the outstanding performance of our work, and the linguistic analysis demonstrates that our model improves the consistency of the captions with the images.
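
The context fusion gate could be realized as a sigmoid gate that decides, per dimension, how much of the visual-concept vector versus the word embedding enters the textual context, as sketched below; the exact formulation in the paper may differ, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContextFusionGate(nn.Module):
    """Selectively aggregates a visual-concept vector and a word embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual_concept, word_embed):   # each: (B, dim)
        g = self.gate(torch.cat([visual_concept, word_embed], dim=-1))
        # convex, element-wise mixture of the two information sources
        return g * visual_concept + (1 - g) * word_embed
```

The gate output would then be combined with the compositional (regional plus global) visual features to form the context-fused guidance passed to the two-layer LSTM captioner.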

