Image Caption Generation Using Multi-Level Semantic Context Information

Object detection, visual relationship detection, and image captioning are the three main visual tasks in scene understanding. They are highly correlated and correspond to different semantic levels of a scene image. However, existing captioning methods convert the extracted image features into description text, and the results are not satisfactory. In this work, we propose a Multi-level Semantic Context Information (MSCI) network with an overall symmetrical structure. It leverages the mutual connections across the three semantic layers and extracts the context information between them, solving the three vision tasks jointly to achieve an accurate and comprehensive description of the scene image. The model uses a feature-refining structure to build mutual connections and iteratively update the different semantic features of the image. A context information extraction network then extracts the context information between the three semantic layers, and an attention mechanism is introduced to improve the accuracy of image captioning, while the context information between the semantic layers is used to improve the accuracy of object detection and relationship detection. Experiments on the VRD and COCO datasets demonstrate that the proposed model can leverage the context information between semantic layers to improve the accuracy of these visual tasks.

Chinese Image Captioning via Fuzzy Attention-based DenseNet-BiLSTM

ACM Transactions on Multimedia Computing Communications and Applications, 2021, Vol 17 (1s), pp. 1-18

DOI: 10.1145/3422668

Cited By ~ 1

Author(s): Huimin Lu, Rui Yang, Zhenrong Deng, Yonglin Zhang, Guangwei Gao, ...

Keyword(s): Feature Extraction, Contextual Information, Image Features, Context Information, Global Information, Image Description, Image Captioning, Image Content, Single Feature, Proposed Model

Chinese image description generation usually faces several challenges, such as single-feature extraction, lack of global information, and lack of detailed description of the image content. To address these limitations, we propose a fuzzy attention-based DenseNet-BiLSTM Chinese image captioning method in this article. In the proposed method, we first improve the densely connected network to extract image features at different scales and to enhance the model's ability to capture weak features. At the same time, a bidirectional LSTM is used as the decoder to make better use of context information. An improved fuzzy attention mechanism improves the correspondence between image features and contextual information. We conduct experiments on the AI Challenger dataset to evaluate the performance of the model. The results show that, compared with other models, our proposed model achieves higher scores on objective quantitative evaluation metrics, including BLEU, METEOR, ROUGE-L, and CIDEr. The generated description sentences accurately express the image content.

Realize Your Surroundings: Exploiting Context Information for Small Object Detection

Neurocomputing, 2021

DOI: 10.1016/j.neucom.2020.12.093

Author(s): Jiaxu Leng, Yihui Ren, Wen Jiang, Xiaoding Sun, Ye Wang

Keyword(s): Object Detection, Context Information, Small Object, Small Object Detection

A new multi-scale backbone network for object detection based on asymmetric convolutions

Science Progress, 2021, Vol 104 (2), pp. 003685042110113

DOI: 10.1177/00368504211011343

Author(s): Xianghua Ma, Zhenkun Yang

Keyword(s): Object Detection, Image Features, Detection Accuracy, Mobile Platforms, Multi Scale, Backbone Network, Aspect Ratios, Pascal Voc, Scale Characteristics, Detection Speed

Real-time object detection on mobile platforms is a crucial but challenging computer vision task. It is widely recognized that although lightweight object detectors have a high detection speed, their detection accuracy is relatively low. To improve detection accuracy, it is beneficial to extract complete multi-scale image features in visual cognitive tasks. Asymmetric convolutions have a useful property: their different aspect ratios can be used to extract image features of objects, especially objects with multi-scale characteristics. In this paper, we exploit three different asymmetric convolutions in parallel and propose a new multi-scale asymmetric convolution unit, namely the MAC block, to enhance the multi-scale representation ability of CNNs. In addition, the MAC block can adaptively merge features at different scales by assigning learnable weight parameters to the three asymmetric convolution branches. The proposed MAC blocks can be inserted into a state-of-the-art backbone such as ResNet-50 to form a new multi-scale backbone network for object detectors. To evaluate the performance of the MAC block, we conduct experiments on the CIFAR-100, PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO 2014 datasets. Experimental results show that the detection precision can be greatly improved while a fast detection speed is maintained.
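
The idea of running asymmetric convolutions in parallel and merging them with learnable weights can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the kernel shapes (1x3, 3x1, 1x5), the averaging kernel values, the softmax normalization of the branch weights, and the names `conv2d_same` and `mac_block` are all assumptions made for illustration.

```python
import math


def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation that keeps the input height and width."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= x < w:  # outside pixels count as zero
                        acc += image[y][x] * kernel[a][b]
            out[i][j] = acc
    return out


def softmax(logits):
    """Turn unconstrained branch parameters into positive weights summing to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mac_block(image, kernels, branch_logits):
    """Apply each kernel in parallel and merge the branches by a weighted sum."""
    weights = softmax(branch_logits)  # assumption: softmax-normalized weights
    branches = [conv2d_same(image, k) for k in kernels]
    h, w = len(image), len(image[0])
    return [
        [sum(wt * br[i][j] for wt, br in zip(weights, branches)) for j in range(w)]
        for i in range(h)
    ]


# Hypothetical asymmetric kernels with different aspect ratios (averaging filters).
KERNELS = [
    [[1 / 3, 1 / 3, 1 / 3]],        # 1x3: wide
    [[1 / 3], [1 / 3], [1 / 3]],    # 3x1: tall
    [[1 / 5] * 5],                  # 1x5: wider
]

image = [[1.0] * 5 for _ in range(5)]
merged = mac_block(image, KERNELS, branch_logits=[0.0, 0.0, 0.0])
```

In a trained network the kernel values and `branch_logits` would be learned by backpropagation; here they are fixed so the merge step is easy to follow.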

Semantic Context-Aware Network for Multiscale Object Detection in Remote Sensing Images

IEEE Geoscience and Remote Sensing Letters, 2021, pp. 1-5

DOI: 10.1109/lgrs.2021.3067313

Author(s): Ke Zhang, Yulin Wu, Jingyu Wang, Yezi Wang, Qi Wang

Keyword(s): Remote Sensing, Object Detection, Semantic Context, Context Aware, Remote Sensing Images

An Efficient Module for Instance Segmentation Based on Multi-Level Features and Attention Mechanisms

Applied Sciences, 2021, Vol 11 (3), pp. 968

DOI: 10.3390/app11030968

Author(s): Yingchun Sun, Wang Gao, Shuguo Pan, Tao Zhao, Yahui Peng

Keyword(s): Feature Extraction, Spatial Structure, Semantic Feature, Semantic Features, Segmentation Method, Spatial Dimensions, Feature Pyramid, Multi Level, High Level, Instance Segmentation

Recently, multi-level feature networks have been extensively used in instance segmentation. However, because not all features are beneficial to instance segmentation tasks, network performance cannot be adequately improved by synthesizing multi-level convolutional features indiscriminately. To solve this problem, an attention-based feature pyramid module (AFPM) is proposed. It integrates the attention mechanism into a multi-level feature pyramid network to extract, efficiently and selectively, the high-level semantic features and low-level spatial structure features for instance segmentation. Firstly, we incorporate a convolutional block attention module (CBAM) into feature extraction and sequentially generate attention maps that focus on instance-related features along the channel and spatial dimensions. Secondly, we build inter-dimensional dependencies through a convolutional triplet attention module (CTAM) in lateral attention connections, which is used to propagate helpful semantic feature maps and filter out redundant features irrelevant to instance objects. Finally, we construct branches for feature enhancement that strengthen detailed information and boost the entire feature hierarchy of the network. Experimental results on the Cityscapes dataset show that the proposed module outperforms other strong methods under different evaluation metrics and effectively improves the performance of the instance segmentation method.

Exploring Context Information for Accurate and Fast Object Detection

Pattern Recognition and Computer Vision - Lecture Notes in Computer Science, 2019, pp. 228-238

DOI: 10.1007/978-3-030-31654-9_20

Author(s): Zhenjun Shi, Xiaoqi Li, Bin Zhang

Keyword(s): Object Detection, Context Information

Visual attention strategies for target object detection

DOI: 10.26686/wgtn.17067635

2021

Author(s): Ibrahim Mohammad Hussain Rahman

Keyword(s): Visual Attention, Object Detection, Target Object, Detection Accuracy, Estimation Model, Top Down, Bottom Up, Feature Map, Low Level, Visual Tasks

The human visual attention (HVA) system encompasses a set of interconnected neurological modules that analyze visual stimuli by attending to salient regions. Two contrasting biological mechanisms exist in the HVA system: bottom-up, data-driven attention and top-down, task-driven attention. The former is mostly responsible for low-level instinctive behaviors, while the latter is responsible for performing complex visual tasks such as target object detection. Very few computational models of top-down attention have been proposed, mainly for three reasons. First, the top-down process involves many influential factors. Second, top-down responses vary from task to task. Finally, many biological aspects of the top-down process are not yet well understood. For these reasons, it is difficult to devise a generalized top-down model that could be applied to all high-level visual tasks. Instead, this thesis addresses some outstanding issues in modelling top-down attention for one particular task, target object detection. Target object detection is an essential step in analyzing images before performing more complex visual tasks. It has not been investigated thoroughly in the context of top-down saliency modelling and hence constitutes the main application domain of this thesis. The thesis investigates methods to model top-down attention through various high-level data acquired from images. Furthermore, it investigates different strategies to dynamically combine bottom-up and top-down processes to improve the detection accuracy, as well as the computational efficiency, of existing and new visual attention models. The following techniques and approaches are proposed to address the outstanding issues in modelling top-down saliency:

1. A top-down saliency model that weights low-level attentional features through contextual knowledge of a scene. The proposed model assigns weights to the features of a novel image by extracting a contextual descriptor of the image. The contextual descriptor tunes the weighting of low-level features to maximize detection accuracy. Incorporating context into the feature weighting mechanism improves the quality of the weights assigned to these features.

2. Two modules of target features combined with contextual weighting to improve detection accuracy of the target object. In this model, two sets of attentional feature weights are learned, one through context and the other through target features. When both sources of knowledge are used to model top-down attention, a drastic increase in detection accuracy is achieved in images with complex backgrounds and a variety of target objects.

3. A top-down and bottom-up attention combination model based on feature interaction. This model provides a dynamic way of combining both processes by formulating the problem as feature selection. The feature selection exploits the interaction between these features, yielding a robust set of features that maximizes both the detection accuracy and the overall efficiency of the system.

4. A feature map quality score estimation model that can accurately predict the detection accuracy score of any novel feature map without the need for ground-truth data. The model extracts various local, global, geometrical and statistical characteristics from a feature map. These characteristics guide a regression model to estimate the quality of a novel map.

5. A dynamic feature integration framework for combining bottom-up and top-down saliencies at runtime. If the estimation model can accurately predict the quality score of any novel feature map, then dynamic feature map integration can be performed based on the estimated value. We propose two frameworks for feature map integration using the estimation model. The proposed integration framework achieves higher human fixation prediction accuracy with a minimal number of feature maps than is achieved by combining all feature maps.

The work proposed in this thesis provides new directions in modelling top-down saliency for target object detection. In addition, the dynamic approaches for combining top-down and bottom-up processes show considerable improvements over existing approaches in both efficiency and accuracy.
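
As a rough illustration of quality-driven feature map integration, the sketch below keeps only the feature maps whose estimated quality score reaches a threshold and merges them as a score-weighted average. The thresholding rule, the weighted average, and the names `integrate_feature_maps` and `min_score` are assumptions for illustration; the thesis abstract does not specify how its two integration frameworks combine the maps.

```python
def integrate_feature_maps(feature_maps, quality_scores, min_score=0.5):
    """Merge the feature maps whose estimated quality is at least min_score.

    feature_maps   -- list of equally sized 2D lists (one saliency map each)
    quality_scores -- one estimated score per map, e.g. from a regression model
    """
    selected = [
        (fmap, score)
        for fmap, score in zip(feature_maps, quality_scores)
        if score >= min_score
    ]
    if not selected:
        raise ValueError("no feature map reaches min_score")

    total = sum(score for _, score in selected)
    height, width = len(selected[0][0]), len(selected[0][0][0])
    # Score-weighted average over the selected maps only.
    return [
        [sum(score * fmap[i][j] for fmap, score in selected) / total
         for j in range(width)]
        for i in range(height)
    ]


# Three toy 2x2 maps; the third is judged low quality and is dropped.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
scores = [0.9, 0.6, 0.2]
combined = integrate_feature_maps(maps, scores, min_score=0.5)
```

Dropping low-scoring maps before merging mirrors the abstract's claim that a small number of well-chosen maps can outperform combining all of them.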

Multi-level Features Selection Network Based on Multi-attention for Salient Object Detection

DOI: 10.1007/978-3-030-87355-4_27

2021, pp. 315-326

Author(s): Jianyi Ren, Zheng Wang, Meijun Sun

Keyword(s): Object Detection, Salient Object Detection, Features Selection, Salient Object, Multi Level

Bayesian Networks for Image Understanding

Bayesian Network Technologies, 2007, pp. 128-150

DOI: 10.4018/978-1-59904-141-4.ch007

Author(s): Andreas Savakis, Jiebo Luo, Michael Kane

Keyword(s): Bayesian Networks, Object Detection, Bayesian Network, Structure Learning, Image Understanding, Semantic Features, Scene Classification, Low Level, Object Parts, Scene Content

Image understanding deals with extracting and interpreting scene content for use in various applications. In this chapter, we illustrate that Bayesian networks are particularly well-suited for image understanding problems, and present case studies in indoor-outdoor scene classification and parts-based object detection. First, improved scene classification is accomplished using both low-level features, such as color and texture, and semantic features, such as the presence of sky and grass. Integration of low-level and semantic features is achieved using a Bayesian network framework. The network structure can be determined by expert opinion or by automated structure learning methods. Second, object detection at multiple views relies on a parts-based approach, where specialized detectors locate object parts and a Bayesian network acts as the arbitrator in order to determine the object presence. In general, Bayesian networks are found to be powerful integrators of different features and help improve the performance of image understanding systems.
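
How a Bayesian network can integrate low-level and semantic evidence for indoor-outdoor classification can be sketched with a very small network in which the scene class is the parent of three binary observations. The structure (a naive-Bayes-style network), the variable names, and every probability below are illustrative assumptions, not values from the chapter.

```python
# Prior over the scene class (assumed uniform for illustration).
PRIOR = {"outdoor": 0.5, "indoor": 0.5}

# P(observation is True | scene). All numbers are made up for this sketch.
CPT = {
    "sky_detected": {"outdoor": 0.80, "indoor": 0.05},            # semantic feature
    "grass_detected": {"outdoor": 0.60, "indoor": 0.02},          # semantic feature
    "low_level_says_outdoor": {"outdoor": 0.75, "indoor": 0.30},  # color/texture cue
}


def posterior(evidence):
    """Return P(scene | evidence) by enumerating both scene classes.

    evidence -- dict mapping an observation name in CPT to True or False;
                observations that are left out are simply not used.
    """
    joint = {}
    for scene, prior in PRIOR.items():
        p = prior
        for name, observed in evidence.items():
            p_true = CPT[name][scene]
            p *= p_true if observed else 1.0 - p_true
        joint[scene] = p
    total = sum(joint.values())
    return {scene: p / total for scene, p in joint.items()}


# Sky found, no grass found, and the low-level classifier votes "outdoor".
result = posterior({
    "sky_detected": True,
    "grass_detected": False,
    "low_level_says_outdoor": True,
})
```

With these assumed numbers the semantic sky cue dominates and the posterior strongly favors "outdoor"; in practice the structure and conditional probabilities would come from expert opinion or structure learning, as the chapter abstract notes.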
