Improving Object Detection Quality by Incorporating Global Contexts via Self-Attention

Electronics ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 90
Author(s):  
Donghyeon Lee ◽  
Joonyoung Kim ◽  
Kyomin Jung

Fully convolutional structures provide feature maps that capture the local contexts of an image simply by stacking numerous convolutional layers. These structures are known to be effective in modern state-of-the-art object detectors such as Faster R-CNN and SSD, which find objects from local contexts. However, the quality of object detectors can be further improved by incorporating global contexts when ambiguous objects must be identified from surrounding objects or background. In this paper, we introduce a self-attention module that allows object detectors to incorporate global contexts. More specifically, our self-attention module allows the feature extractor to compute feature maps with global contexts via the self-attention mechanism. The module computes relationships among all elements in the feature maps and then blends the feature maps according to the computed relationships. It can therefore capture long-range relationships among objects or backgrounds, which is difficult for fully convolutional structures. Furthermore, the proposed module is not limited to any specific object detector and can be applied to any CNN-based model for any computer vision task. In experiments on object detection, our method shows remarkable gains in average precision (AP) compared to popular fully convolutional models. In particular, compared to Faster R-CNN with a ResNet-50 backbone, our module applied to the same backbone achieved a +4.0 AP gain without bells and whistles. On image semantic segmentation and panoptic segmentation tasks, our module improved performance in all metrics used for each task.
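A minimal sketch of this kind of self-attention over CNN feature maps, assuming it follows the standard query/key/value (non-local) formulation; the class name, reduction ratio, and learned blend weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Blends global context into CNN feature maps via self-attention (a sketch)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # assumed learned blend weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/r)
        k = self.key(x).flatten(2)                     # (B, C/r, HW)
        attn = torch.softmax(q @ k, dim=-1)            # relationships among all HW positions
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out                    # blend global context with local features
```

Because the attention matrix relates every spatial position to every other, a drop-in block like this captures the long-range relationships the abstract describes without changing the backbone's interface.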

2020 ◽  
Vol 17 (4) ◽  
pp. 172988142093606
Author(s):  
Xiaoguo Zhang ◽  
Ye Gao ◽  
Huiqing Wang ◽  
Qing Wang

Effectively and efficiently recognizing multi-scale objects is one of the key challenges in applying deep convolutional neural networks to object detection. YOLOv3 (You Only Look Once v3) is a state-of-the-art object detector with good performance in both accuracy and speed; however, scale variation remains a challenging problem. Considering that detection performance on multi-scale objects is related to the receptive fields of the network, in this work we propose a novel dilated spatial pyramid module that integrates multi-scale information to deal with the scale variation problem. First, the input of the dilated spatial pyramid is fed into multiple parallel branches with different dilation rates to generate feature maps with different receptive fields. Then, the input of the dilated spatial pyramid and the outputs of the different branches are concatenated to integrate multi-scale information. Moreover, the dilated spatial pyramid is integrated with YOLOv3 in front of the first detection header, yielding the dilated spatial pyramid-You Only Look Once (DSP-YOLO) model. Experimental results on PASCAL VOC2007 demonstrate that DSP-YOLO outperforms other state-of-the-art methods in mean average precision while keeping a satisfying real-time detection speed. For 416 × 416 input, DSP-YOLO achieves 82.2% mean average precision at 56 frames per second, 3.9% higher than YOLOv3 with only a slight speed drop.
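A sketch of the dilated-spatial-pyramid idea as described: parallel 3 × 3 branches with different dilation rates, concatenated together with the module input. The branch count, dilation rates, and channel widths below are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

class DilatedSpatialPyramid(nn.Module):
    """Parallel dilated branches whose outputs are concatenated with the input (a sketch)."""
    def __init__(self, in_ch, branch_ch=256, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding == dilation keeps the spatial size while growing the receptive field
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r) for r in rates
        )

    def forward(self, x):
        # concatenating the input with every branch fuses several receptive fields at once
        return torch.cat([x] + [b(x) for b in self.branches], dim=1)
```

Placing such a module in front of a detection header leaves the rest of the network untouched, which is why it can be dropped into YOLOv3 with only a slight speed cost.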


2019 ◽  
Vol 16 (04) ◽  
pp. 1950015
Author(s):  
Xiaoyan Wang ◽  
Xingyu Zhong ◽  
Ming Xia ◽  
Weiwei Jiang ◽  
Xiaojie Huang ◽  
...  

Localization of the vessel region of interest (ROI) in medical images provides an interactive approach that can assist doctors in evaluating carotid artery diseases. Accurate vessel detection is a prerequisite for subsequent procedures such as wall segmentation, plaque identification, and 3D reconstruction. Deep learning models such as CNNs have been widely used in medical image processing and achieve state-of-the-art performance. Faster R-CNN is one of the most representative and successful methods for object detection. Using the outputs of feature maps from different layers has been shown to improve detection performance; however, the common method is to ensemble the outputs of different layers directly, without considering the special characteristics and different importance of each layer. In this work, we introduce a new network named Attention Layer R-CNN (AL R-CNN) and use it for automatic carotid artery detection, integrating a new module named Attention Layer Part (ALP) into a basic Faster R-CNN system to better assemble feature maps from different layers. Experimental results on a carotid dataset show that our method surpasses other state-of-the-art object detection systems.
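One illustrative way to weight feature maps from different layers instead of ensembling them directly, in the spirit of the ALP module; the gating design below (a softmax over per-layer scalars predicted from pooled features) is a hypothetical stand-in, not the published implementation.

```python
import torch
import torch.nn as nn

class LayerAttentionFusion(nn.Module):
    """Learns per-layer importance weights before fusing multi-layer features (a sketch)."""
    def __init__(self, num_layers, channels):
        super().__init__()
        # one scalar gate per layer, predicted from globally pooled features
        self.gate = nn.Sequential(
            nn.Linear(num_layers * channels, num_layers),
            nn.Softmax(dim=-1),
        )

    def forward(self, feats):
        # feats: list of (B, C, H, W) maps already resized to a common H, W
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)  # (B, L*C)
        w = self.gate(pooled)                                           # (B, L)
        stacked = torch.stack(feats, dim=1)                             # (B, L, C, H, W)
        return (w[:, :, None, None, None] * stacked).sum(dim=1)        # weighted fusion
```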


2021 ◽  
Author(s):  
Da-Ren Chen ◽  
Wei-Min Chiu

Machine learning techniques have been used to increase the detection accuracy of cracks in road surfaces. Most studies fail to consider variable illumination conditions on the target of interest (ToI) and focus only on detecting the presence or absence of road cracks. This paper proposes a new road crack detection method, IlumiCrack, which integrates Gaussian mixture models (GMMs) and object detection CNN models. This work provides the following contributions: (1) For the first time, a large-scale road crack image dataset covering a range of illumination conditions (e.g., day and night) is prepared using a dashcam. (2) Based on GMMs, experimental evaluations on two to four brightness levels are conducted to find the optimal classification. (3) The IlumiCrack framework integrates state-of-the-art object detection methods with a CNN to classify road crack images into eight types with high accuracy. Experimental results show that IlumiCrack outperforms state-of-the-art R-CNN object detection frameworks.
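A sketch of classifying images into brightness levels with a Gaussian mixture model, assuming the mean pixel intensity is the feature being modeled; the feature choice, component count, and function names are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_brightness_gmm(images, n_levels=3):
    """Fit a GMM over per-image mean intensity (images: grayscale arrays in [0, 255])."""
    feats = np.array([[img.mean()] for img in images])  # one brightness feature per image
    return GaussianMixture(n_components=n_levels, random_state=0).fit(feats)

def brightness_level(gmm, img):
    # index of the most likely brightness component for a single image
    return int(gmm.predict(np.array([[img.mean()]]))[0])
```

Sweeping `n_levels` from 2 to 4 and comparing downstream detection accuracy would mirror the evaluation the abstract describes.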


Author(s):  
Tam V. Nguyen ◽  
Luoqi Liu

Salient object detection has increasingly become a popular topic in cognitive and computational sciences, including computer vision and artificial intelligence research. In this paper, we propose integrating semantic priors into the salient object detection process. Our algorithm consists of three basic steps. First, the explicit saliency map is obtained from semantic segmentation refined by explicit saliency priors learned from the data. Next, the implicit saliency map is computed by a trained model that maps implicit saliency priors embedded in regional features to saliency values. Finally, the explicit and implicit saliency maps are adaptively fused to form a pixel-accurate saliency map that uniformly covers the objects of interest. We evaluate the proposed framework on two challenging datasets, ECSSD and HKU-IS. Extensive experimental results demonstrate that our method outperforms other state-of-the-art methods.
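A minimal sketch of adaptively fusing two saliency maps with a learned per-pixel mixing weight; the small fusion network here is a hypothetical stand-in for the paper's adaptive fusion step, shown only to make the final step concrete.

```python
import torch
import torch.nn as nn

class AdaptiveSaliencyFusion(nn.Module):
    """Per-pixel convex combination of two candidate saliency maps (a sketch)."""
    def __init__(self):
        super().__init__()
        # predicts a per-pixel mixing weight from the two candidate maps
        self.mix = nn.Sequential(
            nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, explicit_map, implicit_map):
        # both maps: (B, 1, H, W) with values in [0, 1]
        a = self.mix(torch.cat([explicit_map, implicit_map], dim=1))
        return a * explicit_map + (1 - a) * implicit_map
```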


2020 ◽  
Vol 34 (07) ◽  
pp. 12460-12467
Author(s):  
Liang Xie ◽  
Chao Xiang ◽  
Zhengxu Yu ◽  
Guodong Xu ◽  
Zheng Yang ◽  
...  

LIDAR point clouds and RGB images are both essential for 3D object detection, and many state-of-the-art 3D detection algorithms are dedicated to fusing these two types of data effectively. However, fusion methods based on the bird's-eye view (BEV) or voxel format are not accurate. In this paper, we propose a novel fusion approach named the Point-based Attentive Cont-conv Fusion (PACF) module, which fuses multi-sensor features directly on 3D points. In addition to continuous convolution, we add Point-Pooling and Attentive Aggregation to make the fused features more expressive. Moreover, based on the PACF module, we propose a 3D multi-sensor multi-task network called Pointcloud-Image R-CNN (PI-RCNN for short), which handles both image segmentation and 3D object detection. PI-RCNN employs a segmentation sub-network to extract full-resolution semantic feature maps from images and then fuses the multi-sensor features via the PACF module. Benefiting from the effectiveness of the PACF module and the expressive semantic features from the segmentation module, PI-RCNN achieves large improvements in 3D object detection. We demonstrate the effectiveness of the PACF module and PI-RCNN on the KITTI 3D detection benchmark, where our method achieves state-of-the-art results on the 3D AP metric.
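A heavily simplified sketch of the attentive-aggregation idea: at each 3D point, a lidar feature and an image feature sampled at that point are weighted by learned attention scores before being summed. This illustrates per-point attentive fusion only and is not the published PACF module (continuous convolution and Point-Pooling are omitted).

```python
import torch
import torch.nn as nn

class AttentiveAggregation(nn.Module):
    """Attention-weighted fusion of point and image features at each 3D point (a sketch)."""
    def __init__(self, point_dim, img_dim, out_dim):
        super().__init__()
        self.proj_point = nn.Linear(point_dim, out_dim)
        self.proj_img = nn.Linear(img_dim, out_dim)
        self.score = nn.Linear(out_dim, 1)  # per-source attention score

    def forward(self, point_feat, img_feat):
        # point_feat: (N, point_dim); img_feat: (N, img_dim), sampled at each point
        srcs = torch.stack([self.proj_point(point_feat),
                            self.proj_img(img_feat)], dim=1)    # (N, 2, out_dim)
        w = torch.softmax(self.score(torch.tanh(srcs)), dim=1)  # (N, 2, 1)
        return (w * srcs).sum(dim=1)                            # (N, out_dim)
```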


2020 ◽  
Vol 2020 ◽  
pp. 1-18 ◽  
Author(s):  
Nhat-Duy Nguyen ◽  
Tien Do ◽  
Thanh Duc Ngo ◽  
Duy-Dinh Le

Small object detection is an interesting topic in computer vision. With the rapid development of deep learning, it has drawn the attention of many researchers, who have proposed innovations such as region proposals, grid-cell division, multiscale feature maps, and new loss functions. As a result, the performance of object detection has recently improved significantly. However, most state-of-the-art detectors, in both one-stage and two-stage approaches, struggle to detect small objects. In this study, we evaluate current state-of-the-art deep learning models from both approaches, namely Fast R-CNN, Faster R-CNN, RetinaNet, and YOLOv3, and provide a thorough assessment of their advantages and limitations. Specifically, we run the models with different backbones on different datasets with multiscale objects to find out what types of objects suit each model and backbone. Extensive empirical evaluation was conducted on two standard datasets: a small object dataset and a filtered dataset from PASCAL VOC 2007. Finally, comparative results and analyses are presented.


2020 ◽  
Vol 11 ◽  
Author(s):  
Hao Lu ◽  
Zhiguo Cao

Plant counting runs through almost every stage of agricultural production, from seed breeding, germination, cultivation, fertilization, and pollination to yield estimation and harvesting. With the prevalence of digital cameras, graphics processing units, and deep learning-based computer vision technology, plant counting has gradually shifted from traditional manual observation to vision-based automated solutions. One popular solution is a state-of-the-art object detection technique called Faster R-CNN, where plant counts are estimated from the number of detected bounding boxes; it has become a standard configuration for many plant counting systems in plant phenotyping. Faster R-CNN, however, is computationally expensive, particularly when dealing with high-resolution images. Unfortunately, high-resolution imagery is frequently used in modern plant phenotyping platforms such as unmanned aerial vehicles, engendering inefficient image analysis. Such inefficiency largely limits the throughput of a phenotyping system. The goal of this work is hence to provide an effective and efficient tool for high-throughput plant counting from high-resolution RGB imagery. In contrast to conventional object detection, we advocate another promising paradigm termed object counting, where plant counts are directly regressed from images without detecting bounding boxes. In this work, by profiling the computational bottleneck, we implement a fast version of a state-of-the-art plant counting model, TasselNetV2, with several minor yet effective modifications, and provide insights into why these modifications make sense. This fast version, TasselNetV2+, runs an order of magnitude faster than TasselNetV2, achieving around 30 fps at an image resolution of 1980 × 1080 while retaining the same level of counting accuracy. We validate its effectiveness on three plant counting tasks: wheat ear counting, maize tassel counting, and sorghum head counting. To encourage the use of this tool, our implementation has been made available online at https://tinyurl.com/TasselNetV2plus.
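A rough sketch of the counting-by-regression paradigm the abstract contrasts with detection: a small CNN regresses a local count for each image patch, and patch counts are summed into an image-level count. The layer sizes and patch geometry are illustrative assumptions, not the TasselNetV2+ architecture.

```python
import torch
import torch.nn as nn

class LocalCountRegressor(nn.Module):
    """Regresses per-patch counts and sums them into an image count (a sketch)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # stride == kernel size: non-overlapping patches, so plain summation
        # of the local counts yields the image-level count directly
        self.counter = nn.Conv2d(32, 1, 8, stride=8)

    def forward(self, x):
        local_counts = torch.relu(self.counter(self.features(x)))  # (B, 1, h, w)
        return local_counts.sum(dim=(1, 2, 3))                     # one count per image
```

Because no bounding boxes are decoded and no non-maximum suppression is run, a regressor like this does far less work per image than a detector, which is the efficiency argument behind the counting paradigm.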


2019 ◽  
Vol 2019 ◽  
pp. 1-16
Author(s):  
Jiangfan Feng ◽  
Fanjie Wang ◽  
Siqin Feng ◽  
Yongrong Peng

Convolutional neural network (CNN)-based object detection has achieved incredible success. However, existing CNN-based algorithms struggle to detect small-scale objects, whose responses may be lost by the time the feature maps reach a certain depth, and it is common for the scale of objects (such as cars, buses, and pedestrians) in traffic images and videos to vary greatly. In this paper, we present a 32-layer multibranch convolutional neural network named MBNet for fast object detection in traffic scenes. Our model utilizes three detection branches, in which feature maps of size 16 × 16, 32 × 32, and 64 × 64 are used, respectively, to optimize detection for large-, medium-, and small-scale objects. By means of a multitask loss function, our model can be trained end to end. The experimental results show that our model achieves state-of-the-art performance in terms of precision and recall, and its detection speed (up to 33 fps) is fast enough to meet real-time industrial requirements.
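An illustrative skeleton of the multi-branch idea: detection heads attached to feature maps at three resolutions, each specializing in one object scale. The backbone is omitted and the head shapes are assumptions, not MBNet's exact layers.

```python
import torch
import torch.nn as nn

class MultiBranchHeads(nn.Module):
    """One detection head per feature-map scale (a sketch)."""
    def __init__(self, channels, num_outputs):
        super().__init__()
        # heads for 16x16 (large), 32x32 (medium), and 64x64 (small objects)
        self.heads = nn.ModuleList(
            nn.Conv2d(channels, num_outputs, 3, padding=1) for _ in range(3)
        )

    def forward(self, f16, f32, f64):
        # f16/f32/f64: backbone feature maps of the three spatial sizes
        return [head(f) for head, f in zip(self.heads, (f16, f32, f64))]
```

Coarse maps see large receptive fields and so suit large objects, while the 64 × 64 map preserves the fine responses small objects need, which is the rationale for dedicating a branch to each scale.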


2020 ◽  
Vol 12 (6) ◽  
pp. 989 ◽  
Author(s):  
Hao Su ◽  
Shunjun Wei ◽  
Shan Liu ◽  
Jiadian Liang ◽  
Chen Wang ◽  
...  

Instance segmentation in high-resolution (HR) remote sensing imagery is one of the most challenging tasks, more difficult than object detection or semantic segmentation. It aims to predict class labels and pixel-wise instance masks that locate the instances in an image. However, few methods are currently suitable for instance segmentation in HR remote sensing images, and the complex backgrounds of such images make the task even harder. In this article, a novel instance segmentation approach for HR remote sensing imagery based on Cascade Mask R-CNN is proposed, called the high-quality instance segmentation network (HQ-ISNet). In this scheme, HQ-ISNet exploits an HR feature pyramid network (HRFPN) to fully utilize multi-level feature maps and to maintain HR feature maps for instance segmentation of remote sensing images. Next, to refine the mask information flow between mask branches, instance segmentation network version 2 (ISNetV2) is proposed to further improve mask prediction accuracy. We then construct a new, more challenging dataset based on the synthetic aperture radar ship detection dataset (SSDD) and the Northwestern Polytechnical University very-high-resolution 10-class geospatial object detection dataset (NWPU VHR-10) for instance segmentation in remote sensing images, which can serve as a benchmark for evaluating instance segmentation algorithms on HR remote sensing images. Finally, extensive experimental analyses and comparisons on SSDD and NWPU VHR-10 show that (1) the HRFPN makes the predicted instance masks more accurate and effectively enhances instance segmentation performance on HR remote sensing imagery; (2) ISNetV2 is effective and further improves mask prediction accuracy; (3) the proposed HQ-ISNet framework is more accurate for instance segmentation in remote sensing imagery than existing algorithms.
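A sketch of the high-resolution feature-pyramid intuition: upsample every pyramid level to the finest resolution and concatenate, so small instances keep high-resolution support. This mirrors HRNet-style fusion under stated assumptions and is not the exact HRFPN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighResFusion(nn.Module):
    """Fuses all pyramid levels at the finest spatial resolution (a sketch)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # in_channels: list of channel counts, one per pyramid level
        self.reduce = nn.Conv2d(sum(in_channels), out_channels, 1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps, ordered finest to coarsest
        h, w = feats[0].shape[2:]
        up = [feats[0]] + [
            F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
            for f in feats[1:]
        ]
        return self.reduce(torch.cat(up, dim=1))  # single HR map with multi-level context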


2021 ◽  
Vol 11 (6) ◽  
pp. 2599
Author(s):  
Felix Nobis ◽  
Felix Fent ◽  
Johannes Betz ◽  
Markus Lienkamp

State-of-the-art 3D object detection for autonomous driving is achieved by processing lidar sensor data with deep learning methods. However, the detection quality of the state of the art is still far from enabling safe driving in all conditions, so additional sensor modalities are needed to increase the confidence and robustness of the overall detection result. Researchers have recently explored radar data as an additional input source for universal 3D object detection. This paper proposes artificial neural network architectures to segment sparse radar point cloud data; segmentation is an intermediate step towards radar object detection as a complementary concept to lidar object detection. Conceptually, we adapt Kernel Point Convolution (KPConv) layers for radar data. Additionally, we introduce a long short-term memory (LSTM) variant based on KPConv layers to exploit the information content in the time dimension of radar data. This is motivated by classical radar processing, where tracking features over time is imperative to generate confident object proposals. We benchmark several variants of the network on the public nuScenes data set against a state-of-the-art PointNet-based approach. The performance of the networks is limited by the quality of the publicly available data: radar data and radar-label quality are of great importance to the training and evaluation of machine learning models. Therefore, the advantages and disadvantages of the available data set with regard to its radar data are discussed in detail, and the need for a radar-focused data set for object detection is expressed. We assume that higher segmentation scores would be achievable with better-quality data for all the models compared, and that differences between the models would manifest more clearly. To facilitate research with additional radar data, the modular code for this research will be made available to the public.
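A heavily simplified sketch of exploiting the time dimension of radar data: per-sweep point features are pooled and passed through an LSTM, echoing the paper's KPConv+LSTM idea at a conceptual level only (KPConv itself is replaced by a plain per-point MLP here for brevity, and all shapes are assumptions).

```python
import torch
import torch.nn as nn

class TemporalRadarEncoder(nn.Module):
    """Pools per-sweep radar point features and tracks them over time (a sketch)."""
    def __init__(self, point_dim=4, hidden=64):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(point_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, N, point_dim) -- T radar sweeps of N points each
        per_point = self.point_mlp(frames)        # (B, T, N, hidden)
        per_frame = per_point.max(dim=2).values   # permutation-invariant pooling per sweep
        out, _ = self.lstm(per_frame)             # temporal context across sweeps
        return out[:, -1]                         # feature describing the latest sweep
```

Accumulating evidence across sweeps in this way parallels classical radar tracking, where a single sparse sweep rarely supports a confident object proposal on its own.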

