A Multipath Fusion Strategy Based Single Shot Detector

Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1360
Author(s):  
Shuyi Qu ◽  
Kaizhu Huang ◽  
Amir Hussain ◽  
Yannis Goulermas

Object detection has wide applications in intelligent systems and sensors. Compared with two-stage detectors, recent one-stage counterparts run more efficiently with comparable accuracy, satisfying the requirement of real-time processing. To further improve the accuracy of the one-stage single shot detector (SSD), we propose a novel Multi-Path fusion Single Shot Detector (MPSSD). Different from other feature fusion methods, we exploit the connections among different-scale representations in a pyramid manner. We propose a feature fusion module that generates new feature pyramids from the multiscale features in SSD, and these pyramids are sent to our pyramid aggregation module to generate the final features. The enhanced features carry both localization and semantic information, thus improving detection performance at little computational cost. A series of experiments on three benchmark datasets, PASCAL VOC2007, VOC2012, and MS COCO, demonstrates that our approach outperforms many state-of-the-art detectors both qualitatively and quantitatively. In particular, for input images of size 512 × 512, our method attains a mean Average Precision (mAP) of 81.8% on the VOC2007 test set, 80.3% on the VOC2012 test set, and 33.1% on COCO test-dev 2015.
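As a rough illustration of the pyramid-style fusion the abstract describes, a minimal PyTorch sketch might fuse a shallow, high-resolution SSD feature map with a deeper, semantically richer one; the module structure, channel sizes, and names here are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    """Hypothetical sketch: fuse a shallow (high-resolution) and a deep
    (low-resolution) SSD feature map into one pyramid level."""
    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        self.reduce_shallow = nn.Conv2d(shallow_ch, out_ch, kernel_size=1)
        self.reduce_deep = nn.Conv2d(deep_ch, out_ch, kernel_size=1)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        # Upsample the deep (semantic) map to the shallow (localization) resolution.
        deep_up = F.interpolate(self.reduce_deep(deep), size=shallow.shape[-2:],
                                mode='bilinear', align_corners=False)
        return self.smooth(self.reduce_shallow(shallow) + deep_up)

# Example: fuse conv4_3-like (512 ch, 64x64) and conv7-like (1024 ch, 32x32) maps.
ffm = FeatureFusionModule(512, 1024, 256)
fused = ffm(torch.randn(1, 512, 64, 64), torch.randn(1, 1024, 32, 32))
print(fused.shape)  # torch.Size([1, 256, 64, 64])
```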

Author(s):  
Songbo Chen ◽  
Chao Su ◽  
Zhenxing Kuang ◽  
Ye Ouyang ◽  
Xiang Gong

In a complex background, insulator faults are a main factor behind transmission accidents. With the wide application of unmanned aerial vehicle (UAV) photography, digital image recognition technology has been further developed to detect the position and faults of insulators. There are two mainstream deep learning approaches: "two-stage" methods, exemplified by region-based convolutional neural networks (R-CNN), and "one-stage" methods, exemplified by the single-shot multibox detector (SSD); both pose many difficulties and challenges. Due to the complex background and the various types of insulators, few researchers apply "two-stage" methods to detect insulator faults in aerial images, while the detection performance of "one-stage" methods is poor for small targets because of their smaller scope of vision and lower accuracy in target detection. In this article, the authors propose an accurate and real-time method for small object detection, exemplified by insulator localization and fault inspection, based on a mixed-grouped fire single-shot multibox detector (MGFSSD). Building on the SSD and deconvolutional single-shot detector (DSSD) networks, the MGFSSD algorithm addresses the inaccurate recognition of small objects in SSD and the complex structure and long running time of DSSD. To resolve the repeated detection of some targets and the missed detection of small targets in the original SSD, the authors describe how to design an effective and lightweight feature fusion module that improves the performance of traditional SSDs, so that the classifier network can take full advantage of the relationships between the pyramid-layer features without changing the base network closest to the input data. The data processing results show that the method can effectively detect insulator faults. The average detection accuracy of insulator faults is 92.4% and the average recall rate is 91.2%.
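The exact MGFSSD module is not given in the abstract; a minimal sketch of the general idea, enriching one SSD pyramid level with its neighbors through cheap grouped convolutions while leaving the backbone untouched, might look like this (all names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightFusion(nn.Module):
    """Sketch only: enrich a middle pyramid level with its neighbours
    using a cheap grouped convolution, without touching the backbone."""
    def __init__(self, ch, groups=4):
        super().__init__()
        # A grouped 3x3 convolution keeps the fusion lightweight.
        self.mix = nn.Conv2d(3 * ch, ch, kernel_size=3, padding=1, groups=groups)

    def forward(self, lower, current, upper):
        # Bring the neighbouring levels to the current level's resolution.
        lower_down = F.adaptive_max_pool2d(lower, current.shape[-2:])
        upper_up = F.interpolate(upper, size=current.shape[-2:], mode='bilinear',
                                 align_corners=False)
        return self.mix(torch.cat([lower_down, current, upper_up], dim=1))

fusion = LightweightFusion(ch=256)
out = fusion(torch.randn(1, 256, 64, 64),
             torch.randn(1, 256, 32, 32),
             torch.randn(1, 256, 16, 16))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```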


Author(s):  
Shixiao Wu ◽  
Chengcheng Guo ◽  
Xinghuan Wang

Background: Excess prostate tissue is trimmed near the prostate capsula boundary during transurethral plasma kinetic enucleation of the prostate (PKEP) and transurethral bipolar plasmakinetic resection of the prostate (PKRP) surgeries. If too much tissue is removed, a prostate capsula perforation can occur. As such, real-time, accurate prostate capsula (PC) detection is critical for preventing these perforations. Objective: This study investigated the potential of image denoising, image dimension reduction, and feature fusion for improving real-time prostate capsula detection, with two objectives. First, the paper mainly studied feature selection and input dimension reduction. Second, image denoising methods were evaluated, as they are of paramount importance to transient stability assessment based on neural networks. Method: Two new feature fusion techniques were proposed: the maxpooling bilinear interpolation single-shot multibox detector (PBSSD) and the bilinear interpolation single-shot multibox detector (BSSD). Before the original images were sent to the neural network, they were processed by principal component analysis (PCA) and an adaptive median filter (AMF) for dimension reduction and image denoising. Results: The results showed that applying PCA and AMF with PBSSD increased the mean average precision (mAP) on prostate capsula images by 8.55%, reaching 80.15%, compared with the single-shot multibox detector (SSD) alone. Applying PCA with BSSD increased the mAP on prostate capsula images by 4.6% compared with SSD alone. Conclusion: Compared with other methods, ours proved more accurate for real-time prostate capsula detection. The improved mAP results suggest that the proposed approaches are powerful tools for improving SSD networks.
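A minimal sketch of the described preprocessing pipeline is shown below, with SciPy's plain median filter standing in for the adaptive median filter (AMF) and PCA reconstruction used for dimension reduction; all parameter values are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import median_filter
from sklearn.decomposition import PCA

def preprocess(image, n_components=64, filter_size=3):
    """Sketch of the paper's preprocessing idea: denoise, then reduce
    dimensionality via PCA reconstruction. A plain median filter stands in
    for the adaptive median filter (AMF); parameters are illustrative."""
    denoised = median_filter(image, size=filter_size)
    # Treat each image row as a sample, keep the top principal components,
    # and reconstruct a lower-dimensional approximation of the image.
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(denoised.astype(np.float64))
    return pca.inverse_transform(reduced)

frame = np.random.randint(0, 256, size=(512, 512)).astype(np.uint8)
out = preprocess(frame)
print(out.shape)  # (512, 512)
```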


2020 ◽  
Vol 12 (4) ◽  
pp. 729 ◽  
Author(s):  
Ruchan Dong ◽  
Dazhuan Xu ◽  
Lichen Jiao ◽  
Jin Zhao ◽  
Jungang An

Current scene classification for high-resolution remote sensing images usually uses deep convolutional neural networks (DCNNs) to extract extensive features and adopts a support vector machine (SVM) as the classifier. DCNNs exploit deep features well but ignore valuable shallow features such as texture and directional information, while SVMs can hardly train on large numbers of samples efficiently. This paper proposes a fast deep perception network (FDPResnet) that integrates a DCNN and the Broad Learning System (BLS), a novel and effective learning system, to extract both deep and shallow features, and encapsulates a designed DPModel to fuse the two kinds of features. FDPResnet first extracts the shallow and deep scene features of a remote sensing image through a model pre-trained on the 101-layer residual network (ResNet-101). It then inputs the two kinds of features into a designed deep perception module (DPModel) to obtain a new set of feature vectors that describe both the higher-level semantic and lower-level spatial information of the image. The DPModel is the key module, responsible for dimension reduction and feature fusion. Finally, the new feature vector is input into the BLS for training and classification, yielding a satisfactory classification result. A series of experiments conducted on the challenging NWPU-RESISC45 remote sensing image dataset demonstrates that our approach outperforms several popular state-of-the-art deep learning methods and delivers highly accurate scene classification within a shorter running time.
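As a rough sketch of the two-branch extraction the abstract describes, one might tap a shallow stage and the final stage of a pretrained ResNet-101 and concatenate pooled features; the simple concatenation below is only a stand-in for the paper's DPModel, and the layer choices are assumptions:

```python
import torch
import torchvision.models as models

# Take a shallow stage and the deep stage of a pretrained ResNet-101,
# pool both, and concatenate. The fusion here is a simple stand-in for
# the paper's DPModel; which layers to tap is an assumption.
resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
resnet.eval()

stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                           resnet.maxpool, resnet.layer1)   # shallow features
deep = torch.nn.Sequential(resnet.layer2, resnet.layer3, resnet.layer4)
pool = torch.nn.AdaptiveAvgPool2d(1)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)
    shallow_feat = stem(x)              # texture/direction-like cues
    deep_feat = deep(shallow_feat)      # high-level semantics
    fused = torch.cat([pool(shallow_feat).flatten(1),
                       pool(deep_feat).flatten(1)], dim=1)
print(fused.shape)  # torch.Size([1, 2304]) = 256 + 2048 channels
```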


Sensors ◽  
2021 ◽  
Vol 21 (12) ◽  
pp. 4184
Author(s):  
Zhiwei Cao ◽  
Huihua Yang ◽  
Juan Zhao ◽  
Shuhong Guo ◽  
Lingqiao Li

Multispectral pedestrian detection, which combines a color stream and a thermal stream, is essential under insufficient illumination because the fusion of the two streams provides complementary information for detecting pedestrians with deep convolutional neural networks (CNNs). In this paper, we introduce and adapt the simple and efficient one-stage YOLOv4 to replace the current state-of-the-art two-stage Faster R-CNN for multispectral pedestrian detection and to directly predict bounding boxes with confidence scores. To further improve detection performance, we analyze existing multispectral fusion methods and propose a novel multispectral channel feature fusion (MCFF) module for integrating the features from the color and thermal streams according to the illumination conditions. Moreover, several fusion architectures, namely Early Fusion, Halfway Fusion, Late Fusion, and Direct Fusion, are carefully designed based on the MCFF to transfer feature information from bottom to top at different stages. Finally, experimental results on the KAIST and Utokyo pedestrian benchmarks show that Halfway Fusion achieves the best performance of all architectures and that the MCFF can adaptively fuse features from the two modalities. The log-average miss rates (MR) on the two benchmarks under reasonable settings are 4.91% and 23.14%, respectively.
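A minimal sketch of an illumination-gated fusion in the spirit of the MCFF is given below; the gating design (a scalar weight predicted from the color stream) is an assumption, not the paper's exact module:

```python
import torch
import torch.nn as nn

class IlluminationGatedFusion(nn.Module):
    """Sketch in the spirit of the MCFF idea: weight colour and thermal
    features by a predicted illumination score. The gating design is an
    assumption, not the paper's exact module."""
    def __init__(self, ch):
        super().__init__()
        # Predict a scalar illumination weight from the colour stream.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, 1, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, rgb_feat, thermal_feat):
        w = self.gate(rgb_feat)          # ~1 in daylight, ~0 in darkness
        return w * rgb_feat + (1 - w) * thermal_feat

fuse = IlluminationGatedFusion(256)
out = fuse(torch.randn(2, 256, 40, 32), torch.randn(2, 256, 40, 32))
print(out.shape)  # torch.Size([2, 256, 40, 32])
```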


Author(s):  
Qijie Zhao ◽  
Tao Sheng ◽  
Yongtao Wang ◽  
Zhi Tang ◽  
Ying Chen ◽  
...  

Feature pyramids are widely exploited by both state-of-the-art one-stage object detectors (e.g., DSSD, RetinaNet, RefineDet) and two-stage object detectors (e.g., Mask R-CNN, DetNet) to alleviate the problem arising from scale variation across object instances. Although these detectors with feature pyramids achieve encouraging results, they have some limitations because they simply construct the feature pyramid according to the inherent multiscale, pyramidal architecture of backbones originally designed for the object classification task. In this work, we present the Multi-Level Feature Pyramid Network (MLFPN) to construct more effective feature pyramids for detecting objects of different scales. First, we fuse multi-level features (i.e., multiple layers) extracted by the backbone as the base feature. Second, we feed the base feature into a block of alternating joint Thinned U-shape Modules and Feature Fusion Modules and exploit the decoder layers of each U-shape module as features for detecting objects. Finally, we gather up the decoder layers with equivalent scales (sizes) to construct a feature pyramid for object detection, in which every feature map consists of layers (features) from multiple levels. To evaluate the effectiveness of the proposed MLFPN, we design and train a powerful end-to-end one-stage object detector, called M2Det, by integrating it into the architecture of SSD, and achieve better detection performance than state-of-the-art one-stage detectors. Specifically, on the MS COCO benchmark, M2Det achieves an AP of 41.0 at 11.8 FPS with a single-scale inference strategy and an AP of 44.2 with a multi-scale inference strategy, which are the new state-of-the-art results among one-stage detectors. The code will be made available at https://github.com/qijiezhao/M2Det.
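A heavily reduced sketch of a Thinned U-shape Module is shown below: an encoder-decoder whose decoder maps serve as multi-scale features, as the abstract describes. Depth, widths, and layer choices are illustrative, not M2Det's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThinnedUShape(nn.Module):
    """Very reduced sketch of a Thinned U-shape Module (TUM): an
    encoder-decoder whose decoder maps become multi-scale features."""
    def __init__(self, ch, levels=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(levels))
        self.up = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=1) for _ in range(levels))

    def forward(self, x):
        skips, feats = [x], []
        for d in self.down:                                  # encoder path
            skips.append(F.relu(d(skips[-1])))
        y = skips[-1]
        for u, skip in zip(self.up, reversed(skips[:-1])):   # decoder path
            y = F.interpolate(y, size=skip.shape[-2:], mode='nearest')
            y = F.relu(u(y)) + skip
            feats.append(y)              # decoder layers -> pyramid features
        return feats

tum = ThinnedUShape(128)
pyramid = tum(torch.randn(1, 128, 64, 64))
print([f.shape[-1] for f in pyramid])  # [16, 32, 64]
```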


2020 ◽  
Vol 10 (18) ◽  
pp. 6382
Author(s):  
Hongwei Wang ◽  
Dahua Li ◽  
Yu Song ◽  
Qiang Gao ◽  
Zhaoyang Wang ◽  
...  

Feature fusion is widely used in various neural-network-based visual recognition tasks, such as object detection, to enhance the quality of feature representation. It is common practice for both one-stage and two-stage object detectors to implement feature fusion in feature pyramid networks (FPN) to enhance their capacity to detect objects of different scales. In this work, we propose a novel and efficient feature fusion unit, referred to as the Split and Combine (SC) block, which splits the input feature maps into several parts, processes these sub-feature maps with different emphasis, and finally concatenates the outputs one by one. The SC block implicitly encourages the network to focus on features that are more important to the task, thus improving network efficiency and reducing inference computation. To validate our analysis and conclusions, a backbone network and an FPN employing this technique are assembled into a one-stage detector and evaluated on the MS COCO dataset. With the newly introduced SC block and other novel training tricks, our detector achieves a good speed-accuracy trade-off on the COCO test-dev set, with 37.1% AP (average precision) at 51 FPS and 38.9% AP at 40 FPS.
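A minimal sketch of the split-process-combine pattern the abstract describes follows; the concrete branch designs (kernel sizes growing per part) and the gradual combination rule are assumptions:

```python
import torch
import torch.nn as nn

class SCBlock(nn.Module):
    """Sketch of the Split and Combine idea: split channels, process the
    parts with different emphasis, and concatenate them one by one."""
    def __init__(self, ch, parts=4):
        super().__init__()
        assert ch % parts == 0
        self.parts = parts
        sub = ch // parts
        # Later parts get larger receptive fields (different emphasis).
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(sub, sub, 2 * i + 1, padding=i), nn.ReLU())
            for i in range(parts))

    def forward(self, x):
        chunks = torch.chunk(x, self.parts, dim=1)
        outs, prev = [], None
        for chunk, branch in zip(chunks, self.branches):
            y = branch(chunk if prev is None else chunk + prev)
            outs.append(y)
            prev = y                     # gradual one-by-one combination
        return torch.cat(outs, dim=1)

block = SCBlock(256)
print(block(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```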


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5800
Author(s):  
Long Zhao ◽  
Meng Zhu ◽  
Honge Ren ◽  
Lingjixuan Xue

It is difficult to achieve all-weather visual object tracking in an open environment using only single-modality data. Because RGB and thermal infrared (TIR) data are complementary in various complex environments, a more robust object tracking framework can be obtained using video data from these two modalities. How RGB and TIR data are fused is the core element determining the performance of an RGB-T object tracking method, and existing RGB-T trackers have not solved this problem well. To address the low utilization of information within a single modality in aggregation-based methods and between the two modalities in alignment-based methods, we used DiMP as the baseline tracker to design an RGB-T tracking framework, channel-exchanging DiMP (CEDiMP), based on channel exchanging. CEDiMP performs dynamic channel exchanging between the sub-networks of the different modalities during feature fusion while adding hardly any parameters, and the deep features generated by this channel-exchanging fusion have stronger expressive ability. In addition, to address the poor generalization of existing RGB-T tracking methods and their weakness in long-term tracking, we further train CEDiMP on the synthetic dataset LaSOT-RGBT. Extensive experiments demonstrate the effectiveness of the proposed model: CEDiMP achieves the best performance on two RGB-T object tracking benchmarks, GTOT and RGBT234, and performs outstandingly in generalization testing.
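A minimal sketch of parameter-free channel exchanging between two modality sub-networks is shown below, following the common formulation in which channels whose BatchNorm scaling factors have shrunk toward zero are replaced by the other modality's channels; the threshold value is an assumption:

```python
import torch
import torch.nn as nn

def channel_exchange(x_rgb, x_tir, bn_rgb, bn_tir, threshold=1e-2):
    """Sketch: replace channels whose BatchNorm scale is below a threshold
    with the other modality's channels. Parameter-free, as the abstract
    notes. With freshly initialized BN layers (scale = 1) no channels are
    exchanged; during training, sparsity regularization shrinks some scales."""
    mask_rgb = (bn_rgb.weight.abs() < threshold).view(1, -1, 1, 1)
    mask_tir = (bn_tir.weight.abs() < threshold).view(1, -1, 1, 1)
    out_rgb = torch.where(mask_rgb, x_tir, x_rgb)
    out_tir = torch.where(mask_tir, x_rgb, x_tir)
    return out_rgb, out_tir

bn_a, bn_b = nn.BatchNorm2d(64), nn.BatchNorm2d(64)
a, b = torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28)
a2, b2 = channel_exchange(a, b, bn_a, bn_b)
print(a2.shape, b2.shape)  # torch.Size([2, 64, 28, 28]) twice
```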


Author(s):  
Xinfang Liu ◽  
Xiushan Nie ◽  
Junya Teng ◽  
Li Lian ◽  
Yilong Yin

Moment localization in videos using natural language refers to finding the segment of a video most relevant to a given natural language query. Most existing methods require video segment candidates for matching with the query, which incurs extra computational cost, and they may fail to locate relevant moments of arbitrary length. To address these issues, we present a lightweight single-shot semantic matching network (SSMN) that avoids the complex computations required to match the query against segment candidates; the proposed SSMN can, in theory, locate moments of any length. In the SSMN, video features are first uniformly sampled to a fixed number, while the query sentence features are generated and enhanced by GloVe, long short-term memory (LSTM), and soft-attention modules. The video and sentence features are then fed into an enhanced cross-modal attention model to mine the semantic relationships between vision and language. Finally, a score predictor and a location predictor locate the start and end indexes of the queried moment. We evaluate the proposed method on two benchmark datasets, and the experimental results demonstrate that the SSMN outperforms state-of-the-art methods in both precision and efficiency.
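A rough sketch of two of the described steps, uniform sampling of video features and the score/location heads, follows; the head designs and the sampled length are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

def uniform_sample(video_feats, num=128):
    """Sample a variable-length video feature sequence down to a fixed
    length, as the abstract describes (the length 128 is an assumption)."""
    t = video_feats.shape[0]
    idx = torch.linspace(0, t - 1, steps=num).round().long()
    return video_feats[idx]

class Predictors(nn.Module):
    """Sketch of the two heads: per-position relevance scores plus start
    and end indexes for the queried moment. Head designs are assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)      # relevance of each position
        self.location = nn.Linear(dim, 2)   # start/end offsets

    def forward(self, fused):               # fused: (num, dim) cross-modal feats
        scores = self.score(fused).squeeze(-1).softmax(dim=0)
        center = scores.argmax()
        start, end = self.location(fused[center])
        return scores, start, end

feats = uniform_sample(torch.randn(500, 256))
scores, s, e = Predictors(256)(feats)
print(feats.shape, scores.shape)  # torch.Size([128, 256]) torch.Size([128])
```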


Author(s):  
Zhenying Xu ◽  
Ziqian Wu ◽  
Wei Fan

Defect detection in electroluminescence (EL) images of solar cells is the core step in the production and preparation of solar cell modules, ensuring conversion efficiency and a long service life of the cells. However, owing to its weak feature extraction capability for small defects, the traditional single-shot multibox detector (SSD) algorithm does not perform well in high-accuracy EL defect detection. Consequently, an improved SSD algorithm with modified feature fusion, in the framework of deep learning, is proposed to improve the recognition rate of multi-class EL defects. A dataset containing images with four different types of defects is established for EL inspection via rotation, denoising, and binarization. Following the idea of feature pyramid networks, the proposed algorithm greatly improves the detection accuracy for small-scale defects. An experimental study on EL defect detection shows the effectiveness of the proposed algorithm. Moreover, a comparison study shows that the proposed method outperforms other traditional detection methods, such as SIFT, Faster R-CNN, and YOLOv3, in detecting EL defects.
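A minimal sketch of the three listed dataset preparation steps on a grayscale EL image, using standard OpenCV calls, might look as follows (the rotation angle, blur size, and Otsu thresholding are illustrative assumptions):

```python
import cv2
import numpy as np

def prepare_el_sample(img, angle=90):
    """Sketch of the dataset preparation steps the abstract lists:
    rotation, denoising, and binarization of a grayscale EL image."""
    h, w = img.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(img, m, (w, h))            # rotation augmentation
    denoised = cv2.GaussianBlur(rotated, (5, 5), 0)     # denoising
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization
    return binary

sample = prepare_el_sample(np.random.randint(0, 256, (300, 300), dtype=np.uint8))
print(sample.shape, np.unique(sample))  # (300, 300) [  0 255]
```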


2021 ◽  
Vol 11 (3) ◽  
pp. 1064
Author(s):  
Jenq-Haur Wang ◽  
Yen-Tsang Wu ◽  
Long Wang

In social networks, users can easily share information and express their opinions. Given the huge amount of data posted by many users, it is difficult to search for relevant information. In addition to individual posts, it would be useful to recommend groups of people with similar interests. Past studies on user preference learning focused on single-modal features such as review contents or the demographic information of users. However, such information is usually not easy to obtain in most social media without explicit user feedback. In this paper, we propose a multimodal feature fusion approach to implicit user preference prediction that combines text and image features from user posts to recommend similar users in social media. First, we use a convolutional neural network (CNN) and a TextCNN model to extract image and text features, respectively. These features are then combined using early and late fusion methods as a representation of user preferences. Lastly, a list of users with the most similar preferences is recommended. Experimental results on real-world Instagram data show that the best performance is achieved with late fusion of the individual classification results for images and texts, with a best average top-k accuracy of 0.491. This validates the effectiveness of deep learning methods for fusing multimodal features to represent social user preferences. Further investigation is needed to verify the performance on different types of social media.
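As a minimal illustration of the two fusion strategies compared in the paper, early fusion concatenates modality features before classification, while late fusion combines the per-modality classifier outputs (the equal weighting below is an assumption):

```python
import numpy as np

def early_fusion(img_feat, txt_feat):
    """Early fusion: concatenate modality features before classification."""
    return np.concatenate([img_feat, txt_feat])

def late_fusion(img_probs, txt_probs, w=0.5):
    """Late fusion: combine per-modality classifier outputs; the paper
    reports this variant performing best. The weight w is illustrative."""
    return w * np.asarray(img_probs) + (1 - w) * np.asarray(txt_probs)

# Hypothetical class probabilities from a CNN (images) and TextCNN (posts).
print(late_fusion([0.7, 0.2, 0.1], [0.5, 0.4, 0.1]))  # [0.6 0.3 0.1]
```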

