Local Representation is Not Enough: Soft Point-Wise Transformer for Descriptor and Detector of Local Features

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/159 ◽

2021 ◽

Author(s):

Zihao Wang ◽

Xueyi Li ◽

Zhen Li

Keyword(s):

Image Matching ◽

State Of The Art ◽

Local Features ◽

Feature Representation ◽

Detection Accuracy ◽

Visual Localization ◽

Feature Maps ◽

Localization Accuracy ◽

Multi Level ◽

Soft Point

Significant progress has been made on descriptors and detectors of local features, but several challenging limitations remain, such as insufficient localization accuracy and non-discriminative description, especially in repetitive- or blank-texture regions, which have not been well addressed. Coarse feature representation and a limited receptive field are considered the main causes of these limitations. To address these issues, we propose a novel Soft Point-Wise Transformer for Descriptor and Detector, simultaneously mining long-range intrinsic and cross-scale dependencies of local features. Furthermore, our model leverages distinct transformers based on soft point-wise attention, substantially decreasing memory and computation complexity, especially for high-resolution feature maps. In addition, a multi-level decoder is constructed to guarantee high detection accuracy and discriminative description. Extensive experiments demonstrate that our model outperforms the existing state-of-the-art methods on image matching and visual localization benchmarks.


Semantics-Aligned Representation Learning for Person Re-Identification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6775 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11173-11180 ◽

Cited By ~ 3

Author(s):

Xin Jin ◽

Cuiling Lan ◽

Wenjun Zeng ◽

Guoqiang Wei ◽

Zhibo Chen

Keyword(s):

State Of The Art ◽

Representation Learning ◽

The State ◽

Feature Representation ◽

Texture Image ◽

Computationally Efficient ◽

Feature Maps ◽

Benchmark Datasets ◽

Texture Generation ◽

Base Network

Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representation through delicate supervision designs. Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantics aligned full texture image. We jointly train the SAN under the supervisions of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add Triplet ReID constraints over the feature maps as the perceptual losses. The decoder is discarded in the inference and thus our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve the state-of-the-art performances on the benchmark datasets CUHK03, Market1501, MSMT17, and the partial person reID dataset Partial REID.
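The abstract mentions Triplet ReID constraints over feature maps but does not give the formulation. As a minimal sketch of a standard margin-based triplet loss on plain feature vectors, assuming Euclidean distance and a hypothetical margin of 0.3 (neither is confirmed by the abstract):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: pull the same-identity pair together and push
    the different-identity pair apart by at least `margin`.
    The margin value is an illustrative assumption."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# The loss is zero when the negative is already farther than the positive by the margin.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

The loss only becomes positive when a different-identity sample lies closer to the anchor than the same-identity sample plus the margin, which is what makes it usable as a ranking constraint.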


ZoomInNet: A Novel Small Object Detector in Drone Images with Cross-Scale Knowledge Distillation

Remote Sensing ◽

10.3390/rs13061198 ◽

2021 ◽

Vol 13 (6) ◽

pp. 1198

Author(s):

Bi-Yuan Liu ◽

Huai-Xin Chen ◽

Zhou Huang ◽

Xing Liu ◽

Yun-Zhi Yang

Keyword(s):

Object Detection ◽

Feature Representation ◽

Detection Accuracy ◽

Small Object ◽

Feature Maps ◽

Ground Object ◽

Knowledge Distillation ◽

The Cross ◽

The Difference ◽

Small Object Detection

Drone-based object detection has been widely applied in ground object surveillance, urban patrol, and other fields. However, the dramatic scale changes and complex backgrounds of drone images usually result in weak feature representation of small objects, which makes it challenging to achieve high-precision object detection. To improve small object detection, this paper proposes a novel cross-scale knowledge distillation (CSKD) method, which enhances the features of small objects in a manner similar to image enlargement and is therefore termed ZoomInNet. First, based on an efficient feature pyramid network structure, the teacher and student networks are trained with images at different scales to introduce the cross-scale feature. Then, the proposed layer adaption (LA) and feature level alignment (FA) mechanisms are applied to align the feature sizes of the two models. After that, the adaptive key distillation point (AKDP) algorithm is used to find the crucial positions in the feature maps that need knowledge distillation. Finally, the position-aware L2 loss is used to measure the difference between feature maps from the cross-scale models, compressing cross-scale information into a single model. Experiments on the challenging Visdrone2018 dataset show that the proposed method draws on the advantages of image pyramid methods while avoiding their heavy computation, and significantly improves the detection accuracy of small objects. In addition, the comparison with mainstream methods shows that our method has the best performance in small object detection.
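The position-aware L2 loss is only named in the abstract. One plausible reading, sketched below, is a squared difference between student and teacher feature maps that is averaged only over the selected key positions. The binary `mask`, the averaging, and the function name are assumptions made for illustration, not the paper's definition:

```python
def position_aware_l2(student, teacher, mask):
    """Mean squared difference between two same-sized 2D feature maps,
    computed only at positions where `mask` is truthy.
    `mask` stands in for the key distillation points (hypothetical)."""
    total = 0.0
    count = 0
    for s_row, t_row, m_row in zip(student, teacher, mask):
        for s, t, m in zip(s_row, t_row, m_row):
            if m:
                total += (s - t) ** 2
                count += 1
    # With no selected positions there is nothing to distill.
    return total / count if count else 0.0


student = [[1.0, 2.0], [3.0, 4.0]]
teacher = [[1.0, 0.0], [0.0, 4.0]]
mask = [[0, 1], [1, 0]]
print(position_aware_l2(student, teacher, mask))
```

Restricting the loss to masked positions means background locations contribute no gradient, which matches the stated goal of distilling only at crucial positions.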


Detection of Schools in Remote Sensing Images Based on Attention-Guided Dense Network

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10110736 ◽

2021 ◽

Vol 10 (11) ◽

pp. 736

Author(s):

Han Fu ◽

Xiangtao Fan ◽

Zhenzhen Yan ◽

Xiaoping Du

Keyword(s):

Remote Sensing ◽

Object Detection ◽

Feature Fusion ◽

State Of The Art ◽

Feature Representation ◽

Detection Accuracy ◽

Dense Network ◽

Remote Sensing Images ◽

Composite Object ◽

Detection Algorithms

The detection of primary and secondary schools (PSSs) is a meaningful task for composite object detection in remote sensing images (RSIs). As typical composite objects in RSIs, PSSs have diverse appearances and complex backgrounds, which makes it difficult to extract their features effectively using existing deep-learning-based object detection algorithms. To address the challenges of PSS detection, we propose an end-to-end framework called the attention-guided dense network (ADNet), which can effectively improve the detection accuracy of PSSs. First, a dual attention module (DAM) is designed to enhance the ability to represent complex characteristics and to alleviate distractions in the background. Second, a dense feature fusion module (DFFM) is built to promote the flow of attention cues into low layers, which guides the generation of hierarchical feature representation. Experimental results demonstrate that our proposed method outperforms the state-of-the-art methods and achieves 79.86% average precision. The study demonstrates the effectiveness of our proposed method for PSS detection.


Progressive Feature Polishing Network for Salient Object Detection

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6892 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12128-12135 ◽

Cited By ~ 1

Author(s):

Bo Wang ◽

Quan Chen ◽

Min Zhou ◽

Zhiqiang Zhang ◽

Xiaogang Jin ◽

...

Keyword(s):

Object Detection ◽

State Of The Art ◽

Hierarchical Structures ◽

Salient Object Detection ◽

Salient Object ◽

Post Processing ◽

Feature Maps ◽

Multiple Feature ◽

Benchmark Datasets ◽

Multi Level

Features matter for salient object detection. Existing methods mainly focus on designing a sophisticated structure to incorporate multi-level features and filter out cluttered features. We present Progressive Feature Polishing Network (PFPN), a simple yet effective framework to progressively polish the multi-level features to be more accurate and representative. By employing multiple Feature Polishing Modules (FPMs) in a recurrent manner, our approach is able to detect salient objects with fine details without any post-processing. An FPM updates the features of each level in parallel by directly incorporating all higher-level context information. Moreover, it keeps the dimensions and hierarchical structures of the feature maps, which makes it flexible to integrate with any CNN-based model. Empirical experiments show that our results improve monotonically as the number of FPMs increases. Without bells and whistles, PFPN outperforms the state-of-the-art methods significantly on five benchmark datasets under various evaluation metrics. Our code is available at: https://github.com/chenquan-cq/PFPN.


Object Relocation Visual Tracking Based On Histogram Filter And Siamese Network

10.21203/rs.3.rs-1201475/v1 ◽

2022 ◽

Author(s):

Jianlong Zhang ◽

Qiao Li ◽

Bin Wang ◽

Chen Chen ◽

Tianhong Wang ◽

...

Keyword(s):

Visual Tracking ◽

Image Matching ◽

Euclidean Distance ◽

State Of The Art ◽

Feature Representation ◽

Tracking Accuracy ◽

Match Filter ◽

Siamese Network ◽

Matching Process ◽

Dynamic Template

Siamese-network-based trackers formulate the visual tracking task as an image matching process with regression and classification branches, which simplifies the network structure and improves tracking accuracy. However, several problems remain. 1) The lightweight neural network reduces feature representation ability, so the tracker fails easily under distractors (e.g., deformation and similar objects) or large changes in viewing angle. 2) The tracker cannot adapt to variations of the object. 3) The tracker cannot relocate an object that it has failed to track. To address these issues, we first propose a novel match filter arbiter based on the Euclidean distance histogram between the centers of multiple candidate objects to automatically determine whether the tracker fails. Second, the Hopcroft-Karp algorithm is introduced to select the winners from the dynamic template set through a backtracking process, and object relocation is achieved by comparing the Gradient Magnitude Similarity Deviation between the template and the winners. The experiments show that our method obtains better performance than state-of-the-art methods on several tracking benchmarks, i.e., OTB100, VOT2018, GOT-10k and LaSOT.
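The Euclidean distance histogram between candidate centers can be illustrated with a small sketch. The bin width, the number of bins, and the function name are hypothetical, and the abstract does not say how the arbiter turns the histogram into a failure decision, so that step is left out:

```python
import math
from itertools import combinations


def center_distance_histogram(centers, bin_width=10.0, num_bins=5):
    """Histogram of pairwise Euclidean distances between candidate centers.
    Distances beyond the last bin edge are counted in the final bin.
    `bin_width` and `num_bins` are illustrative assumptions."""
    hist = [0] * num_bins
    for (x1, y1), (x2, y2) in combinations(centers, 2):
        d = math.hypot(x1 - x2, y1 - y2)
        idx = min(int(d // bin_width), num_bins - 1)
        hist[idx] += 1
    return hist


# Two candidates close together and one far away.
print(center_distance_histogram([(0, 0), (3, 4), (30, 40)]))
```

A histogram concentrated in the low bins indicates that the candidates agree on the object location, while mass in the high bins indicates scattered candidates.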


Vehicle and Pedestrian Detection Based on Multi-level Feature Fusion in Autonomous Driving

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813666200304123323 ◽

2020 ◽

Vol 13 ◽

Author(s):

Chen Guoqiang ◽

Yi Huailong ◽

Mao Zhuangzhuang

Keyword(s):

Autonomous Vehicles ◽

Feature Fusion ◽

Pedestrian Detection ◽

Autonomous Driving ◽

Seasonal Effects ◽

Detection Accuracy ◽

Semantic Features ◽

Feature Maps ◽

Safe Driving ◽

Multi Level

Aims: Factors such as light, weather, dynamic objects, seasonal effects and structures pose great challenges for autonomous driving algorithms in the real world. Autonomous vehicles need to detect various obstacles in complex scenes to ensure safe driving. Background: The ability to detect vehicles and pedestrians is critical to the safe driving of autonomous vehicles. Automated vehicle vision systems must handle extremely wide and challenging scenarios. Objective: The goal of the work is to design a robust detector for vehicles and pedestrians. The main contribution is the design of the Multi-level Feature Fusion Block (MFFB) and the Detector Cascade Block (DCB). Multi-level feature fusion and multi-step prediction are used, which greatly improve detection precision. Methods: The paper proposes a vehicle and pedestrian detector, which is an end-to-end deep convolutional neural network. The key parts are the MFFB and the DCB. The former combines contextual information with multi-level features, fusing high-resolution but low-semantic features with low-resolution but high-semantic features. The latter uses multi-step prediction, cascades a series of detectors, and combines the predictions of multiple feature maps to handle objects of different sizes. Results: Experiments on the RobotCar dataset and the KITTI dataset show that the algorithm achieves high precision with real-time detection. It achieves 84.61% mAP on the RobotCar dataset and 81.54% mAP on the well-known KITTI benchmark dataset. In particular, the detection accuracy for the single vehicle category reaches 90.02%. Conclusion: The experimental results show that the proposed algorithm offers a good trade-off between detection accuracy and detection speed, surpassing the current state-of-the-art RefineDet algorithm. The proposed 2D object detector addresses vehicle and pedestrian detection and improves accuracy, robustness and generalization ability in autonomous driving.


Enhanced Feature Representation in Detection for Optical Remote Sensing Images

Remote Sensing ◽

10.3390/rs11182095 ◽

2019 ◽

Vol 11 (18) ◽

pp. 2095 ◽

Cited By ~ 4

Author(s):

Kun Fu ◽

Zhuo Chen ◽

Yue Zhang ◽

Xian Sun

Keyword(s):

Remote Sensing ◽

State Of The Art ◽

Computational Cost ◽

Feature Representation ◽

Detection Accuracy ◽

Optical Remote Sensing ◽

Remote Sensing Images ◽

Two Stage ◽

Multi Scale ◽

One Stage

In recent years, deep learning has led to a remarkable breakthrough in object detection in remote sensing images. In practice, two-stage detectors perform well regarding detection accuracy but are slow. On the other hand, one-stage detectors integrate the detection pipeline of two-stage detectors to simplify the detection process, and are faster, but with lower detection accuracy. Enhancing the capability of feature representation may be a way to improve the detection accuracy of one-stage detectors. For this goal, this paper proposes a novel one-stage detector with enhanced capability of feature representation. The enhanced capability benefits from two proposed structures: dual top-down module and dense-connected inception module. The former efficiently utilizes multi-scale features from multiple layers of the backbone network. The latter both widens and deepens the network to enhance the ability of feature representation with limited extra computational cost. To evaluate the effectiveness of proposed structures, we conducted experiments on horizontal bounding box detection tasks on the challenging DOTA dataset and gained 73.49% mean Average Precision (mAP), achieving state-of-the-art performance. Furthermore, our method ran significantly faster than the best public two-stage detector on the DOTA dataset.


Low-Light Image Enhancement Based on Multi-Path Interaction

Sensors ◽

10.3390/s21154986 ◽

2021 ◽

Vol 21 (15) ◽

pp. 4986

Author(s):

Bai Zhao ◽

Xiaolin Gong ◽

Jian Wang ◽

Lingchao Zhao

Keyword(s):

Color Image ◽

State Of The Art ◽

Interaction Network ◽

Feature Representation ◽

Uniform Illumination ◽

Feature Maps ◽

Low Light ◽

Low Contrast ◽

Better Than

Due to the non-uniform illumination conditions, images captured by sensors often suffer from uneven brightness, low contrast and noise. In order to improve the quality of the image, in this paper, a multi-path interaction network is proposed to enhance the R, G, B channels, and then the three channels are combined into the color image and further adjusted in detail. In the multi-path interaction network, the feature maps in several encoding–decoding subnetworks are used to exchange information across paths, while a high-resolution path is retained to enrich the feature representation. Meanwhile, in order to avoid the possible unnatural results caused by the separation of the R, G, B channels, the output of the multi-path interaction network is corrected in detail to obtain the final enhancement results. Experimental results show that the proposed method can effectively improve the visual quality of low-light images, and the performance is better than the state-of-the-art methods.


Detection of Malicious Spatial-Domain Steganography over Noisy Channels Using Convolutional Neural Networks

Electronic Imaging ◽

10.2352/issn.2470-1173.2020.4.mwsf-076 ◽

2020 ◽

Vol 2020 (4) ◽

pp. 76-1-76-7

Author(s):

Swaroop Shankar Prasad ◽

Ofer Hadar ◽

Ilia Polian

Keyword(s):

State Of The Art ◽

Visual Quality ◽

Channel Noise ◽

Detection Accuracy ◽

Noisy Channel ◽

Noisy Channels ◽

Reliable Transmission ◽

Reliable Detection ◽

Natural Noise ◽

Will Force

Image steganography can have legitimate uses, for example, augmenting an image with a watermark for copyright reasons, but can also be utilized for malicious purposes. We investigate the detection of malicious steganography using neural network-based classification when images are transmitted through a noisy channel. Noise makes detection harder because the classifier must not only detect perturbations in the image but also decide whether they are due to malicious steganographic modifications or to natural noise. Our results show that reliable detection is possible even for state-of-the-art steganographic algorithms that insert stego bits without affecting an image’s visual quality. The detection accuracy is high (above 85%) if the payload, or the amount of steganographic content in an image, exceeds a certain threshold. At the same time, noise critically affects the steganographic information being transmitted, both through desynchronization (destruction of the information about which bits of the image contain steganographic information) and by flipping these bits themselves. This will force the adversary to use a redundant encoding with a substantial number of error-correction bits for reliable transmission, making detection feasible even for small payloads.
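The trade-off in the last sentence can be illustrated with the simplest redundant encoding, a repetition code with majority-vote decoding. This is only a sketch of why channel noise inflates the embedded payload; the abstract does not state which error-correcting code an adversary would use, and all function names here are hypothetical:

```python
def encode_repetition(bits, n=3):
    """Repeat every payload bit n times (n odd), multiplying the payload size by n."""
    return [b for b in bits for _ in range(n)]


def flip_bits(bits, positions):
    """Simulate channel noise by flipping the bits at the given positions."""
    flipped = list(bits)
    for p in positions:
        flipped[p] ^= 1
    return flipped


def decode_repetition(bits, n=3):
    """Majority vote over each group of n received bits."""
    return [1 if sum(bits[i:i + n]) * 2 > n else 0
            for i in range(0, len(bits), n)]


payload = [1, 0, 1]
sent = encode_repetition(payload)          # 9 bits embedded instead of 3
received = flip_bits(sent, [0, 4, 8])      # one flipped bit per group
print(decode_repetition(received) == payload)
```

Surviving one flipped bit per group costs three times as many embedded bits, and it is this larger footprint that the abstract argues makes detection easier.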


Smoke recognition network based on dynamic characteristics

International Journal of Advanced Robotic Systems ◽

10.1177/1729881420925662 ◽

2020 ◽

Vol 17 (3) ◽

pp. 172988142092566

Author(s):

Dahan Wang ◽

Sheng Luo ◽

Li Zhao ◽

Xiaoming Pan ◽

Muchou Wang ◽

...

Keyword(s):

Dynamic Characteristics ◽

State Of The Art ◽

The State ◽

Detection Accuracy ◽

Static Characteristics ◽

Good Tool ◽

Early Signal ◽

Fuzzy Objects ◽

The Difference ◽

Smoke Recognition

Fire is a fierce disaster, and smoke is the early signal of fire. Because features of smoke such as chrominance, texture, and shape are distinctive, many methods based on these features have been developed. However, these static characteristics vary widely, so exceptions arise that lead to low detection accuracy. In contrast, the motion of smoke is much more discriminating than the aforementioned features, so a time-domain neural network is proposed to extract its dynamic characteristics. This smoke recognition network has the following advantages: (1) it extracts spatiotemporal features with 3D filters that work on dynamic and static characteristics simultaneously; (2) it achieves high accuracy, with 87.31% of samples classified correctly, which is the state of the art even in chaotic environments, and it clearly distinguishes objects that confuse other methods, such as haze, fog, and climbing cars; (3) it has high sensitivity, detecting smoke on average at the 23rd frame, which is also the state of the art and is meaningful for raising a fire alarm as early as possible; and (4) it is not based on any hypothesis, which keeps the method broadly compatible. Finally, a new metric, the difference between the first frame in which smoke is detected and the first frame in which smoke appears, is proposed to compare the sensitivity of algorithms in videos. The experiments confirm that the dynamic characteristics are more discriminating than the aforementioned static characteristics, and that the smoke recognition network is a good tool for extracting compound features.
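The proposed sensitivity metric is a simple frame difference. A minimal sketch, assuming per-frame boolean detections and a known onset frame index (the function and parameter names are hypothetical):

```python
def detection_delay(detections, smoke_onset_frame):
    """Difference between the first frame in which smoke is detected and
    the first frame in which smoke appears. Returns None if smoke is never
    detected. `detections` is a per-frame sequence of booleans."""
    for frame_idx, detected in enumerate(detections):
        if detected:
            return frame_idx - smoke_onset_frame
    return None


# Smoke appears at frame 2 and is first detected at frame 5.
detections = [False, False, False, False, False, True, True]
print(detection_delay(detections, smoke_onset_frame=2))
```

A smaller delay means an earlier alarm. A negative value would indicate a detection before the annotated onset, that is, a false alarm under this literal reading of the metric.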
