A Deep Hyper Siamese Network for Real-Time Object Tracking

Siamese networks have drawn increasing interest in the field of visual object tracking due to their balance of precision and efficiency. However, Siamese trackers use relatively shallow backbone networks, such as AlexNet, and therefore do not take full advantage of the capabilities of modern deep convolutional neural networks (CNNs). Moreover, the feature representations of the target object in a Siamese tracker are extracted from the last layer of the CNN and mainly capture semantic information, which lowers the tracker's precision and makes it drift easily in the presence of similar distractors. In this paper, a new nonpadding residual unit (NPRU) is designed and used to stack a 22-layer deep ResNet, referred to as ResNet22. Using ResNet22 as the backbone network, we build a deep Siamese network that greatly enhances tracking performance. Considering that the different levels of the feature maps of the CNN represent different aspects of the target object, we aggregate different deep convolutional layers to make use of ResNet22's multilevel feature maps, forming hyperfeature representations of targets. The designed deep hyper Siamese network is named DHSiam. Experimental results show that DHSiam achieves significant improvement on multiple benchmark datasets.
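
The aggregation of multilevel feature maps described above can be sketched as follows. The abstract does not say how DHSiam combines its layers, so this sketch assumes one simple option: center-crop every feature map to the smallest spatial size and concatenate along the channel axis. The names center_crop and hyperfeature and the array shapes are hypothetical.

```python
import numpy as np

def center_crop(fmap, size):
    """Crop a (C, H, W) feature map to (C, size, size) around its center."""
    _, h, w = fmap.shape
    top = (h - size) // 2
    left = (w - size) // 2
    return fmap[:, top:top + size, left:left + size]

def hyperfeature(fmaps):
    """Assumed aggregation: crop all maps to the smallest spatial size,
    then concatenate along the channel axis."""
    size = min(min(f.shape[1], f.shape[2]) for f in fmaps)
    return np.concatenate([center_crop(f, size) for f in fmaps], axis=0)

# Hypothetical shapes: a shallow, high-resolution map and a deep, coarse one.
rng = np.random.default_rng(0)
shallow = rng.random((64, 12, 12))
deep = rng.random((256, 6, 6))

hyper = hyperfeature([shallow, deep])
print(hyper.shape)
```

Because unpadded convolutions shrink the map at every layer, deeper maps are spatially smaller, which is why the sketch crops rather than pads.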

Residual Augmented Attentional U-Shaped Network for Spectral Reconstruction from RGB Images

Remote Sensing ◽

10.3390/rs13010115 ◽

2020 ◽

Vol 13 (1) ◽

pp. 115

Author(s):

Jiaojiao Li ◽

Chaoxiong Wu ◽

Rui Song ◽

Yunsong Li ◽

Weiying Xie

Keyword(s):

Superior Performance ◽

Feature Maps ◽

Deep Convolutional Neural Networks ◽

Second Order Statistics ◽

Feature Representations ◽

Quantitative Measurements ◽

Spectral Reconstruction ◽

Perceptual Comparison ◽

Benchmark Datasets ◽

Rgb Images

Deep convolutional neural networks (CNNs) have been successfully applied to spectral reconstruction (SR) and have achieved superior performance. Nevertheless, the existing CNN-based SR approaches integrate hierarchical features from different layers indiscriminately, lacking an investigation of the relationships of intermediate feature maps, which limits the learning power of CNNs. To tackle this problem, we propose a deep residual augmented attentional U-shaped network (RA2UN) with several double improved residual blocks (DIRB) instead of paired plain convolutional units. Specifically, a trainable spatial augmented attention (SAA) module is developed to bridge the encoder and decoder to emphasize the features in the informative regions. Furthermore, we present a novel channel augmented attention (CAA) module embedded in the DIRB to adaptively rescale features and enhance residual learning by using first-order and second-order statistics for stronger feature representations. Finally, a boundary-aware constraint is employed to focus on the salient edge information and recover more accurate high-frequency details. Experimental results on four benchmark datasets demonstrate that the proposed RA2UN network outperforms the state-of-the-art SR methods under quantitative measurements and perceptual comparison.
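
As a rough illustration of channel rescaling driven by first-order and second-order statistics, the sketch below gates each channel using its spatial mean and standard deviation. It is not the paper's CAA module, whose structure the abstract does not give. The two-layer gate, the weight matrices w1 and w2, and the choice to sum the two statistics are all assumptions made for illustration.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Rescale the channels of x (C, H, W) with a gate built from
    per-channel mean (first-order) and std (second-order) statistics."""
    mean = x.mean(axis=(1, 2))
    std = x.std(axis=(1, 2))
    descriptor = mean + std                      # assumed way to combine them
    hidden = np.maximum(0.0, w1 @ descriptor)    # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid, one value per channel
    return x * gate[:, None, None], gate

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))   # hypothetical weights, learned in practice
w2 = rng.standard_normal((8, 2))

out, gate = channel_attention(x, w1, w2)
print(out.shape, gate.round(3))
```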

Siamese anchor-free object tracking with multiscale spatial attentions

Scientific Reports ◽

10.1038/s41598-021-02095-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Jianming Zhang ◽

Benben Huang ◽

Zi Ye ◽

Li-Dan Kuang ◽

Xin Ning

Keyword(s):

Object Tracking ◽

Spatial Information ◽

Target Position ◽

Visual Object ◽

Feature Maps ◽

Free Object ◽

Widespread Application ◽

Great Performance ◽

Free Classification ◽

Siamese Networks

Recently, object trackers based on Siamese networks have attracted considerable attention due to their remarkable tracking performance and widespread application. In particular, anchor-based methods exploit the region proposal subnetwork to obtain an accurate prediction of the target and achieve large performance improvements. However, those trackers cannot capture the spatial information very well, and the pre-defined anchors hinder robustness. To solve these problems, we propose a Siamese-based anchor-free object tracking algorithm with multiscale spatial attentions in this paper. Firstly, we take ResNet-50 as the backbone network to generate multiscale features of both the template patch and the search regions. Secondly, we propose the spatial attention extraction (SAE) block to capture the spatial information among all positions in the template and search region feature maps. Thirdly, we put these features into the SAE block to get the multiscale spatial attentions. Finally, an anchor-free classification and regression subnetwork is used for predicting the location of the target. Unlike anchor-based methods, our tracker directly predicts the target position without predefined parameters. Extensive experiments with state-of-the-art trackers are carried out on four challenging visual object tracking benchmarks: OTB100, UAV123, VOT2016 and GOT-10k. These experimental results confirm the effectiveness of our proposed tracker.
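
To show what predicting the target position without predefined anchors can look like, the sketch below decodes a box from per-pixel outputs. It assumes an FCOS-style parameterization in which each location regresses its distances to the left, top, right, and bottom box edges. The abstract does not confirm that this tracker uses this encoding, and the stride, offset, and function name are hypothetical.

```python
import numpy as np

def decode_anchor_free(cls_score, ltrb, stride=8, offset=0):
    """Pick the highest-scoring location and turn its (l, t, r, b)
    distances into an (x1, y1, x2, y2) box in search-image pixels.

    cls_score: (H, W) foreground scores.
    ltrb:      (4, H, W) distances from each location to the box edges.
    """
    iy, ix = np.unravel_index(np.argmax(cls_score), cls_score.shape)
    cx = offset + ix * stride   # map the grid cell back to image coordinates
    cy = offset + iy * stride
    l, t, r, b = ltrb[:, iy, ix]
    return (cx - l, cy - t, cx + r, cy + b)

cls_score = np.zeros((5, 5))
cls_score[2, 3] = 0.9           # best location: row 2, column 3
ltrb = np.empty((4, 5, 5))
ltrb[0], ltrb[1], ltrb[2], ltrb[3] = 10.0, 12.0, 14.0, 16.0

box = decode_anchor_free(cls_score, ltrb)
print(box)
```

No anchor shapes or aspect ratios appear anywhere in the decode step, which is the property the abstract contrasts with anchor-based methods.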

Attention Modulated Multiple Object Tracking with Motion Enhancement and Dual Correlation

Symmetry ◽

10.3390/sym13020266 ◽

2021 ◽

Vol 13 (2) ◽

pp. 266 ◽

Cited By ~ 1

Author(s):

Yifeng Wang ◽

Zhijiang Zhang ◽

Ning Zhang ◽

Dan Zeng

Keyword(s):

Object Tracking ◽

Multiple Object Tracking ◽

Tracking Accuracy ◽

Feature Maps ◽

Backbone Networks ◽

Multiple Object ◽

The Arts ◽

Training Stage ◽

The One ◽

Baseline State

The one-shot multiple object tracking (MOT) framework has drawn more and more attention in the MOT research community due to its advantage in inference speed. However, the tracking accuracy of current one-shot approaches is often inferior to that of their two-stage counterparts. The reasons are two-fold. First, motion information is often neglected due to the single-image input. Second, detection and re-identification (ReID) are two different tasks with different focuses, so joining them at the training stage can lead to suboptimal performance. To alleviate these limitations, we propose a one-shot network named Motion and Correlation-Multiple Object Tracking (MAC-MOT). MAC-MOT introduces a motion enhance attention module (MEA) and a dual correlation attention module (DCA). MEA computes differences between adjacent feature maps, which enhances the motion-related features while suppressing irrelevant information. The DCA module focuses on decoupling the detection task and the re-identification task to strike a balance and reduce the competition between these two tasks. Moreover, symmetry is a core design idea in our proposed framework, which is reflected in the Siamese-based deep learning backbone networks, the input of dual stream images, and the dual correlation attention module. Our proposed approach is evaluated on the popular multiple object tracking benchmarks MOT16 and MOT17. We demonstrate that the proposed MAC-MOT achieves better performance than the state-of-the-art (SOTA) baselines.
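
The idea of differencing adjacent feature maps to emphasize motion can be sketched as follows. This is only an illustration of the principle, not the MEA module itself. The normalization by the maximum difference and the multiplicative (1 + attention) reweighting are assumptions, as are all names.

```python
import numpy as np

def motion_enhance(prev_feat, curr_feat, eps=1e-8):
    """Reweight curr_feat (C, H, W) by a spatial attention map derived
    from the absolute difference between adjacent-frame features."""
    diff = np.abs(curr_feat - prev_feat).mean(axis=0)   # (H, W) motion energy
    attn = diff / (diff.max() + eps)                    # scaled to [0, 1)
    return curr_feat * (1.0 + attn[None, :, :]), attn

# Toy features: only a 2x2 region changes between the two frames.
prev_feat = np.ones((4, 6, 6))
curr_feat = np.ones((4, 6, 6))
curr_feat[:, 2:4, 2:4] = 3.0

enhanced, attn = motion_enhance(prev_feat, curr_feat)
print(attn.round(2))
```

Static locations receive zero attention and pass through unchanged, while the moving region is amplified.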

Adaptive Channel Selection for Robust Visual Object Tracking with Discriminative Correlation Filters

International Journal of Computer Vision ◽

10.1007/s11263-021-01435-1 ◽

2021 ◽

Author(s):

Tianyang Xu ◽

Zhenhua Feng ◽

Xiao-Jun Wu ◽

Josef Kittler

Keyword(s):

Object Tracking ◽

Augmented Lagrangian Method ◽

Channel Selection ◽

Image Feature ◽

Superior Performance ◽

Appearance Model ◽

Visual Object ◽

Correlation Filters ◽

Visual Object Tracking ◽

Feature Representations

Discriminative Correlation Filters (DCF) have been shown to achieve impressive performance in visual object tracking. However, existing DCF-based trackers rely heavily on learning regularised appearance models from invariant image feature representations. To further improve the performance of DCF in accuracy and provide a parsimonious model from the attribute perspective, we propose to gauge the relevance of multi-channel features for the purpose of channel selection. This is achieved by assessing the information conveyed by the features of each channel as a group, using an adaptive group elastic net inducing independent sparsity and temporal smoothness on the DCF solution. The robustness and stability of the learned appearance model are significantly enhanced by the proposed method as the process of channel selection performs implicit spatial regularisation. We use the augmented Lagrangian method to optimise the discriminative filters efficiently. The experimental results obtained on a number of well-known benchmarking datasets demonstrate the effectiveness and stability of the proposed method. A superior performance over the state-of-the-art trackers is achieved using less than 10% of the deep feature channels.
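
Treating each channel as a group and penalizing its norm is what makes whole channels drop out. The sketch below shows the standard group soft-thresholding step (the proximal operator of a group-lasso penalty), which is a common building block in augmented Lagrangian solvers. It is not the paper's adaptive group elastic net, which also includes a temporal smoothness term. The filter values and the penalty weight lam are made up for illustration.

```python
import numpy as np

def group_soft_threshold(filters, lam):
    """Shrink each channel of filters (C, H, W) as one group.
    Channels whose L2 norm is below lam are set exactly to zero."""
    norms = np.sqrt((filters ** 2).sum(axis=(1, 2)))
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return filters * scale[:, None, None]

# Three hypothetical 4x4 channel filters with L2 norms 0.04, 4.0 and 2.0.
filters = np.stack([
    np.full((4, 4), 0.01),
    np.full((4, 4), 1.0),
    np.full((4, 4), 0.5),
])

shrunk = group_soft_threshold(filters, lam=1.0)
selected = np.flatnonzero(np.abs(shrunk).sum(axis=(1, 2)) > 0)
print(selected)
```

The weak first channel is removed entirely, while the two stronger channels are kept with reduced magnitude.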

An Anchor-Free Siamese Network with Multi-Template Update for Object Tracking

Electronics ◽

10.3390/electronics10091067 ◽

2021 ◽

Vol 10 (9) ◽

pp. 1067

Author(s):

Tongtong Yuan ◽

Wenzhu Yang ◽

Qian Li ◽

Yuxia Wang

Keyword(s):

Object Tracking ◽

Correlation Energy ◽

Feature Maps ◽

Siamese Network ◽

Template Update ◽

Free Network ◽

Multiple Prediction ◽

Bounding Boxes ◽

High Level ◽

Speed And Accuracy

Siamese trackers are widely used in various fields because they balance speed and accuracy. Compared with the anchor-based method, the anchor-free approach can reach faster speeds without any drop in precision. Inspired by the Siamese network and the anchor-free idea, an anchor-free Siamese network (AFSN) with multi-template updates for object tracking is proposed. To improve tracking performance, a dual-fusion method is adopted in which the multi-layer features and the multiple prediction results are each combined. The low-level feature maps are concatenated with the high-level feature maps to make full use of both spatial and semantic information. To make the results as stable as possible, the final results are obtained by combining multiple prediction results. For the template update, a high-confidence multi-template update mechanism is used. The average peak-to-correlation energy (APCE) is used to determine whether the template should be updated. We use the anchor-free network to implement object tracking in a per-pixel manner, which computes the object category and bounding boxes directly. Experimental results indicate that the average overlap and success rate of the proposed algorithm increase by about 5% and 10%, respectively, compared to the SiamRPN++ algorithm on the GOT-10k (Generic Object Tracking Benchmark) dataset.
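
The average peak-to-correlation energy of a response map F is commonly defined as |F_max - F_min|^2 divided by the mean of (F - F_min)^2 over all positions. The sketch below computes it and applies a simple update rule. The fixed threshold and the function should_update are assumptions. The abstract says only that APCE decides whether to update the template, not how.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of a 2-D response map.
    High values indicate one sharp peak; low values indicate a flat
    or multi-peaked (ambiguous) response."""
    f_max = response.max()
    f_min = response.min()
    denom = np.mean((response - f_min) ** 2)
    if denom == 0:
        return 0.0   # constant map: no peak at all
    return float((f_max - f_min) ** 2 / denom)

def should_update(response, threshold):
    """Hypothetical rule: update the template only on confident frames."""
    return apce(response) >= threshold

sharp = np.zeros((5, 5))
sharp[2, 2] = 1.0                    # single clear peak

ambiguous = np.zeros((5, 5))
ambiguous[1, 1] = 1.0                # two equally strong peaks,
ambiguous[3, 3] = 1.0                # e.g. target plus a distractor

apce_sharp = apce(sharp)
apce_ambiguous = apce(ambiguous)
print(apce_sharp, apce_ambiguous)
```

With a threshold between the two scores, the template would be refreshed on the sharp frame and left untouched on the ambiguous one.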

Multiple Context Features in Siamese Networks for Visual Object Tracking

Lecture Notes in Computer Science - Computer Vision – ECCV 2018 Workshops ◽

10.1007/978-3-030-11009-3_6 ◽

2019 ◽

pp. 116-131

Author(s):

Henrique Morimitsu

Keyword(s):

Object Tracking ◽

Visual Object ◽

Visual Object Tracking ◽

Multiple Context ◽

Context Features ◽

Siamese Networks

Distractor-Aware Siamese Networks for Visual Object Tracking

Computer Vision – ECCV 2018 - Lecture Notes in Computer Science ◽

10.1007/978-3-030-01240-3_7 ◽

2018 ◽

pp. 103-119 ◽

Cited By ~ 130

Author(s):

Zheng Zhu ◽

Qiang Wang ◽

Bo Li ◽

Wei Wu ◽

Junjie Yan ◽

...

Keyword(s):

Object Tracking ◽

Visual Object ◽

Visual Object Tracking ◽

Siamese Networks

Robust real-time visual object tracking via multi-scale fully convolutional Siamese networks

Multimedia Tools and Applications ◽

10.1007/s11042-018-5664-7 ◽

2018 ◽

Vol 77 (17) ◽

pp. 22131-22143 ◽

Cited By ~ 4

Author(s):

Longchao Yang ◽

Peilin Jiang ◽

Fei Wang ◽

Xuan Wang

Keyword(s):

Object Tracking ◽

Real Time ◽

Visual Object ◽

Visual Object Tracking ◽

Multi Scale ◽

Siamese Networks

Complementary Object Tracking Using Average Peak-to-Correlation Energy

10.3233/faia210046 ◽

2021 ◽

Author(s):

Kosuke Honda ◽

Hamido Fujita

Keyword(s):

Neural Networks ◽

Object Tracking ◽

Convolutional Neural Networks ◽

Correlation Energy ◽

Target Object ◽

The Other ◽

Tracking Performance ◽

Correlation Filter ◽

Evaluation Index ◽

Siamese Network

In recent years, template-based methods such as Siamese network trackers and Correlation Filter (CF) based trackers have achieved state-of-the-art performance in several benchmarks. Recent Siamese network trackers use deep features extracted from convolutional neural networks to locate the target. However, the tracking performance of these trackers decreases when there are distractors similar to the object and when the target object is deformed. On the other hand, CF-based trackers use handcrafted features (e.g., HOG features) to spatially locate the target. These two approaches have complementary characteristics due to differences in learning methods, features used, and the size of search regions. Also, we found that these trackers are complementary in terms of performance in benchmarking. Therefore, we propose the “Complementary Tracking framework using Average peak-to-correlation energy” (CTA). CTA is a generic object tracking framework that connects CF-trackers and Siamese-trackers in parallel and exploits their complementary features. In CTA, when a tracking failure of the Siamese tracker is detected using Average peak-to-correlation energy (APCE), which is an evaluation index of the response map matrix, the CF-trackers correct the output. In experiments on OTB100, CTA significantly improves the performance over the original tracker for several combinations of Siamese-trackers and CF-trackers.
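
The switching behaviour described above can be sketched as a small selection function. It assumes both trackers have already produced a box for the current frame and that an APCE score for the Siamese response map is available. The failure test (score below a fraction of the running average) and the ratio value are assumptions, since the abstract does not state the detection rule.

```python
def select_output(siamese_box, cf_box, apce_score, apce_history, ratio=0.5):
    """Return (box, failed). Use the CF tracker's box when the Siamese
    tracker's APCE falls well below its recent average."""
    if apce_history:
        average = sum(apce_history) / len(apce_history)
        if apce_score < ratio * average:
            return cf_box, True      # Siamese failure detected: CF corrects
    return siamese_box, False        # Siamese output trusted

# Hypothetical boxes (x1, y1, x2, y2) and APCE scores from earlier frames.
siamese_box = (40, 40, 80, 80)
cf_box = (10, 12, 50, 52)
history = [20.0, 22.0, 18.0]         # running average is 20.0

confident = select_output(siamese_box, cf_box, 19.0, history)
failed = select_output(siamese_box, cf_box, 5.0, history)
print(confident, failed)
```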

Learning Soft Mask Based Feature Fusion with Channel and Spatial Attention for Robust Visual Object Tracking

Sensors ◽

10.3390/s20144021 ◽

2020 ◽

Vol 20 (14) ◽

pp. 4021 ◽

Cited By ~ 2

Author(s):

Mustansar Fiaz ◽

Arif Mahmood ◽

Soon Ki Jung

Keyword(s):

Object Tracking ◽

Spatial Attention ◽

Feature Fusion ◽

State Of The Art ◽

Feature Representation ◽

Visual Object ◽

Target Feature ◽

Visual Object Tracking ◽

Low Level ◽

Benchmark Datasets

We propose to improve visual object tracking by introducing a soft mask based low-level feature fusion technique. The proposed technique is further strengthened by integrating channel and spatial attention mechanisms. The proposed approach is integrated within a Siamese framework to demonstrate its effectiveness for visual object tracking. The proposed soft mask gives more importance to the target regions than to the other regions, enabling effective target feature representation and increasing discriminative power. The low-level feature fusion improves the tracker's robustness against distractors. The channel attention is used to identify more discriminative channels for better target representation. The spatial attention complements the soft mask based approach to better localize the target objects in challenging tracking scenarios. We evaluated our proposed approach on five publicly available benchmark datasets and performed extensive comparisons with 39 state-of-the-art tracking algorithms. The proposed tracker demonstrates excellent performance compared to the existing state-of-the-art trackers.
