An Enhanced Visual Attention Siamese Network That Updates Template Features Online
Recently, Siamese trackers have attracted extensive attention because of their simplicity and low computational cost. However, most Siamese trackers use only a single frame of the video sequence as the template and do not update the template during inference, which makes their tracking success rate inferior to that of trackers that can update the template online. In this study, we introduce an enhanced visual attention Siamese network (ESA-Siam). The method is based on a deep convolutional neural network that integrates channel attention and spatial self-attention to improve the tracker's ability to discriminate between positive and negative samples. Channel attention weights different targets according to the response values of different channels to achieve better target representation. Spatial self-attention captures the correlation between any two spatial positions to help locate the target. In addition, a template search attention module is designed to implicitly update the template features online, which effectively improves the success rate of the tracker when the target suffers from background interference. The proposed ESA-Siam tracker shows superior performance compared with 18 existing state-of-the-art trackers on five benchmark datasets: OTB50, OTB100, VOT2016, VOT2018, and LaSOT.
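To make the two attention mechanisms concrete, the sketch below shows a minimal, framework-free interpretation of channel attention (squeeze-and-excitation style: pool each channel, gate it, rescale its responses) and spatial self-attention (non-local style: each position attends to every other position via a softmax over dot-product similarities). This is an illustrative simplification, not the paper's actual architecture; the gating function, feature shapes, and the absence of learned projection weights are all assumptions made for brevity.

```python
import math

def channel_attention(feat):
    """Simplified channel attention (assumption: a plain sigmoid gate on the
    global-average-pooled response stands in for the learned excitation MLP).
    `feat` is a [C][H][W] nested list of floats."""
    gates = []
    for ch in feat:
        pooled = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gates.append(1.0 / (1.0 + math.exp(-pooled)))  # sigmoid gate per channel
    # rescale each channel's responses by its gate
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feat, gates)]

def spatial_self_attention(feat):
    """Simplified spatial self-attention (assumption: queries, keys, and values
    are the raw C-dim feature vectors, with no learned projections).
    Each of the H*W positions attends to all positions."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # flatten to H*W position vectors of length C
    pos = [[feat[c][i][j] for c in range(C)] for i in range(H) for j in range(W)]
    out = []
    for q in pos:
        scores = [sum(a * b for a, b in zip(q, k)) for k in pos]
        m = max(scores)                       # numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of all position vectors
        out.append([sum(w * k[c] for w, k in zip(weights, pos)) for c in range(C)])
    # reshape back to [C][H][W]
    return [[[out[i * W + j][c] for j in range(W)] for i in range(H)] for c in range(C)]

# toy 2-channel, 2x2 feature map
feat = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
        [[0.5, 0.5], [0.5, 0.5]]]   # channel 1
ca = channel_attention(feat)
sa = spatial_self_attention(feat)
```

In a real tracker both modules would carry learned parameters and operate on deep backbone features; the point here is only the data flow: channel attention reweights whole channels, while spatial self-attention mixes information across positions so that each location's feature reflects globally correlated context.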