Learning Spatial Fusion and Matching for Visual Object Tracking
Siamese-network-based trackers have achieved outstanding performance in visual object tracking; in essence, they apply efficient cross-correlation as the matching function. However, experiments show that cross-correlation-based matching struggles to produce accurate tracking results in challenging conditions such as background clutter and fast motion. To address this, a new Siamese-based tracker named SiamFAM is proposed. Specifically, from the perspective of feature fusion, a new matching function named Concatenation is introduced into our tracker, which reduces the influence of background clutter through fine-grained matching with little computational overhead. In addition, an adaptively spatial feature fusion (ASFF) module is proposed, which makes full use of multi-layer features and reduces poor results during prediction. A refinement module is also adopted to reduce tracking drift. Extensive experiments on six challenging benchmarks (VOT2016, VOT2019, UAV123, NFS, OTB100, and LaSOT) demonstrate that our tracker is practical and achieves leading performance.
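To make two of the ideas above concrete, the following is a minimal NumPy sketch, not the SiamFAM implementation. It shows (a) plain cross-correlation used as a matching function between a template feature map and a search feature map, and (b) one plausible reading of adaptive spatial fusion, in which multi-layer features are combined with per-location softmax weights. The function names, tensor shapes, and the softmax weighting are assumptions made for illustration. The Concatenation matching function is not sketched because the abstract does not describe its details.

```python
import numpy as np


def cross_correlation(search, template):
    """Naive sliding-window cross-correlation (illustrative, not optimized).

    search:   array of shape (C, H, W), features of the search region
    template: array of shape (C, h, w), features of the target template
    returns:  response map of shape (H - h + 1, W - w + 1)
    """
    _, H, W = search.shape
    _, h, w = template.shape
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            # Inner product between the template and the current window.
            response[i, j] = np.sum(search[:, i:i + h, j:j + w] * template)
    return response


def adaptive_spatial_fusion(features, logits):
    """Fuse L feature maps with per-location softmax weights.

    ASSUMPTION: this softmax-over-levels form is a hypothetical reading of
    "adaptively spatial feature fusion"; the source gives no formula.

    features: array of shape (L, C, H, W), same-sized features from L layers
    logits:   array of shape (L, H, W), unnormalized per-location scores
              (learned in a real tracker; supplied by hand here)
    returns:  (fused, weights) with shapes (C, H, W) and (L, H, W)
    """
    # Subtract the per-location maximum for numerical stability.
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Broadcast the weights over the channel axis and sum over levels.
    fused = (weights[:, None, :, :] * features).sum(axis=0)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Place a strictly positive template inside an otherwise empty search map.
    template = rng.random((4, 3, 3)) + 0.1
    search = np.zeros((4, 8, 8))
    search[:, 2:5, 3:6] = template
    response = cross_correlation(search, template)
    peak = np.unravel_index(np.argmax(response), response.shape)
    print("response peak (row, col):", peak)

    # Fuse three hypothetical layers using uniform logits.
    features = rng.random((3, 4, 5, 5))
    fused, weights = adaptive_spatial_fusion(features, np.zeros((3, 5, 5)))
    print("fused shape:", fused.shape)
```

In this toy setup the response peak falls where the template was placed, because the search map is empty elsewhere. In cluttered scenes, distractor regions can produce comparably high responses, which is the failure mode that motivates an alternative matching function.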