Transformer-based trackers have achieved high accuracy on standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. In this paper, to mitigate this issue, we propose a fully transformer-based tracking framework built on the successful MixFormer tracker [14], coined MixFormerV2, without any dense convolutional operations or complex score prediction modules. We introduce four special prediction tokens and concatenate them with the tokens from the target template and search area. We then apply a simple transformer backbone to this mixed token sequence. The prediction tokens capture the complex correlation between the target template and the search area via mixed attention. Based on them, we can easily predict the tracking box and estimate its confidence score with simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm comprising dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head-based MixViT to our fully transformer-based tracker, while the latter prunes the backbone layers.
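
For illustration, the following is a minimal PyTorch sketch of the prediction-token idea described above, not the released MixFormerV2 code: the class name `PredictionTokenTracker`, the token dimensions, the per-token box head, and the use of a plain `nn.TransformerEncoder` in place of the mixed-attention backbone are assumptions made only to show how prediction tokens are concatenated with template and search tokens and read out by simple MLP heads.

```python
# Illustrative sketch of a prediction-token tracker head (assumed names/shapes,
# not the authors' implementation).
import torch
import torch.nn as nn


class PredictionTokenTracker(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, num_pred_tokens=4):
        super().__init__()
        # Learnable prediction tokens (four, as in the description above).
        self.pred_tokens = nn.Parameter(torch.zeros(1, num_pred_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Plain encoder stands in for the mixed-attention backbone.
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Simple MLP heads: box regression and confidence score.
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.score_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, template_tokens, search_tokens):
        # template_tokens: (B, N_t, dim), search_tokens: (B, N_s, dim)
        b = template_tokens.size(0)
        pred = self.pred_tokens.expand(b, -1, -1)
        # Concatenate prediction, template, and search tokens into one sequence
        # so attention can mix information across all of them.
        tokens = torch.cat([pred, template_tokens, search_tokens], dim=1)
        out = self.backbone(tokens)
        pred_out = out[:, :pred.size(1)]  # read the prediction tokens back out
        # One scalar per prediction token -> 4 box values (an illustrative choice).
        box = self.box_head(pred_out).squeeze(-1)                 # (B, 4)
        score = self.score_head(pred_out.mean(dim=1)).sigmoid()   # (B, 1)
        return box, score


# Usage with dummy template and search token sequences.
model = PredictionTokenTracker()
box, score = model(torch.randn(2, 49, 256), torch.randn(2, 256, 256))
```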