Animal re-identification is fundamental to the ecology and ethology community for understanding the Earth's ecosystems. Because photographing animals in the wild is inherently unpredictable (camera position and angle vary, as do individual appearance, pose, and habitat), building image-based visual models for re-identifying individual animals is a challenging task. We propose a novel framework, hierarchical spatial-frequency fusion transformers, to address animal re-identification. We first use spatial- and frequency-domain representation learning to capture effective deep features, and then employ hierarchical transformer-based representation learning to consolidate low-level detailed information into high-level semantic information from a global perspective. Through this process, we construct a deeply supervised nonlinear aggregation method that enhances finer multi-scale, multi-level, and cross-domain features. Our method illustrates the principle of "doing the best of all together": only when the frequency and spatial domains are combined can the best performance be achieved. Experimental results demonstrate that our approach significantly outperforms other state-of-the-art methods.
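To make the spatial-frequency fusion idea concrete, the following is a minimal NumPy sketch, not the paper's actual architecture: it pools raw intensities as a stand-in for the spatial branch, pools FFT log-magnitudes as a stand-in for the frequency branch, and concatenates the two into one fused descriptor. All function names, the pooling scheme, and the grid size are illustrative assumptions.

```python
import numpy as np

def _avg_pool(arr, grid):
    """Average-pool a 2-D array into a grid x grid summary (toy stand-in
    for a learned feature extractor)."""
    h, w = arr.shape
    arr = arr[: h - h % grid, : w - w % grid]  # crop so blocks tile evenly
    bh, bw = arr.shape[0] // grid, arr.shape[1] // grid
    return arr.reshape(grid, bh, grid, bw).mean(axis=(1, 3))

def spatial_frequency_features(image, grid=4):
    """Hypothetical fused descriptor: spatial pooling + FFT magnitude pooling.

    `image` is a 2-D grayscale array; `grid` controls descriptor size.
    """
    # Spatial-domain branch: pooled pixel intensities.
    spatial = _avg_pool(image, grid)
    # Frequency-domain branch: pooled log-magnitude of the 2-D DFT.
    freq = _avg_pool(np.log1p(np.abs(np.fft.fft2(image))), grid)
    # Fuse the two domains into a single feature vector.
    return np.concatenate([spatial.ravel(), freq.ravel()])
```

A real system would replace both branches with learned deep features and fuse them with transformer layers, but the shape of the idea, two domain-specific representations merged into one descriptor used for matching individuals, is the same.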