Plant and Animal Species Recognition Based on Dynamic Vision Transformer Architecture

Hang Pan, Zhiliang Wang, Lun Xie

doi:10.3390/rs14205242

Open Access

Journal: Remote Sensing
Publication Date: Oct 20, 2022
Citations: 4
License type: CC BY 4.0

Affiliation: University of Science and Technology Beijing

Abstract

Automatically predicting the plant and animal species most likely to be observed at a given geo-location is useful in many biodiversity management and conservation scenarios. However, the sparseness of aerial images results in only small differences in image appearance between species categories. In this paper, we propose a novel Dynamic Vision Transformer (DViT) architecture that reduces the effect of these small image discrepancies by recognizing plant and animal species from both aerial images and geo-location environment information. We extract the latent representation of the multimodal aerial images by sampling a subset of patches with low attention weights in the transformer encoder model, using a learnable mask token. At the same time, a dynamic attention fusion model adds the geo-location environment information to the extraction of the latent representation from aerial images and fuses it with the tokens that have high attention weights, improving the distinguishability of the representation. The proposed DViT method is evaluated on the GeoLifeCLEF 2021 and 2022 datasets, where it achieves state-of-the-art performance. The experimental results show that fusing aerial images with multimodal geo-location environment information contributes to plant and animal species recognition.
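
The token handling described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the 50/50 split ratio, the random sampling of low-attention patches, and the fusion by linear blending with weight `alpha` are all assumptions made for illustration, and the mask token is a fixed vector here rather than a learned one.

```python
import random

def split_by_attention(attn, high_ratio=0.5):
    """Split patch indices into high- and low-attention groups (ratio is an assumption)."""
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    n_high = max(1, int(len(attn) * high_ratio))
    return sorted(order[:n_high]), sorted(order[n_high:])

def sample_low_attention(tokens, low_idx, mask_token, n_keep, rng):
    """Keep a random subset of low-attention patches; replace the rest with the mask token."""
    kept = set(rng.sample(low_idx, min(n_keep, len(low_idx))))
    out = [list(t) for t in tokens]
    for i in low_idx:
        if i not in kept:
            out[i] = list(mask_token)
    return out, sorted(kept)

def fuse_environment(tokens, high_idx, env, alpha=0.5):
    """Blend an environment vector into the high-attention tokens (hypothetical fusion rule)."""
    out = [list(t) for t in tokens]
    for i in high_idx:
        out[i] = [(1 - alpha) * x + alpha * e for x, e in zip(tokens[i], env)]
    return out

# Toy example: four patches with 2-dimensional embeddings.
attn = [0.05, 0.40, 0.10, 0.45]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
env = [2.0, 2.0]          # stand-in for encoded geo-location environment features
mask_token = [9.0, 9.0]   # fixed here; learnable in the paper's description

high_idx, low_idx = split_by_attention(attn)
masked, kept = sample_low_attention(tokens, low_idx, mask_token, n_keep=1, rng=random.Random(0))
fused = fuse_environment(masked, high_idx, env)
```

In this toy run the two patches with the largest attention weights receive the environment vector, while one of the two low-attention patches is kept and the other is replaced by the mask token.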

Full Text