Visual recognition methods based on deep convolutional neural networks have performed well in pest diagnosis and have gradually become a research hotspot. However, agricultural pest recognition still faces challenges such as scarce labeled samples (few-shot learning), class imbalance, high inter-class visual similarity, and small pest targets. Existing deep learning-based pest recognition methods typically rely on unimodal image data alone, so their recognition performance depends heavily on the size and quality of the annotated training dataset. Constructing large-scale, high-quality pest datasets, however, incurs substantial economic and technical costs, which limits the practical generalization of existing methods. To address these challenges, this paper proposes a few-shot pest recognition model called MMAE (multimodal masked autoencoder). First, MMAE's masked autoencoder performs self-supervised learning, which makes it applicable to few-shot datasets and improves recognition accuracy. Second, MMAE embeds textual modal information on top of the image modality, exploiting the correlation and complementarity between the two modalities to improve recognition performance. Experimental results show that MMAE outperforms existing strong models for pest identification, reaching an accuracy of 98.12%, 1.61 percentage points higher than the current state-of-the-art MAE method. This work shows that introducing textual information can help the visual encoder capture agricultural pest features at a finer granularity, providing a methodological reference for agricultural pest recognition under few-shot conditions.
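The abstract names two mechanisms without giving implementation detail: MAE-style masked self-supervised pre-training of the image encoder, and fusion of a textual embedding with the image features. The sketch below illustrates both in PyTorch under stated assumptions: `random_masking` follows the published MAE masking recipe (He et al., 2022), while `LateFusionHead`, its dimensions, class count, and the concatenation-based fusion are hypothetical stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking (He et al., 2022): keep a random subset of
    patch tokens; a decoder later reconstructs the masked ones."""
    B, N, D = tokens.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)         # per-patch noise
    ids_shuffle = torch.argsort(noise, dim=1)              # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)        # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)          # 1 = masked patch
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)              # back to patch order
    return kept, mask, ids_restore

class LateFusionHead(nn.Module):
    """Hypothetical fusion head: projects pooled image features and a text
    embedding into a shared space and classifies their concatenation.
    All dimensions and the fusion strategy are illustrative assumptions."""
    def __init__(self, img_dim=768, txt_dim=512, hidden=512, num_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, img_feat, txt_feat):
        v = self.img_proj(img_feat)                        # (B, hidden)
        t = self.txt_proj(txt_feat)                        # (B, hidden)
        return self.classifier(torch.cat([v, t], dim=-1))

# Toy usage: 196 patch tokens (14x14 grid) per image, 75% masked for pre-training.
patches = torch.randn(4, 196, 768)
kept, mask, ids_restore = random_masking(patches)
print(kept.shape)      # torch.Size([4, 49, 768]): only visible patches are encoded

head = LateFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)    # torch.Size([4, 10])
```

Masking 75% of patches keeps self-supervised pre-training cheap and forces the encoder to learn semantic structure from few images, which is why this family of methods suits few-shot settings.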