With advances in artificial intelligence and deep learning, crop disease recognition methods use deep networks to learn features automatically, but they depend on large volumes of training data. Because data are scarce in agriculture, few-shot learning (FSL) and multi-modal learning have become focal points of research. However, existing methods are either confined to a single modality or insufficiently exploit cross-modal features. To address this, we propose a multi-modal contrastive learning approach that integrates images and text for small-sample recognition. The method, termed ITIMCA, combines CLIP multi-modal pre-training with cross-attention. Experiments on cassava leaf disease recognition under natural conditions show that the proposed model achieves an accuracy of 78.00%, a precision of 88.48%, a recall of 80.00%, and an F1-score of 79.00% on the cassava leaf disease identification and classification dataset. These results suggest that the proposed network effectively identifies cassava leaf diseases.