Cross-modal retrieval for rice leaf diseases is crucial for disease prevention, providing agricultural experts with data-driven decision support to counter disease threats and safeguard rice production. To overcome the limitations of current crop leaf disease retrieval frameworks, we focused on four common rice leaf diseases and established the first cross-modal rice leaf disease retrieval dataset (CRLDRD). We introduced cross-modal retrieval to the domain of rice leaf disease retrieval and proposed FHTW-Net, a framework for rice leaf disease image-text retrieval. To address the challenge of matching diverse image categories with complex text descriptions during retrieval, we first employed ViT and BERT to extract fine-grained image and text feature sequences enriched with contextual information. Subsequently, two-way mixed self-attention (TMS) was introduced to enhance both feature sequences and uncover important semantic information in the two modalities. We then developed a false-negative elimination-hard negative mining (FNE-HNM) strategy to facilitate in-depth exploration of semantic connections between modalities: likely false negatives are eliminated before the most challenging negative samples are mined to constrain the model through the triplet loss function. Finally, we introduced the warm-up bat algorithm (WBA) for learning rate optimization, which improves the model's convergence speed and accuracy. Experimental results demonstrated that FHTW-Net outperforms state-of-the-art models: in image-to-text retrieval, it achieved R@1, R@5, and R@10 accuracies of 83.5%, 92.0%, and 94.0%, respectively, while in text-to-image retrieval, it achieved 82.5%, 98.0%, and 98.5%. FHTW-Net offers advanced technical support and algorithmic guidance for cross-modal retrieval of rice leaf diseases.
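To illustrate the idea behind FNE-HNM (this is a minimal sketch, not the paper's implementation), the snippet below builds a bidirectional triplet loss over a batch of L2-normalized image and text embeddings: off-diagonal pairs whose similarity exceeds a threshold are treated as likely false negatives and masked out, and only the hardest surviving negative per anchor is mined. The names `fne_hnm_triplet_loss`, `fn_threshold`, and the margin value are illustrative assumptions.

```python
# Sketch of a false-negative-elimination + hard-negative-mining triplet
# loss, assuming row i of each modality is a matched (positive) pair.
# fn_threshold and margin are placeholder values, not the paper's.
import torch

def fne_hnm_triplet_loss(img_emb, txt_emb, margin=0.2, fn_threshold=0.9):
    sim = img_emb @ txt_emb.t()            # (B, B) cosine similarities
    pos = sim.diag()                       # similarity of matched pairs

    # Eliminate false negatives: mask the diagonal plus any off-diagonal
    # pair whose similarity is suspiciously high.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    false_neg = (sim > fn_threshold) & ~eye
    masked = sim.masked_fill(eye | false_neg, float('-inf'))

    # Hard negative mining: keep only the hardest remaining negative
    # per anchor, in both retrieval directions.
    hard_i2t = masked.max(dim=1).values    # hardest text per image
    hard_t2i = masked.max(dim=0).values    # hardest image per text

    loss_i2t = torch.clamp(margin + hard_i2t - pos, min=0)
    loss_t2i = torch.clamp(margin + hard_t2i - pos, min=0)
    return (loss_i2t + loss_t2i).mean()

img = torch.nn.functional.normalize(torch.randn(4, 256), dim=1)
txt = torch.nn.functional.normalize(torch.randn(4, 256), dim=1)
print(fne_hnm_triplet_loss(img, txt))
```

Masking before the max ensures an eliminated false negative can never be selected as the hard negative, which is the point of combining the two steps.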
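The warm-up component of a schedule such as WBA can be sketched with a standard linear warm-up followed by cosine decay, as below; the bat-algorithm search over learning rates that WBA adds is not reproduced here, and all hyperparameters (warm-up steps, total steps, peak rate) are placeholders.

```python
# Minimal linear warm-up + cosine decay schedule (illustrative only;
# WBA additionally tunes the learning rate with a bat algorithm).
import math
import torch

def make_warmup_cosine(optimizer, warmup_steps=500, total_steps=10_000):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                  # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(8, 2)              # stand-in for the retrieval model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = make_warmup_cosine(optimizer)
# In the training loop, call optimizer.step() then scheduler.step().
```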