In forestry ecology, image data capture factual information, while the literature is rich in expert knowledge. Text from the literature can provide expert-level annotations for images, and the visual information in images naturally serves as a clustering center for the textual corpus. However, both images and literature are large, rapidly growing, unstructured datasets of heterogeneous modalities. To address this challenge, we propose cross-modal embedding clustering, a method that parameterizes these datasets using a deep learning model and relatively few annotated samples. The method makes it possible to retrieve relevant factual information and expert knowledge from a database of images and literature through question answering. Specifically, we align images and literature across modalities using a pair of encoders, fuse the aligned cross-modal information, and feed the result into an autoregressive generative language model for question answering with user feedback. Experiments show that this cross-modal clustering method improves image recognition, cross-modal retrieval, and cross-modal question-answering models. On standardized tasks in public datasets it achieves superior performance in all three, notably a 21.94% improvement on the cross-modal question-answering task of the ScienceQA dataset, which validates the approach. In essence, our method targets cross-modal information fusion, combining perspectives from multiple tasks and exploiting cross-modal representation clustering of images and text. It thereby addresses both the interdisciplinary complexity of forestry ecology literature and the parameterization of unstructured heterogeneous data that capture species diversity in conservation images.
Building on this foundation, the method leverages large-scale data to provide an intelligent research assistant for forestry ecological studies at larger temporal and spatial scales.
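To make the idea of images acting as clustering centers for text more concrete, the following is a minimal sketch and not the implementation described here. It assumes that the pair of encoders has already mapped images and text passages into one shared embedding space, and that cosine similarity is the comparison metric; the abstract specifies neither. The embeddings are hand-made toy vectors, and all names (`assign_texts_to_images`, `retrieve`, the `img_*` and `txt_*` identifiers) are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def assign_texts_to_images(image_embs, text_embs):
    """Group each text under the image whose embedding is most similar to it.

    Each image embedding plays the role of a cluster center (assumption:
    both modalities already live in one aligned embedding space).
    """
    clusters = {image_id: [] for image_id in image_embs}
    for text_id, text_vec in text_embs.items():
        nearest = max(image_embs, key=lambda i: cosine(image_embs[i], text_vec))
        clusters[nearest].append(text_id)
    return clusters


def retrieve(query_vec, text_embs, k=2):
    """Return the ids of the k texts most similar to a query embedding."""
    ranked = sorted(text_embs, key=lambda t: cosine(query_vec, text_embs[t]),
                    reverse=True)
    return ranked[:k]


# Toy, hand-made embeddings standing in for encoder outputs (hypothetical).
image_embs = {
    "img_pine": [1.0, 0.1, 0.0],
    "img_owl": [0.0, 0.2, 1.0],
}
text_embs = {
    "txt_conifer_growth": [0.9, 0.2, 0.1],
    "txt_needle_blight": [0.8, 0.0, 0.2],
    "txt_raptor_nesting": [0.1, 0.1, 0.9],
}

clusters = assign_texts_to_images(image_embs, text_embs)
top_for_owl = retrieve(image_embs["img_owl"], text_embs, k=1)
print(clusters)
print(top_for_owl)
```

In a full system the retrieved passages, together with the image features, would be passed to the language model as context for answering a question; that step is omitted here because it depends on model details the abstract does not give.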