In natural language processing (NLP), three problems remain prominent: weak support for minority languages, especially Uyghur; the scarcity of Uyghur corpora in the agricultural domain; and the lack of lightweight large language models. This study proposes a method for constructing a lightweight, bilingual (Uyghur and Chinese) large language model specialized for the agricultural domain. Using a mixed Uyghur-Chinese training approach, we extracted Chinese text from agriculture-themed books in PDF format with optical character recognition (OCR), converted the Chinese corpus into a Uyghur corpus with a fast translation API, and constructed a mixed bilingual vocabulary. We then trained a parameterized Transformer model on the agricultural corpus in both Chinese and Uyghur. In addition, we introduced a context detection and fail-safe mechanism for the generated text. The resulting model supports bilingual reasoning in Uyghur and Chinese in the agricultural domain, with higher accuracy and a smaller model size. Our work addresses the scarcity of Uyghur agricultural corpora, mixed word segmentation and word-vector modeling for Uyghur agricultural text, model lightweighting and deployment, and the fragmented, irrelevant text that arises when knowledge is extracted from small-scale corpora. The lightweight design of the model reduces hardware requirements, facilitating deployment in resource-constrained environments. This advance promotes agricultural intelligence, supports the development of domain-specific applications (such as agriculture) and of language technology for minority languages (such as Uyghur), and contributes to rural revitalization.
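As a rough illustration of the mixed bilingual vocabulary step, the following minimal sketch builds one shared token-to-id table from parallel Chinese and Uyghur lines. The tokenization scheme (one token per CJK character, whitespace-separated words for everything else), the special tokens, and the names `tokenize`, `build_vocab`, and `encode` are all assumptions made for illustration; the abstract does not specify how the authors segment text or build their vocabulary.

```python
from collections import Counter

# Hypothetical special tokens; the abstract does not list any.
SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text):
    """Assumed scheme: each CJK character is its own token;
    any other run of non-space characters (e.g. a Uyghur word) is one token."""
    tokens, buf = [], []
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            if buf:  # flush the pending non-CJK run
                tokens.append("".join(buf))
                buf = []
            if is_cjk(ch):
                tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens


def build_vocab(zh_lines, ug_lines, min_freq=1):
    """Count tokens over both corpora and assign ids in one shared table."""
    counts = Counter()
    for line in list(zh_lines) + list(ug_lines):
        counts.update(tokenize(line))
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    # Most frequent first; ties broken by token text for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map text to ids, falling back to <unk> for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokenize(text)]
```

A shared table like this lets a single embedding matrix serve both languages, which is one plausible way to keep a bilingual model small. A production system would more likely use a learned subword tokenizer.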