Abstract

In the field of Natural Language Processing (NLP), the lack of support for minority languages, especially Uyghur, the scarcity of Uyghur language corpora in the agricultural domain, and the lightweight nature of large language models remain prominent issues. This study proposes a method for constructing a bilingual (Uyghur and Chinese) lightweight specialized large language model for the agricultural domain. By utilizing a mixed training approach of Uyghur and Chinese, we extracted Chinese corpus text from agricultural-themed books in PDF format using OCR (Optical Character Recognition) technology, converted the Chinese text corpus into a Uyghur corpus using a rapid translation API, and constructed a bilingual mixed vocabulary. We applied the parameterized Transformer model algorithm to train the model for the agricultural domain in both Chinese and Uyghur. Furthermore, we introduced a context detection and fail-safe mechanism for the generated text. The constructed model possesses the ability to support bilingual reasoning in Uyghur and Chinese in the agricultural domain, with higher accuracy and a smaller size that requires less hardware. It (our work) addresses issues such as the scarcity of Uyghur corpora in the agricultural domain, mixed word segmentation and word vector modeling in Uyghur for widespread agricultural languages, model lightweighting and deployment, and the fragmentation of non-relevant texts during knowledge extraction from small-scale corpora. The lightweight design of the model reduces hardware requirements, facilitating deployment in resource-constrained environments. This advancement promotes agricultural intelligence, aids in the development of specific applications and minority languages (such as agriculture and Uyghur), and contributes to rural revitalization.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.