ABSTRACT The direct application of large language models (LLMs) to specific domain tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundational models on extensive domain-specific corpora can infuse these models with domain-specific knowledge and enhancing their ability of solving domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking the crucial optimization of the pre-pretraining stage, which significantly impacts both model performance and training efficiency. This paper introduces PRE -PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to enhance the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs advanced techniques in data recipe, data cleaning, vocabulary expansion, and embedding initialization. These techniques are implemented to optimize both the composition and quality of the training data, and to enhance understanding of domain terminologies and concepts, as well as improve token embedding initialization. Utilizing the geoscience domain as a case study, this paper applies PreparedLLM for the domain specialization of the Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM enhances model convergence speed, training speed, inference speed, the text volume of the context window, and overall performance in domain specialization. The utilization of PreparedLLM in developing domain-specific LLMs has significantly increased performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs.
Read full abstract