Pre-training Framework Research Articles

ABSTRACT The direct application of large language models (LLMs) to domain-specific tasks frequently encounters challenges due to the scarcity of domain data, variations in domain semantics, and the complexity of domain knowledge. Further pretraining of advanced foundation models on extensive domain-specific corpora can infuse these models with domain-specific knowledge and enhance their ability to solve domain-specific tasks. However, the development of most domain-specific models focuses primarily on collecting large-scale domain data, often overlooking the optimization of the pre-pretraining stage, which significantly affects both model performance and training efficiency. This paper introduces the PRE-PretrAining FRamEwork for Domain-specific Large Language Models (PreparedLLM), a framework designed to improve the pre-pretraining stage for domain specialization of LLMs. PreparedLLM employs advanced techniques in data recipe, data cleaning, vocabulary expansion, and embedding initialization. These techniques optimize the composition and quality of the training data, improve the model's understanding of domain terminology and concepts, and improve token embedding initialization. Using the geoscience domain as a case study, this paper applies PreparedLLM to the domain specialization of Llama, a widely recognized general-purpose LLM. Experimental results demonstrate that PreparedLLM improves model convergence speed, training speed, inference speed, the volume of text that fits in the context window, and overall performance in domain specialization. Using PreparedLLM to develop domain-specific LLMs significantly increases performance while reducing both time and resource investment. The case study provides valuable insights into the development of domain-specific LLMs.
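The abstract names vocabulary expansion and embedding initialization but does not say how new token embeddings are initialized. As a rough illustration only, the sketch below uses one common heuristic: each new domain token starts from the mean of the embeddings of the sub-tokens it previously split into. The function names, the toy vocabulary, and the greedy tokenizer are all hypothetical and are not taken from PreparedLLM.

```python
import numpy as np


def greedy_tokenize(word, vocab):
    """Toy longest-match tokenizer (hypothetical stand-in for a real one)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit <unk> and skip one character.
            pieces.append("<unk>")
            i += 1
    return pieces


def init_new_token_embeddings(embeddings, vocab, new_tokens):
    """Return a copy of `embeddings` with one extra row per new token.

    Assumption: each new row is the mean of the rows of the sub-tokens
    the token splits into under the ORIGINAL vocabulary.
    """
    if not new_tokens:
        return embeddings.copy()
    rows = []
    for token in new_tokens:
        piece_ids = [vocab[p] for p in greedy_tokenize(token, vocab)]
        rows.append(embeddings[piece_ids].mean(axis=0))
    return np.vstack([embeddings, np.array(rows)])


# Toy example: a 4-entry vocabulary with 2-dimensional embeddings.
vocab = {"geo": 0, "chem": 1, "istry": 2, "<unk>": 3}
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]])
expanded = init_new_token_embeddings(embeddings, vocab, ["geochemistry"])
```

Because a domain term such as "geochemistry" becomes a single token after expansion, the same text occupies fewer positions, which is consistent with the abstract's claim that more text fits in the context window.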

Finding experts is essential in Community Question Answering (CQA) platforms, as it enables questions to be routed effectively to users who can provide relevant answers. The key is to learn personalized expert representations from the questions each expert has answered, and to match them accurately with target questions. Recently, Pre-trained Language Models (PLMs) have gained significant traction due to their impressive ability to comprehend textual data, and they are widely used across various domains. Some preliminary works have explored the usability of PLMs in expert finding, for example by pre-training expert or question representations. However, these models usually learn pure text representations of experts from their histories, disregarding personalized and fine-grained expert modeling. To alleviate this, we present a personalized pre-training and fine-tuning paradigm that can effectively learn expert interest and expertise simultaneously. Specifically, in our pre-training framework, we integrate the historical answered questions of one expert with one target question and regard the result as a candidate-aware expert-level input unit. Then, we fuse expert IDs into the pre-training to guide the model toward personalized expert representations, which helps capture the unique characteristics and expertise of each individual expert. Additionally, we design two pre-training tasks: 1) a question-level masked language model task that learns the relatedness between histories, enabling the modeling of question-level expert interest; 2) a vote-oriented task that captures question-level expert expertise by predicting the vote score the expert would receive. Through our pre-training framework and tasks, our approach can holistically learn expert representations that include both interests and expertise. Our method has been extensively evaluated on six real-world CQA datasets, and the experimental results consistently demonstrate the superiority of our approach over competitive baseline methods.
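To make the input construction concrete, here is a minimal sketch of how a candidate-aware expert-level input unit and a question-level mask might be assembled. It is one plausible reading of the abstract, not the authors' implementation: the special tokens, the `[EXPERT_n]` ID token, whitespace tokenization, and the choice to mask every token of one historical question are all assumptions made for illustration.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"


def build_input_unit(expert_id, history, target):
    """Concatenate an expert's answered questions with one target question.

    Returns the token list and the (start, end) span of each historical
    question. The expert ID token is a hypothetical way to inject the ID.
    """
    tokens = [CLS, f"[EXPERT_{expert_id}]"]
    spans = []
    for question in history:
        start = len(tokens)
        tokens.extend(question.split())  # assumption: whitespace tokens
        spans.append((start, len(tokens)))
        tokens.append(SEP)
    tokens.extend(target.split())
    tokens.append(SEP)
    return tokens, spans


def mask_one_question(tokens, spans, rng):
    """Mask every token of one randomly chosen historical question.

    Labels hold the original token at masked positions and None elsewhere.
    Raises IndexError if the expert has no history (spans is empty).
    """
    start, end = rng.choice(spans)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in range(start, end):
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


history = ["how to sort a list", "what is a closure"]
tokens, spans = build_input_unit(7, history, "how to reverse a list")
masked, labels = mask_one_question(tokens, spans, random.Random(0))
```

In this reading, the model would recover the masked question from the remaining histories and the target question, which is how relatedness between histories could be learned. The vote-oriented task would add a separate regression head over the same input and is not shown here.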

Articles published on Pre-training Framework

Multi-modal representation learning in retinal imaging using self-supervised learning for enhanced clinical predictions

MF-GSLAE: A Multi-Factor User Representation Pre-training Framework for Dual-Target Cross-Domain Recommendation

SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

CMSSP: A Contrastive Mass Spectra-Structure Pretraining Model for Metabolite Identification.

PreparedLLM: effective pre-pretraining framework for domain-specific large language models

PEPT: Expert Finding Meets Personalized Pre-training

Enhancing Task-Oriented Dialogue Modeling through Coreference-Enhanced Contrastive Pre-Training

UniChest: Conquer-and-Divide Pre-Training for Multi-Source Chest X-Ray Classification.

A self-supervised framework for computer-aided arrhythmia diagnosis

A cross-temporal contrastive disentangled model for ancient Chinese understanding

A Task-Generic High-Performance Unsupervised Pre-Training Framework for ECG

MolPLA: a molecular pretraining framework for learning cores, R-groups and their linker joints.

Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA

Graph Contrastive Multi-view Learning: A Pre-training Framework for Graph Classification

Enhancing Molecular Property Prediction through Task-Oriented Transfer Learning: Integrating Universal Structural Insights and Domain-Specific Knowledge.

A Multi-view Molecular Pre-training with Generative Contrastive Learning.

X²-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks.

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

Divide and Conquer: Hybrid Pre-training for Person Search

Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception
