Abstract
State-of-the-art performance on natural language processing tasks is achieved by supervised learning, specifically by fine-tuning pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers). As models become more accurate, the pre-training corpora behind them keep growing larger. However, very few studies have explored how the pre-training corpus should be selected. This paper therefore proposes a data enhancement-based domain pre-training method. First, a pre-training task and a downstream fine-tuning task are trained jointly to alleviate the catastrophic forgetting caused by existing classical pre-training methods. Then, guided by the hard-to-classify texts identified from the downstream task's feedback, the pre-training corpus is reconstructed by selecting similar texts from it. Learning from the reconstructed corpus deepens the model's understanding of hard-to-determine text expressions, thereby strengthening its ability to extract features from domain texts. Without any pre-processing of the pre-training corpus, experiments are conducted on two tasks: named entity recognition (NER) and text classification (CLS). The results show that learning from the domain corpus selected by the proposed method supplements the model's understanding of domain-specific information and improves on the basic pre-trained model, achieving the best results among the benchmark methods compared.
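To make the corpus-reconstruction step more concrete, the sketch below illustrates one plausible way to select pre-training texts similar to hard-to-classify downstream examples, using mean-pooled BERT embeddings and cosine similarity. This is a minimal, hedged illustration, not the authors' implementation: the function names (`embed`, `select_similar`), the choice of `bert-base-uncased`, and the max-over-hard-examples scoring rule are all assumptions for exposition.

```python
# Hypothetical sketch of hard-example-driven corpus selection.
# Not the paper's code; names and scoring rule are illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT embeddings for a list of texts."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, H)

def select_similar(hard_examples, corpus, top_k=100):
    """Return the top_k corpus texts most similar to any hard example."""
    hard_vecs = torch.nn.functional.normalize(embed(hard_examples), dim=-1)
    corpus_vecs = torch.nn.functional.normalize(embed(corpus), dim=-1)
    # Score each corpus text by its best cosine match against the hard examples.
    scores = (corpus_vecs @ hard_vecs.T).max(dim=1).values
    top = scores.topk(min(top_k, len(corpus))).indices
    return [corpus[i] for i in top]
```

In this reading, the selected texts would then be used for an additional round of domain pre-training alongside the downstream fine-tuning objective.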