Abstract

Deep learning owes much of its strong performance across a wide range of tasks to large amounts of high-quality labeled data (instances). However, constructing instances is time-consuming and labor-intensive, which poses a major challenge for natural language processing (NLP) tasks in many specialized fields. For example, the medical question-matching dataset CHIP contains only 2.7% as many instances as the general-domain dataset LCQMC, and performance on it reaches only 79.19% of the general-domain level. To cope with this scarcity, practitioners commonly turn to data augmentation, robust learning, and pre-trained models; among these, text data augmentation and pre-trained models are the two most widely used in NLP. However, recent experiments have shown that applying generic data augmentation techniques to pre-trained models may yield limited or even negative effects. To understand why, this paper applies three types of data quality assessment methods at two levels, label-independent and label-dependent, and then selects, filters, and transforms the outputs of three text data augmentation methods accordingly. Experiments in both the general and specialized (medical) domains show that analyzing, selecting/filtering, and transforming augmented instances can effectively improve the performance of pre-trained models on intent understanding and question matching.
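The select/filter step described above can be sketched concretely. The following is a minimal, hypothetical Python illustration, not the paper's actual pipeline, datasets, or augmentation methods: augmented variants of a question are generated by simple synonym replacement, then scored with one possible label-dependent quality measure (the probability a trained classifier assigns to the original label), and only variants above a threshold are kept. The synonym table, toy data, and threshold are all assumptions for illustration.

    import random

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labeled questions: 1 = medication question, 0 = symptom question.
    texts = [
        "what medicine treats a headache",
        "is this drug safe to take daily",
        "why does my head hurt at night",
        "what causes sudden stomach pain",
    ]
    labels = [1, 1, 0, 0]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

    # Toy synonym table standing in for a real augmentation resource.
    SYNONYMS = {"medicine": ["drug", "medication"], "pain": ["ache", "discomfort"]}

    def augment(text, n=4):
        """Synonym-replacement augmentation (one simple augmentation method)."""
        variants = []
        for _ in range(n):
            tokens = text.split()
            idxs = [i for i, t in enumerate(tokens) if t in SYNONYMS]
            if idxs:
                i = random.choice(idxs)
                tokens[i] = random.choice(SYNONYMS[tokens[i]])
            variants.append(" ".join(tokens))
        return variants

    def label_dependent_score(text, label):
        """Probability the classifier assigns the original label to the
        augmented text -- one possible label-dependent quality measure."""
        return clf.predict_proba([text])[0][label]

    def filter_augmented(text, label, threshold=0.6):
        """Keep only augmented variants whose quality score clears the bar."""
        return [v for v in augment(text)
                if label_dependent_score(v, label) >= threshold]

    print(filter_augmented("what medicine treats a headache", 1))

A label-independent assessment (e.g., fluency or diversity of the augmented text) could be added as a second gate before the label-dependent check; the two-level structure mirrors the abstract's description.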
