Abstract

Deep learning owes much of its strong performance across a wide range of tasks to large amounts of high-quality labeled data (instances). However, constructing instances is time-consuming and labor-intensive, which poses a major challenge for natural language processing (NLP) tasks in many specialized fields. For example, the medical question-matching dataset CHIP contains only 2.7% as many instances as the general-domain dataset LCQMC, and performance on it reaches only 79.19% of the general-domain level. To cope with this scarcity, practitioners commonly turn to data augmentation, robust learning, and pre-trained models; among these, text data augmentation and pre-trained models are the two most widely used in NLP. However, recent experiments have shown that applying generic data augmentation techniques to pre-trained models may yield limited or even negative effects. To understand why, this paper applies three types of data quality assessment methods at two levels, label-independent and label-dependent, and then selects, filters, and transforms the outputs of three text data augmentation methods accordingly. Experiments in both the general and specialized (medical) domains show that analyzing, selecting/filtering, and transforming augmented instances can effectively improve the performance of pre-trained models on intent understanding and question matching.
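The select/filter step described above can be sketched concretely. The following is a minimal, hypothetical Python illustration, not the paper's actual pipeline, datasets, or augmentation methods: augmented variants of a question are generated by simple synonym replacement, then scored with one possible label-dependent quality measure (the probability a trained classifier assigns to the original label), and only variants above a threshold are kept. The synonym table, toy data, and threshold are all assumptions for illustration.

    import random

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labeled questions: 1 = medication question, 0 = symptom question.
    texts = [
        "what medicine treats a headache",
        "is this drug safe to take daily",
        "why does my head hurt at night",
        "what causes sudden stomach pain",
    ]
    labels = [1, 1, 0, 0]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

    # Toy synonym table standing in for a real augmentation resource.
    SYNONYMS = {"medicine": ["drug", "medication"], "pain": ["ache", "discomfort"]}

    def augment(text, n=4):
        """Synonym-replacement augmentation (one simple augmentation method)."""
        variants = []
        for _ in range(n):
            tokens = text.split()
            idxs = [i for i, t in enumerate(tokens) if t in SYNONYMS]
            if idxs:
                i = random.choice(idxs)
                tokens[i] = random.choice(SYNONYMS[tokens[i]])
            variants.append(" ".join(tokens))
        return variants

    def label_dependent_score(text, label):
        """Probability the classifier assigns the original label to the
        augmented text -- one possible label-dependent quality measure."""
        return clf.predict_proba([text])[0][label]

    def filter_augmented(text, label, threshold=0.6):
        """Keep only augmented variants whose quality score clears the bar."""
        return [v for v in augment(text)
                if label_dependent_score(v, label) >= threshold]

    print(filter_augmented("what medicine treats a headache", 1))

A label-independent assessment (e.g., fluency or diversity of the augmented text) could be added as a second gate before the label-dependent check; the two-level structure mirrors the abstract's description.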
