Abstract

Labelling large numbers of training examples for text classification models is usually time‐consuming and complex. Data augmentation can automatically expand a dataset by transforming the original data; however, it may introduce semantic changes without corresponding changes to the labels, which reduces the effectiveness of the resulting classifiers. In this paper, we propose a data‐augmentation method called the virtual word insertion technique, which generates new sentences by randomly inserting virtual words into existing sentences. Two methods are used to construct the virtual word embedding: an unweighted average and a weighted average. Furthermore, a new weighting concept is proposed, the class deviation factor, which captures the correlation between words and classes; based on this concept, words are assigned different weights according to their class association. Experiments are conducted on five classification tasks, and ablation experiments explore the effects of the random operation and of the number of augmented sentences on classification results. The results show that our method improves classification performance and outperforms two other contrasting data‐augmentation methods in automatically augmenting the dataset. Compared with the raw datasets, the average accuracy improvement of our method is 3.5% on a small‐scale dataset and 1% on a large‐scale dataset.
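To make the idea concrete, the sketch below is a rough illustration of virtual word insertion, not the authors' implementation: it assumes a pretrained word‐embedding lookup (`embeddings`) and an optional per‐word weight dictionary standing in for the class deviation factor; all names and helper functions are hypothetical.

```python
# Illustrative sketch of virtual word insertion (assumed interfaces:
# `embeddings` is a dict token -> np.ndarray; `weights` optionally maps
# token -> float, e.g. a class-deviation-style weight).
import random
import numpy as np

def make_virtual_word(tokens, embeddings, weights=None):
    """Average the sentence's token embeddings into one virtual vector.

    If `weights` is None, an unweighted average is used; otherwise a
    weighted average with the given per-token weights.
    """
    vecs, ws = [], []
    for tok in tokens:
        if tok in embeddings:
            vecs.append(embeddings[tok])
            ws.append(1.0 if weights is None else weights.get(tok, 1.0))
    if not vecs:
        return None
    return np.average(np.stack(vecs), axis=0, weights=np.asarray(ws))

def insert_virtual_word(token_vecs, virtual_vec, n_insert=1):
    """Return a new embedded sentence with the virtual vector inserted
    at `n_insert` random positions; the original label is kept unchanged."""
    augmented = list(token_vecs)
    for _ in range(n_insert):
        pos = random.randint(0, len(augmented))
        augmented.insert(pos, virtual_vec)
    return augmented
```

Because the inserted vector is built from the sentence's own (class‐weighted) words, the augmented example stays close to the original meaning, which is why the label can be reused without modification.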
