Abstract

Acquiring data in some fields is difficult, and data collection often yields imbalanced or scarce datasets. These problems make text classification models more prone to overfitting and to bias toward a particular category. Generating a large and effective dataset to improve model performance has therefore become an important research topic, and data augmentation is one of the fastest and most effective ways to do so. This study proposes a novel data augmentation method for text classification based on topic relevance. First, a BERT model is applied to generate a semantic vector for each text, and text similarity analysis is performed within each category to measure the semantic similarity between the texts of the limited, scarce dataset. Texts that are highly correlated with other texts in the same category are then extracted, because high mutual correlation implies that the texts most likely share a relevant topic. Keyword extraction is performed on these highly correlated texts, and the resulting keywords are shuffled and rejoined to generate a large volume of new, high-quality augmented data. By calibrating the amount of augmented data generated for each category according to its degree of imbalance, the augmented text data can adjust how the categories are represented in the dataset. Overall, the experimental results indicate that, at some computational cost, a significant increase in augmented data not only alleviates the effect of imbalanced datasets but also increases text classification accuracy when data are scarce.
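
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: a bag-of-words vector stands in for the BERT semantic vector so that the example runs without external models, a simple frequency count stands in for the keyword extraction step, and the similarity threshold, the number of keywords, and the rule of topping up each category to the size of the largest one are assumptions chosen for illustration. All function names are hypothetical.

```python
import math
import random
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not taken from the paper.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def embed(text):
    """Stand-in for a BERT semantic vector: sparse bag-of-words counts."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_relevant(texts, threshold=0.3):
    """Keep texts whose mean similarity to the other texts in the
    same category reaches the (assumed) threshold."""
    vecs = [embed(t) for t in texts]
    kept = []
    for i, text in enumerate(texts):
        sims = [cosine(vecs[i], vecs[j]) for j in range(len(texts)) if j != i]
        if sims and sum(sims) / len(sims) >= threshold:
            kept.append(text)
    return kept


def extract_keywords(texts, top_k=8):
    """Frequency-based stand-in for the keyword extraction step."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return [word for word, _ in counts.most_common(top_k)]


def augment_category(texts, n_new, threshold=0.3, top_k=8, seed=0):
    """Shuffle and rejoin keywords from the most relevant texts."""
    rng = random.Random(seed)
    keywords = extract_keywords(select_relevant(texts, threshold), top_k)
    augmented = []
    for _ in range(n_new if keywords else 0):
        shuffled = keywords[:]
        rng.shuffle(shuffled)
        augmented.append(" ".join(shuffled))
    return augmented


def augment_dataset(data, **kwargs):
    """data maps label -> list of texts. Assumed calibration rule:
    top up every category to the size of the largest one."""
    target = max(len(texts) for texts in data.values())
    return {
        label: augment_category(texts, target - len(texts), **kwargs)
        for label, texts in data.items()
    }


if __name__ == "__main__":
    data = {
        "sports": [
            "the team won the football match",
            "the football team lost the match",
            "a football match ended in a draw",
            "the coach praised the football team",
        ],
        "finance": [
            "the bank raised interest rates on loans",
            "interest rates at the bank affect loans",
        ],
    }
    for label, new_texts in augment_dataset(data).items():
        print(label, new_texts)
```

In this sketch the minority category receives as many synthetic texts as it needs to match the majority category, while the majority category receives none. Swapping `embed` for a real BERT encoder and `extract_keywords` for a dedicated keyword extractor would bring the sketch closer to the method the abstract describes.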
