Data Augmentation Based on Topic Relevance to Enhance Text Classification in Scarcity of Training Data

Ming-Xian Zou, Chih-Hsueh Lin, Chin-Shiuh Shieh, Mong-Fong Horng, Chun-Chih Lo

doi:10.1007/978-981-99-0105-0_31

Abstract

Acquiring data in some fields is difficult, and data collection often runs into imbalanced datasets and data scarcity. These problems make text classification models more prone to overfitting and to bias toward a particular category. Generating a large and effective dataset to improve model performance has therefore become an important research topic, and data augmentation is one of the fastest and most effective ways to do so. This study proposes a novel data augmentation method based on topic relevance for text classification. First, a BERT model is applied to generate a semantic vector for each text, and text similarity analysis is performed within each category to determine the semantic similarity between the texts of the limited, scarce dataset. Texts that are highly correlated with other texts in the same category are then extracted, because high mutual correlation implies that their topics are most likely relevant to one another. Keyword extraction is performed on these most relevant texts, and the resulting keywords are shuffled and rejoined to generate a large volume of new, high-quality augmented data. By calibrating the amount of newly generated data according to the degree of balance in each category, the augmented text data may alter the balance of category representation. Overall, the experimental results indicate that, with some computational effort, a significant increase in augmented data can not only alleviate the effect of imbalanced datasets but also increase text classification accuracy under data scarcity.
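The pipeline described in the abstract (embed, compare within a category, pick the most mutually similar texts, extract keywords, shuffle and rejoin, then balance the categories) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation. To stay self-contained it makes several assumptions that are not in the paper: a bag-of-words count vector stands in for the BERT semantic vector, keyword extraction is plain word frequency with a small hypothetical stopword list, and "calibrating" is read as topping up each category to the size of the largest one. All function names are hypothetical.

```python
import math
import random
from collections import Counter

# Hypothetical, minimal stopword list (assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "for", "with"}

def embed(text):
    """Stand-in for a BERT semantic vector: a bag-of-words count vector."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def most_relevant(texts, top_k):
    """Rank texts by mean similarity to the other texts in the same category."""
    vecs = [embed(t) for t in texts]
    scores = []
    for i, u in enumerate(vecs):
        others = [cosine(u, v) for j, v in enumerate(vecs) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    ranked = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in ranked[:top_k]]

def keywords(texts, n):
    """Frequency-based keyword extraction (a simplification of the paper's step)."""
    counts = Counter(w for t in texts for w in embed(t).elements())
    return [w for w, _ in counts.most_common(n)]

def augment(texts, n_new, top_k=2, n_keywords=6, seed=0):
    """Create n_new texts by shuffling and rejoining keywords of the most relevant texts."""
    rng = random.Random(seed)
    kws = keywords(most_relevant(texts, top_k), n_keywords)
    out = []
    for _ in range(n_new):
        shuffled = kws[:]
        rng.shuffle(shuffled)
        out.append(" ".join(shuffled))
    return out

def balance(dataset, seed=0):
    """Top up every category to the size of the largest one (assumed calibration rule)."""
    target = max(len(v) for v in dataset.values())
    return {
        label: texts + augment(texts, target - len(texts), seed=seed)
        for label, texts in dataset.items()
    }

if __name__ == "__main__":
    data = {
        "sports": ["team wins football match", "football team scores goal",
                   "coach praises football team", "match ends in draw"],
        "finance": ["bank raises interest rates", "interest rates hit bank profits"],
    }
    for label, texts in balance(data).items():
        print(label, len(texts), texts[-1])
```

In a closer reproduction, `embed` would return vectors from a pretrained BERT model and `keywords` would use a dedicated keyword extractor; the ranking, shuffling, and balancing logic would stay structurally the same.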

