Abstract

Text classification is a fundamental task in natural language processing (NLP). In practice, the data and resources available for text classification are limited, and labeled data is often imbalanced. Class imbalance degrades model performance because the model focuses on the majority label and tends to classify only that label correctly, whereas in some cases correct prediction of the minority label matters more. As a result, accuracy alone cannot describe the true performance of the model. Oversampling is one way to address this problem; for text data, oversampling is known as text augmentation. However, NLP resources for the Indonesian language are still limited, especially for text augmentation. This research therefore develops a web application that augments Indonesian text automatically. The application was built using the prototyping method. Users upload a dataset as input and select their preferred augmentation techniques, and the application then augments all of the text in the dataset automatically. The output is the same dataset file as the input, with an additional column containing the synthetic text generated by the application.
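
The input and output behavior described above can be illustrated with a minimal Python sketch. This is not the application's actual implementation: the abstract does not name the augmentation techniques or the file format, so the CSV format, the two techniques (random word swap and random word deletion), the function names, and the `augmented_text` column name are all assumptions made for illustration.

```python
import csv
import io
import random


def random_swap(text: str, rng: random.Random) -> str:
    """Swap two randomly chosen words (hypothetical technique, language-agnostic)."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def random_deletion(text: str, rng: random.Random, p: float = 0.2) -> str:
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else rng.choice(words)


# Assumed menu of techniques a user could choose from.
TECHNIQUES = {"swap": random_swap, "delete": random_deletion}


def augment_csv(csv_text: str, text_column: str, technique: str, seed: int = 0) -> str:
    """Return the same CSV with one extra column holding the synthetic text."""
    rng = random.Random(seed)
    augment = TECHNIQUES[technique]
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep every original column and append the new one (assumed name).
    fieldnames = list(reader.fieldnames or []) + ["augmented_text"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["augmented_text"] = augment(row[text_column], rng)
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "text,label\n"
        "saya suka makan nasi goreng,positif\n"
        "film ini buruk,negatif\n"
    )
    print(augment_csv(sample, text_column="text", technique="swap"))
```

In this sketch the original columns are left untouched, so the synthetic rows can later be appended to the minority class without losing the source text or its label.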
