Automated Indonesian Text Augmentation with Web-Based Application Using Flask Framework

Iftitah Athiyyah Rahma, Lya Hulliyyatus Suadaa

doi:10.34123/icdsos.v2023i1.324

Abstract

Text classification is one of the fundamental tasks in natural language processing (NLP). In practice, the data and resources available for text classification are limited, and labeled data are often imbalanced. Imbalanced data degrade model performance because the model focuses on the majority labels and tends to classify only those correctly, whereas in some cases it is more important for the minority label to be predicted correctly. Accuracy alone therefore cannot describe the true performance of the model. To overcome this, an oversampling approach is used; for text data, oversampling is known as text augmentation. However, NLP resources for the Indonesian language are still limited, especially for text augmentation. This research therefore develops a web application that augments Indonesian text automatically. The application was built using the prototype method. Users upload a dataset as input and select their preferred augmentation techniques, and the application then augments every text in the dataset automatically. The output is the same dataset file as the input, with an additional column containing the synthetic text generated by the application.
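
The input and output contract described above (a dataset goes in, and the same dataset comes out with one extra column of synthetic text) can be illustrated with a minimal Python sketch. It is not the application's actual code. The function names `augment_csv` and `random_swap`, the column name `augmented_text`, and the choice of random word swapping as the technique are all assumptions made for illustration; the abstract does not say which techniques the application offers or how the output column is named.

```python
import csv
import io
import random


def random_swap(text, rng):
    """Swap two randomly chosen words (a generic technique, assumed for illustration)."""
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap in a one-word text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


# Hypothetical registry mapping a user-selected technique name to a function.
TECHNIQUES = {"random_swap": random_swap}


def augment_csv(csv_text, text_column, technique, seed=0):
    """Return the input CSV with an extra 'augmented_text' column (assumed name)."""
    rng = random.Random(seed)
    augment = TECHNIQUES[technique]
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + ["augmented_text"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Original columns are kept untouched; only the new column is added.
        row["augmented_text"] = augment(row[text_column], rng)
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "text,label\nsaya suka makan nasi goreng,positive\n"
    print(augment_csv(sample, "text", "random_swap"))
```

In a web application, a function of this shape would sit behind the upload handler: the uploaded file's contents are passed in, and the returned text is offered back to the user as a download.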
