Text Data Augmentation for the Korean Language

Dang Thanh Vu, Gwanghyun Yu, Jinyoung Kim, Chilwoo Lee

doi:10.3390/app12073425

Open Access

https://doi.org/10.3390/app12073425

Journal: Applied Sciences
Publication Date: Mar 28, 2022
Citations: 7
License type: CC BY 4.0

Affiliation: Chonnam National University

Abstract

Data augmentation (DA) is a widely used technique for reducing overfitting and improving the robustness of machine learning models by increasing the quantity and variety of the training data. Although data augmentation is essential in vision tasks, it is rarely applied to text datasets because it is less straightforward there. Some studies have addressed text data augmentation, but most of them target majority languages, such as English or French. There have been only a few studies on data augmentation for minority languages, e.g., Korean. This study fills the gap by evaluating several common data augmentation methods on Korean corpora with pre-trained language models. In short, we evaluate the performance of two text data augmentation approaches: text transformation and back translation. We compare these augmentations across Korean corpora on four downstream tasks: semantic textual similarity (STS), natural language inference (NLI), question duplication verification (QDV), and sentiment classification (STC). Compared to training without augmentation, applying text data augmentation yields performance gains of 2.24%, 2.19%, 0.66%, and 0.08% on the STS, NLI, QDV, and STC tasks, respectively.

Full Text