Iterative Translation-Based Data Augmentation Method for Text Classification Tasks

Sangwon Lee, Ling Liu, Wonik Choi

doi:10.1109/access.2021.3131446

Abstract

A well-known limitation of existing rule-based text augmentation is that it cannot be applied to other languages because it depends on grammatical and structural characteristics. Moreover, most text Generative Adversarial Networks (GANs) are unstable in training due to inefficient generator optimization and rely on maximum likelihood pre-training. This paper addresses these problems by proposing Iterative Translation-based Data Augmentation (ITDA), a novel augmentation method with a Sentence Generator (SG) and a Sentence Discriminator (SD). This paper makes three original contributions. First, the ITDA SG is designed to provide universal multiple-language support by generating comprehensive augmented sentences through serial and parallel iterations of an existing translator, such as Google Translate. Second, because the quality of the generated sentences varies with the translation combination and the type of sentence, the ITDA uses a discriminator based on a text classifier to select high-quality augmented data. Third, the ITDA can perform sentence augmentation for 109 different languages using discriminators based on text classifiers trained for a specific language or type of data set. Extensive experiments are conducted to evaluate the efficacy of the ITDA using a Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), CNN-BiLSTM, and self-attention. The results demonstrate that when the ITDA is applied to 480 sentence classification tasks, the average accuracy increases by 4.24%.

Highlights

Many organizations collect increasing amounts of data to build complex data analytics, machine learning, and Artificial Intelligence models [1]
This paper proposes a novel Iterative Translation-based Data Augmentation (ITDA) method that can be applied to multiple languages
We develop a deep learning model using a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) as the ITDA Sentence Discriminator (SD)

Summary

INTRODUCTION

Many organizations collect increasing amounts of data to build complex data analytics, machine learning, and Artificial Intelligence models [1]. Machine learning models trained on smaller data sets often do not perform well enough. Some previous studies address this limitation in text augmentation by focusing on rule-based methods that are highly language-dependent [5, 6, 7]. Such data augmentation methods have limited utility for other languages. This paper proposes a novel Iterative Translation-based Data Augmentation (ITDA) method that can be applied to multiple languages. By intelligently combining a Sentence Generator (SG) and a Sentence Discriminator (SD), the ITDA can augment sentences in 109 languages supported by Google Translate using a text classifier best suited to the specific language and data set. By changing the SG parameters (i, n), the ITDA improves performance in most cases and can be applied to languages with different grammar rules.
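The generate-then-filter idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. Several things here are assumptions made only for illustration: that i counts chained (serial) round trips and n counts pivot languages tried side by side (parallel), that the SD keeps a candidate when a classifier predicts the original label, and every function name, the toy lexicon, and the toy classifier. The toy translator stands in for a real service such as Google Translate so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures (not from the paper).
Translator = Callable[[str, str, str], str]   # (text, src, tgt) -> text
Keep = Callable[[str, str], bool]             # (candidate, label) -> keep?


def round_trip(text: str, src: str, pivot: str, translate: Translator) -> str:
    """Translate src -> pivot -> src once."""
    return translate(translate(text, src, pivot), pivot, src)


def generate(sentence: str, src: str, pivots: List[str],
             i: int, n: int, translate: Translator) -> List[str]:
    """SG sketch: for each of the first n pivots (parallel),
    chain up to i round trips (serial) and collect new sentences."""
    candidates: List[str] = []
    for pivot in pivots[:n]:
        current = sentence
        for _ in range(i):
            current = round_trip(current, src, pivot, translate)
            if current != sentence and current not in candidates:
                candidates.append(current)
    return candidates


def augment(sentence: str, label: str, src: str, pivots: List[str],
            i: int, n: int, translate: Translator, keep: Keep) -> List[str]:
    """SG followed by SD: keep only candidates the discriminator accepts."""
    generated = generate(sentence, src, pivots, i, n, translate)
    return [c for c in generated if keep(c, label)]


# Toy stand-ins so the sketch runs without a network connection.
TOY_LEXICON: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "de"): {"movie": "Film", "great": "toll"},
    ("de", "en"): {"Film": "film", "toll": "great"},
    ("en", "fr"): {"movie": "film", "great": "super"},
    ("fr", "en"): {"film": "movie", "super": "superb"},
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Word-by-word lookup; unknown words pass through unchanged."""
    table = TOY_LEXICON.get((src, tgt), {})
    return " ".join(table.get(word, word) for word in text.split())


def toy_keep(candidate: str, label: str) -> bool:
    """Toy classifier: predicts 'pos' only when it sees the word 'great'."""
    predicted = "pos" if "great" in candidate.split() else "unknown"
    return predicted == label


if __name__ == "__main__":
    print(generate("a great movie", "en", ["de", "fr"], 2, 2, toy_translate))
    print(augment("a great movie", "pos", "en", ["de", "fr"], 2, 2,
                  toy_translate, toy_keep))
```

With the toy lexicon, the generator produces one paraphrase per pivot language, and the toy discriminator discards the one its classifier no longer recognizes as positive. In a real setting, `toy_translate` would be replaced by a call to a translation service and `toy_keep` by a classifier trained for the target language and data set.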

RELATED WORK

BACKGROUND

ITERATIVE TRANSLATION-BASED DATA AUGMENTATION METHOD

EXPERIMENTS

Findings

Conclusion