Abstract

A well-known limitation of existing rule-based text augmentation is that it cannot be readily applied to other languages, because it depends on the grammatical and structural characteristics of a particular language. Moreover, most text Generative Adversarial Networks (GANs) are unstable in training due to inefficient generator optimization and rely on maximum likelihood pre-training. This paper addresses these problems by proposing Iterative Translation-based Data Augmentation (ITDA), a novel augmentation method built from a Sentence Generator (SG) and a Sentence Discriminator (SD). This paper makes three original contributions. First, the ITDA SG is designed to provide universal multiple-language support by generating comprehensive augmented sentences through serial and parallel iterations of an existing translator, such as Google Translate. Second, because the quality of the generated sentences varies with the translation combination and the type of sentence, the ITDA uses a discriminator, based on a text classifier, to select high-quality augmented data. Third, the ITDA can perform sentence augmentation for 109 different languages using discriminators based on text classifiers trained for a specific language or type of data set. Extensive experiments are conducted to evaluate the efficacy of the ITDA using a Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), CNN-BiLSTM, and self-attention. The results demonstrate that when the ITDA is applied to 480 sentence classification tasks, the average accuracy increases by 4.24%.

Highlights

  • Many organizations collect increasing amounts of data to build complex data analytics, machine learning, and Artificial Intelligence models [1]

  • This paper proposes a novel Iterative Translation-based Data Augmentation (ITDA) method that can be applied to multiple languages

  • We develop a deep learning model using a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) as the ITDA Sentence Discriminator (SD)

Summary

INTRODUCTION

Many organizations collect increasing amounts of data to build complex data analytics, machine learning, and Artificial Intelligence models [1]. Machine learning models trained on smaller data sets often do not perform well enough. Some previous studies address this limitation in text augmentation by focusing on rule-based methods that are highly language dependent [5, 6, 7]. Such data augmentation methods have limited utility for other languages. This paper proposes a novel Iterative Translation-based Data Augmentation (ITDA) method that can be applied to multiple languages. By intelligently combining an SG and SD, the ITDA can augment sentences in 109 languages supported by Google Translate using a text classifier best suited to the specific language and data set. By changing the SG parameters (i, n), the ITDA improves performance in most cases and can be applied to languages with different grammar rules.
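The generate-then-filter idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that serial iteration means repeated round trips through one pivot language and that parallel iteration means running such chains over several pivot languages; the summary does not define the SG parameters (i, n), so `iterations` and `pivot_langs` are illustrative stand-ins only. All function names are hypothetical, and `toy_translate` and `toy_score` are placeholders for a real translator and a trained text classifier.

```python
from typing import Callable, List

# Hypothetical interfaces: a translator maps (text, source, target) to text,
# and a scorer maps a sentence to a quality score.
Translator = Callable[[str, str, str], str]
Scorer = Callable[[str], float]


def generate_candidates(
    sentence: str,
    translate: Translator,
    source_lang: str,
    pivot_langs: List[str],
    iterations: int,
) -> List[str]:
    """Collect round-trip translations of `sentence` (assumed SG behavior)."""
    candidates: List[str] = []
    for pivot in pivot_langs:          # "parallel": one chain per pivot language
        current = sentence
        for _ in range(iterations):    # "serial": repeated round trips
            forward = translate(current, source_lang, pivot)
            current = translate(forward, pivot, source_lang)
            # Keep only new sentences that differ from the original.
            if current != sentence and current not in candidates:
                candidates.append(current)
    return candidates


def select_candidates(
    candidates: List[str], score: Scorer, threshold: float
) -> List[str]:
    """Keep candidates whose score reaches `threshold` (assumed SD behavior)."""
    return [c for c in candidates if score(c) >= threshold]


def toy_translate(text: str, src: str, tgt: str) -> str:
    # Placeholder, not a real translator: tags the text with the target language.
    return f"{text}|{tgt}"


def toy_score(text: str) -> float:
    # Placeholder, not a real classifier: prefers sentences with fewer round trips.
    return 1.0 if text.count("|") <= 2 else 0.0
```

With the toy stand-ins, `generate_candidates("hello", toy_translate, "en", ["fr", "de"], 2)` yields four candidates, and `select_candidates` with `toy_score` and a threshold of 0.5 keeps the two produced by a single round trip. In practice the translator would be an external service and the scorer a classifier trained for the target language and data set, as the summary describes.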

RELATED WORK
BACKGROUND
ITERATIVE TRANSLATION-BASED DATA AUGMENTATION METHOD
EXPERIMENTS
Findings
Conclusion