Abstract

Data imbalance is a frequently occurring problem in classification tasks, where the number of samples in one category exceeds that in the others. The minority-class data often represents the concepts of greatest interest and is typically the hardest to obtain in real-life scenarios and applications. Consider a bank-loan customer dataset: most instances belong to the non-defaulter class and only a small number of customers are labeled as defaulters, yet classification performance on the defaulter label matters more than on the non-defaulter label in such a highly imbalanced dataset. The lack of sufficient samples across all class labels results in data imbalance and causes poor classification performance when training a model. Synthetic data generation and oversampling techniques such as SMOTE and AdaSyn can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While GAN-based approaches have proved useful for generating synthetic numerical and image data, the effectiveness of methods proposed for textual data, which must retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance into highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using generated text.
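As a rough illustration of the balancing step described above, the sketch below uses a pre-trained GPT-2 model through the Hugging Face `transformers` pipeline to generate additional minority-class text. It is a minimal sketch and not the paper's pipeline: the seed prompts, checkpoint name, sampling parameters, and target sample count are all hypothetical.

```python
# Minimal sketch: generate synthetic minority-class text with a pre-trained GPT-2.
# The seed prompts and the number of samples needed are hypothetical placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Hypothetical seed texts drawn from the minority (defaulter) class.
minority_seeds = [
    "customer missed three consecutive loan repayments and",
    "the account was flagged after repeated payment defaults because",
]

synthetic_samples = []
needed = 100  # hypothetical number of samples required to balance the classes
while len(synthetic_samples) < needed:
    seed = minority_seeds[len(synthetic_samples) % len(minority_seeds)]
    out = generator(seed, max_length=60, do_sample=True, num_return_sequences=1)
    synthetic_samples.append(out[0]["generated_text"])

# synthetic_samples can now be appended to the minority class before training.
```

In practice the generated sentences would also pass through the grammatical validation step mentioned above before being added to the training set.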

Highlights

  • This study aims to evaluate the performance of the proposed LSTM-based text generation algorithm trained for domain-specific applications, subject its generated samples to the classification task, and additionally evaluate the recently proposed GPT-2 [6] model

  • The findings showed that the GPT-2 model was able to generate text that looks like a patent claim using only a few training steps. [7] is another research work that focuses on text generation using pre-trained language models

  • Across all three tables, the first three metrics (BLEU, METEOR and ROUGE_L) show very low scores for the generated texts relative to the original texts; a minimal scoring sketch follows this list

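The sketch below shows one common way to compute the three metrics named in the last highlight for a generated sentence against its original reference, using `nltk` (BLEU, METEOR) and the `rouge-score` package (ROUGE_L). The example sentences are hypothetical and the paper's exact evaluation setup may differ.

```python
# Minimal sketch: score a generated sentence against its reference with
# BLEU, METEOR and ROUGE_L. Assumes a recent nltk (>= 3.6.6, tokenized METEOR
# input) and the rouge-score package; the sentences are illustrative only.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

reference = "the customer defaulted on the loan after six months"
generated = "the customer failed to repay the loan within six months"

ref_tokens, gen_tokens = reference.split(), generated.split()

bleu = sentence_bleu([ref_tokens], gen_tokens,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([ref_tokens], gen_tokens)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"].fmeasure

print(f"BLEU={bleu:.3f}  METEOR={meteor:.3f}  ROUGE_L={rouge_l:.3f}")
```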

Introduction

Data imbalance is a common issue in classification tasks and has adverse effects on a model’s performance. In most application domains, obtaining an equal number of samples per category in a real-world scenario is nearly impossible, and common classes end up with far more samples than the least common ones. Researchers address this issue with data over-sampling techniques that generate synthetic data from the original training samples. Techniques such as the synthetic minority oversampling technique (SMOTE) [1] and AdaSyn [2] work well for statistical data. For images, deep learning methods employing generative adversarial networks (GAN), such as CycleGAN [3], have been used to generate synthetic samples.
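As a minimal illustration of the classical over-sampling route mentioned above, the snippet below applies SMOTE from the `imbalanced-learn` package to a small, deliberately imbalanced toy dataset; the data and class ratio are illustrative only and are not drawn from the paper's experiments.

```python
# Minimal sketch: rebalance a toy statistical dataset with SMOTE.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy dataset: roughly 95% majority class, 5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))      # e.g. roughly 950 majority vs. 50 minority

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))  # both classes now have the same count
```

SMOTE interpolates between existing minority samples in feature space, which is why it suits numerical data but cannot produce grammatically coherent text, motivating the text generation approach evaluated in this paper.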
