Improved classification of large imbalanced data sets using rationalized technique: Updated Class Purity Maximization Over_Sampling Technique (UCPMOT)

Sachin S Patil, Shefali P Sonavane

doi:10.1186/s40537-017-0108-1

Abstract

The sheer variety of NoSQL Big Data has created a need for new ways to store, process and analyze it. The volume of data generated is enormous, and it comes with uncertain veracity and a demand for creative visualization. New frameworks help to extract substantial, previously unidentified value from massive data sets. They have added a new dimension to the pre-processing and contextual conversion of data sets for analysis. In addition, the handling of imbalanced data sets has become a cause for concern. Traditional classifiers are unable to address the specific classification needs of such data sets. Over_sampling of the minority classes helps to improve performance. Updated Class Purity Maximization Over_Sampling Technique (UCPMOT) is a rationalized technique proposed to handle imbalanced data sets using exclusive safe-level based synthetic sample creation. It addresses the multi-class problem using a newly introduced method called lowest versus highest. The proposed technique is evaluated on several data sets from the UCI repository. The underlying MapReduce environment provides distributed processing on the Apache Hadoop framework. Several classifiers are used to validate the classification results using measures such as F-measure and AUC. The experimental results indicate that UCPMOT outperforms the benchmark techniques.

Highlights

The sheer variety of NoSQL Big Data has created a need for new ways to store, process and analyze it
An advanced cluster-based technique (UCPMOT) for dealing with binary-class/multi-class imbalanced Big Data sets is presented in this paper
The Updated Class Purity Maximization Over_Sampling Technique (UCPMOT) works with MEre Mean Minority Over_Sampling Technique (MEMMOT)/Minority Majority Mix mean Over_Sampling Technique (MMMmOT)/NF_N + Nearest Farthest Neighbor_Mid Over_Sampling Technique (MOT)/Clustering Minority Examples Over_Sampling Technique (CMEOT) using synthetic samples creation (SSS) to achieve improved F-measure and AUC values

Summary

Introduction

The sheer variety of NoSQL Big Data has created a need for new ways to store, process and analyze it. The paper proposes a category of over_sampling techniques used for balancing binary/multi-class data sets.

Experimental context

The objective of the experimental work is to validate the efficiency of the proposed techniques for dealing with the class imbalance problem in Big Data sets.
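To make the idea of balancing a data set by over_sampling concrete, the following is a minimal sketch of generic interpolation-based synthetic sample creation for a minority class. It is not UCPMOT: it does not implement safe levels, clustering, the lowest versus highest scheme, or MapReduce distribution. The function names (`nearest_neighbors`, `oversample_minority`) and the parameters `k` and `seed` are assumptions introduced purely for illustration.

```python
import random
from typing import List


def nearest_neighbors(target: List[float], pool: List[List[float]], k: int) -> List[List[float]]:
    """Return up to k points of pool closest to target, excluding target itself."""
    others = [p for p in pool if p is not target]
    # Rank by squared Euclidean distance to the target point.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(target, p)))
    return others[:k]


def oversample_minority(minority: List[List[float]], n_majority: int,
                        k: int = 3, seed: int = 0) -> List[List[float]]:
    """Create synthetic minority points until the class matches n_majority in size.

    Illustrative only: each synthetic point lies on the line segment between
    a randomly chosen minority point and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic: List[List[float]] = []
    while len(minority) + len(synthetic) < n_majority:
        base = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(base, minority, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy data: 3 minority samples, to be balanced against 10 majority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new_points = oversample_minority(minority, n_majority=10)
print(len(minority) + len(new_points))  # 10
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class. A safe-level based method such as the one the paper describes would additionally constrain where along that segment a sample may be placed.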

Results

Conclusion