Enhanced Over_Sampling Techniques for Imbalanced Big Data Set Classification

Sachin Subhash Patil,Shefali Pratap Sonavane

doi:10.1007/978-3-319-53474-9_3

Abstract

Facing hundreds of gigabytes of data has triggered a need to reconsider data management options. There is a tremendous requirement to study data sets beyond the capability of commonly used software tools to capture, curate and manage within a tolerable elapsed time and also beyond the processing feasibility of the single machine architecture. In addition to the traditional structured data, the new avenue of NoSQL Big Data has urged a call to experimental techniques and technologies that require ventures to re-integrate. It helps to discover large hidden values from huge datasets that are complex, diverse and of a massive scale. In many of the real world applications, classification of imbalanced datasets is the point of priority concern. The standard classifier learning algorithms assume balanced class distribution and equal misclassification costs; as a result, the classification of datasets having imbalanced class distribution has produced a notable drawback in performance obtained by the most standard classifier learning algorithms. Most of the classification methods focus on two-class imbalance problem inspite of multi-class imbalance problem, which exist in real-world domains. A methodology is introduced for single-class/multi-class imbalanced data sets (Lowest vs. Highest—LVH) with enhanced over_sampling (O.S.) techniques (MEre Mean Minority Over_Sampling Technique—MEMMOT, Majority Minority Mix mean—MMMm, Nearest Farthest Neighbor_Mid—NFN-M, Clustering Minority Examples—CME, Majority Minority Cluster Based Under_Over Sampling Technique—MMCBUOS, Updated Class Purity Maximization—UCPM) to improve classification. The study is based on broadly two views: either to compare the enhanced non-cluster techniques to prior work or to have a clustering based approach for advance O.S. techniques. Finally, this balanced data is to be applied to form Random Forest (R.F.) tree for classification. O.S. techniques are projected to apply on imbalanced Big Data using mapreduce environment. Experiments are suggested to perform on Apache Hadoop and Apache Spark , using different datasets from UCI/KEEL repository. Geometric mean, F-measures, Area under curve (AUC), Average accuracy, Brier scores are used to measure the performance of this classification.

Full Text