Multi-Cluster Based Approach for Skewed Data in Data Mining

Mr. Rushi Longadge

doi:10.9790/0661-1266673

Abstract

Difficulties arise in data mining when machine learning techniques are applied to real-world data, which is frequently skewed. Typical examples from industry where skewed data is an intrinsic problem include fraud detection in financial data, diagnosis of rare diseases, and network intrusion detection. This is also known as the class imbalance problem: the samples of one class in the data set are far fewer than those of another class. Many techniques have been developed for handling class imbalance, and they fall broadly into two types. The first designs a new algorithm that improves minority class prediction; the second modifies the number of samples in the existing classes, which is also known as data pre-processing. Under-sampling is a very popular data pre-processing approach to the class imbalance problem. It is very efficient because it uses only a subset of the majority class. Its drawback is that it discards many useful majority class samples. To solve this problem, we propose a multi-cluster-based majority under-sampling and random minority over-sampling approach. Compared with plain under-sampling, cluster-based random under-sampling can effectively avoid the loss of important majority class information.

Keywords: skewed data, random under-sampling, class imbalance problem, clustering, imbalanced dataset.
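
The approach summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the clustering algorithm, the number of clusters, or the target class size, so the sketch assumes a plain k-means, a per-cluster quota proportional to cluster size, and a common target size defaulting to the midpoint of the two class sizes. The names `kmeans` and `cluster_balance` and all parameters are hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples of floats; returns one cluster label per point.
    Assumption: the abstract does not name a clustering algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters keep theirs.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_balance(majority, minority, k=3, target=None, seed=0):
    """Cluster-based majority under-sampling plus random minority over-sampling.
    Assumes len(minority) <= target <= len(majority)."""
    rng = random.Random(seed)
    if target is None:
        # Assumed default: midpoint between the two class sizes.
        target = (len(majority) + len(minority)) // 2

    labels = kmeans(majority, k, seed=seed)
    kept, leftovers = [], []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        rng.shuffle(members)
        # Each cluster contributes in proportion to its size, so no region
        # of the majority class is dropped entirely.
        quota = target * len(members) // len(majority)
        kept.extend(members[:quota])
        leftovers.extend(members[quota:])
    # Flooring the quotas can leave a few slots open; fill them at random.
    rng.shuffle(leftovers)
    kept.extend(leftovers[:target - len(kept)])

    # Random over-sampling: duplicate minority samples drawn with replacement.
    extra = rng.choices(minority, k=target - len(minority))
    return kept, list(minority) + extra
```

For example, calling `cluster_balance(majority, minority, k=3)` with 90 majority and 10 minority samples returns 50 samples of each class under these assumptions. Drawing from every cluster, rather than from the majority class as a whole, is what keeps sparse regions of the majority class represented after under-sampling.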

Full Text