Abstract

Class imbalance is a major challenge in judicial data analysis because it often lowers classification accuracy. Synthesizing new samples through oversampling is a useful way to handle this problem. However, most oversampling algorithms ignore noise samples and do not fully account for the data distribution. To address these limitations, an improved cluster-based synthetic oversampling algorithm, the distributed fuzzy-based adaptive synthetic oversampling (DFBASO) algorithm, is proposed. It simultaneously considers the inter-class distribution, the intra-cluster distribution, and the characteristics of noise samples. The DFBASO algorithm comprises: 1) fuzzy c-means (FCM) clustering applied to samples of both the minority and majority classes; 2) a weighted distribution based on two factors, the inter-class distance and the cluster capacity; and 3) a mixed synthesis method for different intra-cluster distribution cases. Finally, a judicial data set and eight public data sets are used to demonstrate the effectiveness and general applicability of the proposed DFBASO algorithm for imbalanced data classification.
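The three components listed above can be illustrated with a minimal Python sketch. It is not the DFBASO implementation: the abstract does not give the weighting formula or the rules of the mixed synthesis method, so the function names (`fcm_memberships`, `allocate`, `interpolate`), the proportional allocation rule, and the SMOTE-style interpolation are all assumptions chosen for illustration. Only the membership update is the standard FCM formula.

```python
import math
import random


def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update for fixed cluster centers.

    u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)), with fuzzifier m > 1.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        )
    return memberships


def allocate(weights, n_new):
    """Split n_new synthetic samples across clusters in proportion to weights.

    Assumption: the weights are supplied by the caller. How DFBASO combines
    inter-class distance and cluster capacity into a weight is not specified
    in the abstract.
    """
    total = sum(weights)
    counts = [int(n_new * w / total) for w in weights]
    # Hand the rounding remainder to the highest-weight clusters first.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in order[: n_new - sum(counts)]:
        counts[i] += 1
    return counts


def interpolate(a, b, rng):
    """SMOTE-style synthesis (an assumption): a random point on segment a-b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


if __name__ == "__main__":
    rng = random.Random(0)
    minority = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]
    centers = [(0.5, 0.0), (4.5, 4.0)]  # hypothetical cluster centers
    print(fcm_memberships(minority, centers))
    print(allocate([1.0, 3.0], 8))  # hypothetical cluster weights
    print(interpolate(minority[0], minority[1], rng))
```

In a full pipeline the cluster centers would come from iterating the FCM center and membership updates to convergence, and the weights would be computed from the clustered minority and majority samples rather than passed in by hand.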
