Abstract

The amount of data collected or generated in many sectors is growing exponentially, driven by the rapid development of technologies such as the Internet of Things (IoT). Much of this data is also imbalanced. The need to extract valuable information from it for decision support challenges the scientific community to find solutions that cope with large imbalanced data. In previous work, our cost-sensitive differential evolution classification algorithm handled highly imbalanced data sets efficiently. However, that algorithm performs inefficiently on big data sets because it does not scale as data size increases. In this paper, we design and implement a parallel version of our cost-sensitive differential evolution classifier using the Apache Spark framework (SCDE), with the aim of handling large, binary, imbalanced data. The main idea of the algorithm is to use differential evolution to find the optimal centroid for each target label by minimizing the total misclassification cost, and then to assign each unlabeled data point to the closest centroid. To evaluate the algorithm's scalability and performance, our experiments include a real data set based on intrusion detection. The experimental results show that SCDE handles imbalanced binary data efficiently and scales very well as data size increases. Moreover, the speedup and scaleup obtained by SCDE are close to linear.
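The main idea described above can be illustrated with a minimal single-machine sketch. It is not the SCDE implementation and leaves out the Spark parallelization entirely. It assumes a common DE/rand/1/bin variant of differential evolution, squared Euclidean distance, and a cost table keyed by (true label, predicted label). The function names, the cost values, and the parameters `pop_size`, `generations`, `F`, and `CR` are illustrative choices, not taken from the paper.

```python
import random


def nearest_label(point, centroids):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((p - c) ** 2 for p, c in zip(point, centroids[lab])),
    )


def total_cost(vector, X, y, cost, dim):
    """Total misclassification cost of a candidate encoding two centroids.

    The first `dim` entries are the centroid for label 0, the rest for label 1.
    `cost[(true, predicted)]` is the penalty for that outcome (assumed layout).
    """
    centroids = {0: vector[:dim], 1: vector[dim:]}
    return sum(cost[(t, nearest_label(x, centroids))] for x, t in zip(X, y))


def fit_centroids(X, y, cost, pop_size=20, generations=60, F=0.5, CR=0.9, seed=0):
    """Search for two centroids with DE/rand/1/bin, minimizing total cost."""
    rng = random.Random(seed)
    dim = len(X[0])
    # Initialize candidates uniformly inside the bounding box of the data.
    lo = [min(x[j] for x in X) for j in range(dim)] * 2
    hi = [max(x[j] for x in X) for j in range(dim)] * 2
    pop = [[rng.uniform(l, h) for l, h in zip(lo, hi)] for _ in range(pop_size)]
    fit = [total_cost(v, X, y, cost, dim) for v in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct candidates other than i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(2 * dim)  # forces at least one mutated gene
            # Binomial crossover between the mutant and the current candidate.
            trial = [
                pop[a][j] + F * (pop[b][j] - pop[c][j])
                if (rng.random() < CR or j == j_rand)
                else pop[i][j]
                for j in range(2 * dim)
            ]
            # Selection: keep the trial only if it is no worse.
            f = total_cost(trial, X, y, cost, dim)
            if f <= fit[i]:
                pop[i], fit[i] = trial, f
    best = min(range(pop_size), key=fit.__getitem__)
    return {0: pop[best][:dim], 1: pop[best][dim:]}, fit[best]


# Toy imbalanced data: label 1 is the rare class and is costlier to miss.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (9.0, 9.0)]
y = [0, 0, 0, 0, 1]
cost = {(0, 0): 0, (1, 1): 0, (0, 1): 1, (1, 0): 10}

centroids, best_cost = fit_centroids(X, y, cost)
print(best_cost, nearest_label((8.0, 8.5), centroids))
```

Because missing a minority point is penalized more heavily than a false alarm, the search is pushed toward centroids that keep the rare class separable, which is the cost-sensitive aspect of the approach.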
