SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification

Pablo D Gutiérrez, José M Benítez, Miguel Lastra, Francisco Herrera

doi:10.1007/s13748-017-0128-2

Abstract

It is now common to work with large amounts of data, since our capacity to collect and store information has increased significantly. Extracting knowledge in these scenarios is commonly referred to as “Big Data,” and it is typically performed on large clusters running MapReduce platforms. Imbalanced classification poses a problem in both traditional and Big Data learning scenarios. Data sampling is one way to improve performance on imbalanced problems. A method based on commodity hardware can offload these computations from the expensive, highly demanded hardware that MapReduce platforms require. The characteristics of some sampling methods make them suitable for adaptation to commodity hardware, taking advantage of the parallel computation capabilities of graphics processing units (GPUs). SMOTE, one of the most popular oversampling methods, is based on the nearest neighbor rule. The proposed SMOTE-GPU efficiently handles large datasets (several million instances) on a wide variety of commodity hardware, including a laptop computer.
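
To make the nearest neighbor idea behind SMOTE concrete, the following is a minimal CPU-only sketch in plain Python. It is not the SMOTE-GPU implementation proposed in the paper: the function names (`k_nearest`, `smote`), the brute-force neighbor search, the default `k=5`, and the toy data are all assumptions chosen for illustration. Each synthetic instance is placed at a random position on the segment between a minority-class instance and one of its k nearest minority-class neighbors.

```python
import math
import random


def k_nearest(points, index, k):
    """Return indices of the k points closest to points[index] (brute force)."""
    dists = [
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    ]
    dists.sort()
    return [j for _, j in dists[:k]]


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration; not the paper's GPU implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    # Neighbor lists are computed once; each search is independent of the others.
    neighbors = [k_nearest(minority, i, k) for i in range(len(minority))]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # pick a minority instance
        j = rng.choice(neighbors[i])       # pick one of its k nearest neighbors
        gap = rng.random()                 # interpolation factor in [0, 1)
        base, other = minority[i], minority[j]
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two features per instance (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2)
```

In this sketch the neighbor search compares every minority instance against every other one, so its cost grows quadratically with the number of minority instances. Because the distance computations are independent of one another, this is the kind of step that lends itself to parallel hardware such as a GPU.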
