Smart Data Driven Decision Trees Ensemble Methodology for Imbalanced Big Data

Diego García-Gil, Salvador García, Ning Xiong, Francisco Herrera

doi:10.1007/s12559-024-10295-z

Abstract

Differences in data size per class, also known as imbalanced data distribution, have become a common problem affecting data quality. Big Data scenarios pose a new challenge to traditional imbalanced classification algorithms, since they are not prepared to work with such amounts of data. Data-splitting strategies and the lack of minority-class data caused by the use of the MapReduce paradigm have posed new challenges for tackling the imbalance between classes in Big Data scenarios. Ensembles have been shown to successfully address imbalanced data problems. Smart Data refers to data of enough quality to achieve high-performance models. The combination of ensembles and Smart Data, achieved through Big Data preprocessing, should be a great synergy. In this paper, we propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains, namely the SD_DeTE methodology. This methodology is based on learning different decision trees using distributed quality data for the ensemble process. This quality data is achieved by fusing random discretization, principal components analysis, and clustering-based random oversampling to obtain different Smart Data versions of the original data. Experiments carried out on 21 binary adapted datasets have shown that our methodology outperforms random forest.
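
The idea of deriving several randomized, balanced versions of one dataset can be illustrated with a minimal single-machine sketch. It is not the SD_DeTE implementation: it covers only random discretization and plain random oversampling, and it omits principal components analysis, the clustering step, the distributed execution, and the training of the decision trees. All function names (`random_discretize`, `random_oversample`, `smart_versions`) and the toy data are hypothetical and chosen for illustration only.

```python
import random
from bisect import bisect_right
from collections import Counter


def random_discretize(column, n_bins, rng):
    """Bin a numeric column using cut points drawn at random from its own values."""
    # Assumption: the column has at least n_bins - 1 distinct values.
    cuts = sorted(rng.sample(sorted(set(column)), n_bins - 1))
    return [bisect_right(cuts, v) for v in column]


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen minority rows until both classes have equal size.

    Plain random oversampling; the clustering step named in the abstract is omitted.
    """
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    pool = [r for r, y in zip(rows, labels) if y == minority]
    extra = [rng.choice(pool) for _ in range(deficit)]
    return rows + extra, labels + [minority] * deficit


def smart_versions(rows, labels, n_versions, n_bins, seed=0):
    """Build several randomized, balanced views of one dataset, one per ensemble member."""
    rng = random.Random(seed)
    versions = []
    for _ in range(n_versions):
        # Discretize each feature column with its own random cut points.
        columns = [random_discretize(list(col), n_bins, rng) for col in zip(*rows)]
        binned = [list(r) for r in zip(*columns)]
        versions.append(random_oversample(binned, list(labels), rng))
    return versions


# Toy imbalanced data: 8 majority rows (label 0) and 2 minority rows (label 1).
toy_rows = [[float(i), float(i * i % 7)] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2
versions = smart_versions(toy_rows, toy_labels, n_versions=3, n_bins=4)
for rows_v, labels_v in versions:
    print(Counter(labels_v))
```

Each version ends up with eight rows per class, while the random cut points make the discretized features differ from one version to the next. In an ensemble setting, one base learner would be trained on each version and their predictions combined.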

Journal: Cognitive Computation
Publication Date: May 31, 2024
License type: cc-by