A Novel Ensemble Method for Improving Disease Burden Detection in Imbalanced Epidemiological Data and Large Cancer Registry

Abstract

Predictive cancer epidemiology applies machine learning to historical medical data to support early diagnosis, reduce cancer spread, and identify high-risk individuals. However, data quality issues and class imbalance often hinder these tasks. Existing class imbalance solutions are prone to information loss, weak recognition of the majority class, increased false positives, and reduced reliability. To overcome these drawbacks, we propose a novel enhanced boosting-based method, JEUBoost. First, JEUBoost reduces the majority class size by using Gaussian probability density to estimate sample probabilities and information entropy to measure sample informativeness, creating a more balanced dataset. Second, a modified metaheuristic algorithm, Jaya, is employed to improve probability estimation of high-quality samples by adjusting relevant parameters of the Gaussian model. Third, a customized cost function for Jaya is formulated as an optimization problem to minimize the model’s error rate. Experimental results demonstrate that the performance metrics achieved by JEUBoost, including Accuracy, Precision, Recall, F1-score, G-means, and Precision–Recall curves, range from 68% to 99%. Compared to conventional class imbalance methods, JEUBoost improved Precision by 2.6%, Recall by 5.5%, G-means by 3.2%, and the average precision of the Precision–Recall curve by 8.7%, while reducing variance by 82.71%, demonstrating consistent performance gains across all key metrics.
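To make the role of Jaya more concrete, the following is a minimal sketch of the standard Jaya update rule, not the modified variant or the customized cost function the abstract describes (neither is specified here). The function name `jaya_minimize`, the two-parameter search space, and the toy cost are all assumptions for illustration; in the paper's setting the cost would be the model's error rate as a function of the Gaussian model parameters.

```python
import random

def jaya_minimize(cost, bounds, pop_size=20, iterations=200, seed=0):
    """Standard Jaya: move each candidate toward the best and away from the worst."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(x) for x in pop]
    for _ in range(iterations):
        # Best and worst are fixed for the whole iteration.
        best = pop[scores.index(min(scores))]
        worst = pop[scores.index(max(scores))]
        for i, x in enumerate(pop):
            cand = []
            for j, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                v = x[j] + r1 * (best[j] - abs(x[j])) - r2 * (worst[j] - abs(x[j]))
                cand.append(min(max(v, lo), hi))  # keep the candidate inside the bounds
            c = cost(cand)
            if c < scores[i]:  # greedy acceptance: keep only improvements
                pop[i], scores[i] = cand, c
    k = scores.index(min(scores))
    return pop[k], scores[k]

# Hypothetical stand-in cost: squared distance from an assumed target (mean=1.0, std=2.0).
def toy_cost(params):
    return (params[0] - 1.0) ** 2 + (params[1] - 2.0) ** 2

best_params, best_cost = jaya_minimize(toy_cost, [(0.0, 5.0), (0.0, 5.0)])
print(best_params, best_cost)
```

Jaya has no algorithm-specific tuning parameters beyond population size and iteration count, which may be one reason it suits a parameter search of this kind.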

Similar Papers
  • Conference Article
  • Cite Count Icon 8
  • 10.1109/iri.2013.6642460
The importance of performance metrics within wrapper feature selection
  • Aug 1, 2013
  • Randall Wald + 2 more

Many important datasets are affected by the problem of high dimensionality (having a large number of attributes or features), which can result in complex and time-consuming classification models. Feature selection techniques try to identify an optimal subset of features which may show improved classification performance as well as identify important features for the application at hand. Wrapper feature selection in particular uses a classifier to discover which feature subsets are most useful. However, feature selection can be affected by another dataset problem: imbalanced data. When one class outnumbers the other class(es), the chosen features may not reflect those most important to all classes, especially when wrapper feature selection uses a performance metric which does not consider class imbalance. No previous work has examined how the choice of performance metric within wrapper-based feature selection will affect classification performance. To study this effect, in this paper we consider two high-dimensional datasets drawn from the field of Twitter profile mining, both of which exhibit class imbalance. Using the Logistic Regression learner, we perform wrapper feature selection followed by classification, using five different performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision Recall Curve, Best Arithmetic Mean of TPR and TNR, Best Geometric Mean of TPR and TNR, and Overall Accuracy) both for the wrapper and for evaluating the classification model. We find that performance metrics which take class imbalance into account, especially the Area Under the Precision-Recall Curve, are far more effective than Overall Accuracy when used within the wrapper, producing much better performance as evaluated by the metrics which consider imbalance. In fact, even when Overall Accuracy is the classification metric, it is not the best metric to use within the wrapper. In addition, we find that there is no direct connection between the metric used inside the wrapper and the metric used for classification evaluation: the results show similar patterns across all four balance-aware metrics (i.e., all but Overall Accuracy).

  • Conference Article
  • Cite Count Icon 15
  • 10.1109/iri.2019.00026
A Comparison of Performance Metrics with Severely Imbalanced Network Security Big Data
  • Jul 1, 2019
  • Tawfiq Hasanin + 2 more

Severe class imbalance between the majority and minority classes in large datasets can prejudice Machine Learning classifiers toward the majority class. Our work uniquely consolidates two case studies, each utilizing three learners implemented within an Apache Spark framework, six sampling methods, and five sampling distribution ratios to analyze the effect of severe class imbalance on big data analytics. We use three performance metrics to evaluate this study: Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, and Geometric Mean. In the first case study, models were trained on one dataset (POST) and tested on another (SlowlorisBig). In the second case study, the training and testing dataset roles were switched. Our comparison of performance metrics shows that Area Under the Precision-Recall Curve and Geometric Mean are sensitive to changes in the sampling distribution ratio, whereas Area Under the Receiver Operating Characteristic Curve is relatively unaffected. In addition, we demonstrate that when comparing sampling methods, borderline-SMOTE2 outperforms the other methods in the first case study, and Random Undersampling is the top performer in the second case study.
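The Geometric Mean metric mentioned above is commonly computed as the square root of TPR times TNR. The sketch below uses made-up confusion-matrix counts (not taken from any of the listed studies) to show why it is preferred over accuracy under severe imbalance; the function name `imbalance_metrics` is an assumption for illustration.

```python
import math

def imbalance_metrics(tp, fn, tn, fp):
    """Return TPR, TNR, accuracy, and G-mean from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # recall on the positive (minority) class
    tnr = tn / (tn + fp) if (tn + fp) else 0.0   # recall on the negative (majority) class
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"tpr": tpr, "tnr": tnr, "accuracy": accuracy,
            "g_mean": math.sqrt(tpr * tnr)}

# Hypothetical data: 10 positives and 990 negatives.
always_negative = imbalance_metrics(tp=0, fn=10, tn=990, fp=0)  # predicts only the majority class
useful = imbalance_metrics(tp=8, fn=2, tn=900, fp=90)           # finds most positives

print(always_negative)
print(useful)
```

In this toy case the classifier that never predicts the minority class has the higher accuracy, yet its G-mean is zero, while the classifier that actually detects positives scores lower on accuracy and much higher on G-mean.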

  • Research Article
  • Cite Count Icon 21
  • 10.1186/s40537-020-00301-0
Investigating class rarity in big data
  • Mar 16, 2020
  • Journal of Big Data
  • Tawfiq Hasanin + 3 more

In Machine Learning, if one class has a significantly larger number of instances (majority) than the other (minority), this condition is defined as class imbalance. With regard to datasets, class imbalance can bias the predictive capabilities of Machine Learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this imbalance may lead to adverse consequences. Our paper incorporates two case studies, each utilizing a unique approach of three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets of gradually decreasing positive class instances. All model evaluations were performed through Cross-Validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high performance scores for the learners, with all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, with regard to both case studies, the Gradient-Boosted Trees (GBT) learner performs the best.

  • Research Article
  • Cite Count Icon 6
  • 10.1016/j.geodrs.2024.e00821
Soil textural class modeling using digital soil mapping approaches: Effect of resampling strategies on imbalanced dataset predictions
  • Jun 15, 2024
  • Geoderma Regional
  • Fereshteh Mirzaei + 4 more


  • Research Article
  • Cite Count Icon 9
  • 10.33093/jiwe.2024.3.2.17
Ensemble-SMOTE: Mitigating Class Imbalance in Graduate on Time Detection
  • Jun 13, 2024
  • Journal of Informatics and Web Engineering
  • Theng-Jia Law + 4 more

In education, detecting students graduating on time is difficult due to high data complexity. Researchers have employed various approaches to identify on-time graduation with Machine Learning, but it remains a challenging task due to the class imbalance in the dataset. This study aimed to (i) compare various class imbalance treatment methods with different sampling ratios, (ii) propose an ensemble class imbalance treatment method to mitigate the problem of class imbalance, and (iii) develop and evaluate predictive models that identify the likelihood of students graduating on time during their university studies. The dataset was collected from 4007 graduates of a university from the years 2021 and 2022, with 41 variables. After feature selection, various class imbalance treatment methods were compared with different sampling ratios ranging from 50% to 90%. Moreover, Ensemble-SMOTE is proposed to aggregate the datasets generated by Synthetic Minority Oversampling Technique variants in order to mitigate class imbalance effectively. The datasets generated by the class imbalance treatment methods were used as the input of the predictive models for detecting on-time graduation. The predictive models were evaluated based on accuracy, precision, recall, F0.5-score, F1-score, F2-score, Area Under the Curve, and Area Under the Precision-Recall Curve. Based on the findings, Logistic Regression with Ensemble-SMOTE outperformed the other predictive models and class imbalance treatment methods by achieving the highest average accuracy (87.24%), recall (92.50%), F1-score (91.30%), and F2-score (92.02%) from the 6th until the 10th trimester. To assess the effectiveness of the class imbalance treatment methods, a Friedman test is performed to determine whether the models differ significantly, after a Shapiro-Wilk test is applied to check normality. Consequently, Ensemble-SMOTE is ranked as the top performer by achieving the lowest average rank based on the performance metrics. Additional research could incorporate and examine more complicated approaches to mitigating class imbalance when the dataset is highly imbalanced.
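The F0.5-, F1-, and F2-scores listed above are all instances of the general F-beta score, where beta controls how much recall counts relative to precision. A minimal sketch follows, with illustrative precision and recall values that are not taken from the study; the function name `f_beta` is an assumption.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention for the degenerate case
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only: a model with perfect recall but modest precision.
precision, recall = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta}: {f_beta(precision, recall, beta):.4f}")
```

With high recall and low precision, the F2-score is the most forgiving of the three and the F0.5-score the least, which is why reporting all three gives a fuller picture than F1 alone.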

  • Research Article
  • 10.3389/frai.2025.1682919
Evaluating XAI techniques under class imbalance using CPRD data
  • Nov 13, 2025
  • Frontiers in Artificial Intelligence
  • Teena Rai + 7 more

Introduction: The need for eXplainable Artificial Intelligence (XAI) in healthcare is more critical than ever, especially as regulatory frameworks such as the European Union Artificial Intelligence (EU AI) Act mandate transparency in clinical decision support systems. Post hoc XAI techniques such as Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDPs) are widely used to interpret Machine Learning (ML) models for disease risk prediction, particularly in tabular Electronic Health Record (EHR) data. However, their reliability under real-world scenarios is not fully understood. Class imbalance is a common challenge in many real-world datasets, but it is rarely accounted for when evaluating the reliability and consistency of XAI techniques.
Methods: In this study, we design a comparative evaluation framework to assess the impact of class imbalance on the consistency of model explanations generated by LIME, SHAP, and PDPs. Using UK primary care data from the Clinical Practice Research Datalink (CPRD), we train three ML models: XGBoost (XGB), Random Forest (RF), and Multi-layer Perceptron (MLP), to predict lung cancer risk and evaluate how interpretability is affected under class imbalance when compared against a balanced dataset. To our knowledge, this is the first study to evaluate explanation consistency under class imbalance across multiple models and interpretation methods using real-world clinical data.
Results: Our main finding is that class imbalance in the training data can significantly affect the reliability and consistency of LIME and SHAP explanations when evaluated against models trained on balanced data. To explain these empirical findings, we also present a theoretical analysis of LIME and SHAP to understand why explanations change under different class distributions. It is also found that PDPs exhibit noticeable variation between models trained on imbalanced and balanced datasets with respect to clinically relevant features for predicting lung cancer risk.
Discussion: These findings highlight a critical vulnerability in current XAI techniques, i.e., their interpretability is significantly affected under skewed class distributions, which are common in medical data, and emphasise the importance of consistent model explanations for trustworthy ML deployment in healthcare.

  • Research Article
  • Cite Count Icon 1
  • 10.3389/fneur.2025.1716984
Explainable machine learning for stroke risk prediction: a comparative study with SHAP-based interpretation
  • Jan 12, 2026
  • Frontiers in Neurology
  • Xiaoyu Tang + 3 more

Background: Stroke is one of the leading causes of death and disability worldwide, making early screening and risk prediction crucial. Traditional methods have limitations in handling nonlinear relationships between variables, class imbalance, and model interpretability.
Methods: Logistic regression (LR), random forest (RF), extreme gradient boosting (XGBoost), categorical boosting (CatBoost), multi-layer perceptron (MLP) neural network, and ensemble models were constructed and compared. Their performance in stroke risk prediction was systematically evaluated, and feature contributions were interpreted using SHapley Additive exPlanations (SHAP). Confusion matrices and Precision-Recall (PR) curves were used to compare the differences in recognition of the positive class (stroke patients) among the models, and training time was calculated to quantify resource consumption.
Results: The ensemble model and neural network demonstrated superior overall predictive ability to traditional algorithms, with the MLP performing particularly well in terms of recall. SHAP results revealed that “hypertension,” “average blood glucose level,” and “age” were key influencing factors. Confusion matrices and PR curves indicated differences in positive classification among the models. Training time analysis provided a basis for resource assessment for subsequent deployment.
Conclusion: Machine learning methods have advantages in stroke risk prediction. Incorporating interpretability analysis can enhance the clinical credibility of the models, providing data and methodological reference for stroke risk stratification management and early warning.

  • Conference Article
  • Cite Count Icon 20
  • 10.1109/icmla55696.2022.00224
Informative Evaluation Metrics for Highly Imbalanced Big Data Classification
  • Dec 1, 2022
  • John Hancock + 2 more

We conduct experiments that show the Area Under the Precision Recall Curve (AUPRC) metric provides a more meaningful insight into the impact of Random Undersampling than Area Under the Receiver Operating Characteristic Curve (AUC). Evaluating experiments with multiple metrics is a robust method for overcoming challenges in Machine Learning, such as class imbalance. Random Undersampling is a technique to deal with class imbalance. We find Random Undersampling may provide an improvement to AUC scores. However, at the same time, Random Undersampling may be detrimental to AUPRC scores. AUPRC is a metric that involves precision, whereas AUC does not. In the classification of imbalanced Big Data, an increase in false positive counts causes a more noticeable drop in precision scores. Therefore, in application domains where false positives are undesirable, optimizing models for AUPRC is a wise choice. Our contribution is to compare the performance of models in terms of AUPRC and AUC to show the impact of Random Undersampling on the classification of imbalanced Big Data. We compare the performance via experiments in the classification of highly imbalanced Big Data. Models are built with data in its original class ratio, and with data undersampled into 5 distinct class ratios. We report the results of 600 experiments where we apply Random Undersampling to a dataset with about 175 million instances. To the best of our knowledge, we are the first to utilize Medicare Part D data which became available in 2021.
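One way to see why a precision-based metric reacts to false positives so much more strongly under imbalance than a ROC-based one is to hold a classifier's operating point (TPR and FPR) fixed and vary only the share of positives in the evaluation data. The sketch below does this arithmetic; the rates and prevalences are illustrative assumptions, not figures from the study, and `precision_at_prevalence` is a hypothetical helper name.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a fixed (TPR, FPR) point at a given positive-class share."""
    true_pos = tpr * prevalence            # expected fraction of true positives
    false_pos = fpr * (1.0 - prevalence)   # expected fraction of false positives
    if true_pos + false_pos == 0.0:
        return 0.0  # no positive predictions at all
    return true_pos / (true_pos + false_pos)

# Same ROC operating point, increasingly rare positive class.
for prevalence in (0.5, 0.1, 0.01, 0.001):
    p = precision_at_prevalence(tpr=0.9, fpr=0.05, prevalence=prevalence)
    print(f"prevalence={prevalence}: precision={p:.4f}")
```

The ROC point never moves, yet precision collapses as positives become rare, because even a small false positive rate applied to a very large negative class produces far more false alarms than there are true positives.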

  • Research Article
  • Cite Count Icon 50
  • 10.1007/s00330-024-10834-0
Class imbalance on medical image classification: towards better evaluation practices for discrimination and calibration performance.
  • Jun 11, 2024
  • European radiology
  • Candelaria Mosquera + 4 more

This work aims to assess standard evaluation practices used by the research community for evaluating medical imaging classifiers, with a specific focus on the implications of class imbalance. The analysis is performed on chest X-rays as a case study and encompasses a comprehensive model performance definition, considering both discriminative capabilities and model calibration. We conduct a concise literature review to examine prevailing scientific practices used when evaluating X-ray classifiers. Then, we perform a systematic experiment on two major chest X-ray datasets to showcase a didactic example of the behavior of several performance metrics under different class ratios and highlight how widely adopted metrics can conceal performance in the minority class. Our literature study confirms that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) it is still uncommon to include calibration studies for chest X-ray classifiers, despite their importance in the context of healthcare. Moreover, our systematic experiments confirm that current evaluation practices may not reflect model performance in real clinical scenarios and suggest complementary metrics to better reflect the performance of the system in such scenarios. Our analysis underscores the need for enhanced evaluation practices, particularly in the context of class-imbalanced chest X-ray classifiers. We recommend the inclusion of complementary metrics such as the area under the precision-recall curve (AUC-PR), adjusted AUC-PR, and balanced Brier score, to offer a more accurate depiction of system performance in real clinical scenarios, considering metrics that reflect both discrimination and calibration performance.
This study underscores the critical need for refined evaluation metrics in medical imaging classifiers, emphasizing that prevalent metrics may mask poor performance in minority classes, potentially impacting clinical diagnoses and healthcare outcomes. Common scientific practices in papers dealing with X-ray computer-assisted diagnosis (CAD) systems may be misleading. We highlight limitations in reporting of evaluation metrics for X-ray CAD systems in highly imbalanced scenarios. We propose adopting alternative metrics based on experimental evaluation on large-scale datasets.

  • Conference Article
  • Cite Count Icon 18
  • 10.1109/spac.2017.8304290
Feature selection for high dimensional imbalanced class data based on F-measure optimization
  • Dec 1, 2017
  • Chunkai Zhang + 6 more

Feature selection is designed to eliminate redundant attributes and improve classification accuracy. This is a challenging problem, especially in the case of imbalanced data. Traditional feature selection methods ignore the problem of class imbalance, making the selected features biased towards the majority class and neglecting the significant features for the minority class. Due to the advantage of F-measure in imbalanced data classification, we propose to use F-measure rather than accuracy as the optimization target in the feature selection algorithm. This paper introduces a novel feature selection method, SSVM-FS, which is based on an optimal F-measure structural support vector machine classifier. Features are selected according to the weight vector of the SSVM, which takes the class imbalance problem into account. Based on this, we developed a comprehensive feature ranking method which integrates the weight vector of the SSVM and symmetric uncertainty. We use the comprehensive score to reduce the features to a suitable size and then use a harmony search to find the optimal combination of features to predict the target class label. The feature subset selected by the proposed method can represent both the majority and minority classes and is also less redundant. The experimental results on six high-dimensional class-imbalanced microarray data sets show that this method is better suited to imbalanced classification.

  • Research Article
  • Cite Count Icon 34
  • 10.1016/j.dim.2023.100064
Adaptive K-means clustering based under-sampling methods to solve the class imbalance problem
  • Dec 30, 2023
  • Data and Information Management
  • Qian Zhou + 1 more


  • Book Chapter
  • Cite Count Icon 2
  • 10.1016/b978-0-323-95462-4.00014-5
Chapter 14 - Class imbalance and its impact on predictive models for binary classification of disease: a comparative analysis
  • Jan 1, 2024
  • Artificial Intelligence and Image Processing in Medical Imaging
  • Mubarak Taiwo Mustapha + 1 more


  • Research Article
  • Cite Count Icon 48
  • 10.1007/s10844-023-00793-1
A novel approach for software defect prediction using CNN and GRU based on SMOTE Tomek method
  • May 16, 2023
  • Journal of Intelligent Information Systems
  • Nasraldeen Alnor Adam Khleel + 1 more

Software defect prediction (SDP) plays a vital role in enhancing the quality of software projects and reducing maintenance-based risks through the ability to detect defective software components. SDP refers to using historical defect data to construct a relationship between software metrics and defects via diverse methodologies. Several prediction models, such as machine learning (ML) and deep learning (DL), have been developed and adopted to recognize software module defects, and many methodologies and frameworks have been presented. Class imbalance is one of the most challenging problems these models face in binary classification. When the distribution of classes is imbalanced, the accuracy may be high, but the models cannot recognize data instances in the minority class, leading to weak classifications. So far, few previous studies have addressed the problem of class imbalance in SDP. In this study, a data sampling method is introduced to address the class imbalance problem and improve the performance of ML models in SDP. The proposed approach is based on a convolutional neural network (CNN) and gated recurrent unit (GRU) combined with a synthetic minority oversampling technique plus the Tomek link (SMOTE Tomek) to predict software defects. To establish the efficiency of the proposed models, the experiments have been conducted on benchmark datasets obtained from the PROMISE repository. The experimental results have been compared and evaluated in terms of accuracy, precision, recall, F-measure, Matthews correlation coefficient (MCC), the area under the ROC curve (AUC), the area under the precision-recall curve (AUCPR), and mean square error (MSE). The experimental results showed that the proposed models predict the software defects more effectively on the balanced datasets than the original datasets, with an improvement of up to 19% for the CNN model and 24% for the GRU model in terms of AUC. We compared our proposed approach with existing SDP approaches based on several standard performance measures. The comparison results demonstrated that the proposed approach significantly outperforms existing state-of-the-art SDP approaches on most datasets.

  • Book Chapter
  • Cite Count Icon 15
  • 10.1007/978-3-319-57351-9_11
Classification of Imbalanced Auction Fraud Data
  • Jan 1, 2017
  • Swati Ganguly + 1 more

Online auctioning has attracted serious fraud given the huge amount of money involved and the anonymity of users. In the auction fraud detection domain, class imbalance, which means that fewer fraud instances are present in bidding transactions, negatively impacts classification performance because the classifier is biased towards the majority class, i.e., normal bidding behavior. The best-designed approach to handle the imbalanced learning problem is data sampling, which was found to improve classification efficiency. In this study, we utilize a hybrid method of data over-sampling and under-sampling to be more effective in addressing the issue of highly imbalanced auction fraud datasets. We deploy a set of well-known binary classifiers to understand how the class imbalance affects the classification results. We choose the most relevant performance metrics to deal with both imbalanced data and fraud bidding data.

  • Research Article
  • Cite Count Icon 2791
  • 10.1186/s40537-019-0192-5
Survey on deep learning with class imbalance
  • Mar 19, 2019
  • Journal of Big Data
  • Justin M Johnson + 1 more

The purpose of this study is to examine existing deep learning techniques for addressing class imbalanced data. Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g., fraud detection and cancer detection. Moreover, highly imbalanced data poses added difficulty, as most learners will exhibit bias towards the majority class, and in extreme cases, may ignore the minority class altogether. Class imbalance has been studied thoroughly over the last two decades using traditional machine learning models, i.e. non-deep learning. Despite recent advances in deep learning, along with its increasing popularity, very little empirical work in the area of deep learning with class imbalance exists. Having achieved record-breaking performance results in several complex domains, investigating the use of deep neural networks for problems containing high levels of class imbalance is of great interest. Available studies regarding class imbalance and deep learning are surveyed in order to better understand the efficacy of deep learning when applied to class imbalanced data. This survey discusses the implementation details and experimental results for each study, and offers additional insight into their strengths and weaknesses. Several areas of focus include: data complexity, architectures tested, performance interpretation, ease of use, big data application, and generalization to other domains. We have found that research in this area is very limited, that most existing work focuses on computer vision tasks with convolutional neural networks, and that the effects of big data are rarely considered. Several traditional methods for class imbalance, e.g. data sampling and cost-sensitive learning, prove to be applicable in deep learning, while more advanced methods that exploit neural network feature learning abilities show promising results. 
The survey concludes with a discussion that highlights various gaps in deep learning from class imbalanced data for the purpose of guiding future research.
