Prostate Cancer Detection Using Gradient Boosting Machines
Prostate cancer remains a leading cause of cancer-related deaths among men globally, emphasizing the critical need for accurate diagnostic tools. This study investigates the application of Gradient Boosting Machines (GBMs) for prostate cancer detection using a dataset with key tumor characteristics such as radius, texture, area, and symmetry. Data preprocessing included normalization, missing value handling, and the Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance. The GBM model demonstrated an accuracy of 75%, with high precision (82%) and recall (88%) for malignant cases, underscoring its potential as a reliable diagnostic tool. However, the model's performance for benign cases was limited by severe class imbalance, reflected in a precision of 33% and recall of 25%. Interpretability was enhanced using SHAP values, identifying key predictors like tumor perimeter and compactness. While GBMs show promise in prostate cancer diagnostics, future research should incorporate multimodal data, advanced balancing techniques, and rigorous validation frameworks to enhance generalizability and fairness. This study highlights the value of machine learning in healthcare, contributing to improved diagnostic accuracy and patient outcomes.
- Research Article
- 10.1016/j.eswa.2021.115658
- Jul 26, 2021
- Expert Systems with Applications
Diagnosing a parasitic disease is a difficult task in clinical practice. In this study, we constructed a machine learning model for diagnosis prediction using patient information. First, we diagnosed whether a patient has a parasitic disease. Next, we predicted the proper diagnosis method among six types of diagnostic terms (biopsy, endoscopy, microscopy, molecular, radiology, and serology) if the patient has a parasitic disease. To build the datasets, we extracted patient information from PubMed abstracts from 1956 to 2019. We then used two datasets: the prediction for parasite-infected patient dataset (N = 8748) and the prediction for diagnosis method dataset (N = 3780). We compared four machine learning models: support vector machine, random forest, multi-layered perceptron, and gradient boosting. To solve the data imbalance problem, the synthetic minority over-sampling technique (SMOTE) and Tomek links were used. On the parasite-infected patient dataset, random forest, random forest with SMOTE, gradient boosting, gradient boosting with SMOTE, and gradient boosting with Tomek links demonstrated the best performances (AUC: 79%). On the diagnosis method dataset, gradient boosting with SMOTE was the best model (AUC: 87%). For per-class prediction, gradient boosting demonstrated the best performance in biopsy (AUC: 88%); gradient boosting with SMOTE demonstrated the best performance in endoscopy (AUC: 94%), molecular (AUC: 90%), and radiology (AUC: 88%); and random forest demonstrated the best performances in microscopy (AUC: 82%) and serology (AUC: 85%). We calculated feature importance using gradient boosting; age had the highest feature importance.
In conclusion, this study demonstrated that gradient boosting with synthetic minority over-sampling technique can predict a parasitic disease and serve as a promising diagnosis tool for binary classification and multi-classification schemes.
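The feature-importance step this abstract mentions (ranking predictors from a fitted gradient-boosting model) can be sketched as follows. This is an illustrative example, not the authors' code; the feature names are hypothetical:

```python
# Sketch: rank features by impurity-based importance from a fitted GBM.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=5)
# Hypothetical feature names standing in for the extracted patient attributes
names = ["age", "sex", "region", "symptom_a", "symptom_b", "year"]

model = GradientBoostingClassifier(random_state=5).fit(X, y)
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```

`feature_importances_` is normalized to sum to 1, so the printed values are directly comparable shares of the model's total split gain.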
- Research Article
- 10.11591/ijeecs.v39.i2.pp1130-1144
- Aug 1, 2025
- Indonesian Journal of Electrical Engineering and Computer Science
Effective spam detection is essential for data security, user experience, and organizational trust. However, outliers and class imbalance can impact machine learning models for spam classification. Previous studies focused on feature selection and ensemble learning but have not explicitly examined their combined effects. This study evaluates the performance of random forest (RF), gradient boosting (GB), and extreme gradient boosting (XGBoost) under four experimental scenarios: (i) without synthetic minority over-sampling technique (SMOTE) and outliers, (ii) without SMOTE but with outliers, (iii) with SMOTE and without outliers, and (iv) with SMOTE and with outliers. Results show that XGBoost achieves the highest accuracy (96%), an area under the curve-receiver operating characteristic (AUCROC) of 0.9928, and the fastest computation time (0.6184 seconds) under the SMOTE and outlier-free scenario. Additionally, RF attained an AUCROC of 0.9920, while GB achieved 0.9876 but required more processing time. These findings emphasize the need to address class imbalance and outliers in spam detection models. This study contributes to developing more robust spam filtering techniques and provides a benchmark for future improvements. By systematically evaluating these factors, it lays a foundation for designing more effective spam detection frameworks adaptable to real-world imbalanced and noisy data conditions.
- Research Article
- 10.3390/s25185628
- Sep 9, 2025
- Sensors (Basel, Switzerland)
This study investigates the feasibility of using wearable technology and machine learning algorithms to predict academic performance from physiological signals. It also examines the correlation between stress levels, reflected in the collected physiological data, and academic outcomes. To this end, six key physiological signals (skin conductance, heart rate, skin temperature, electrodermal activity, blood volume pulse, and inter-beat interval), along with accelerometer data, were recorded during three examination sessions using a wearable device. A novel pipeline comprising data preprocessing and feature engineering is proposed to prepare the collected data for training machine learning algorithms. We evaluated five machine learning models, including Random Forest, Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), and Gradient Boosting Machine (GBM), to predict exam outcomes. The Synthetic Minority Oversampling Technique (SMOTE), followed by hyperparameter tuning and dimensionality reduction, was implemented to optimise model performance and address issues such as class imbalance and overfitting. Our results demonstrate that physiological signals can effectively predict stress and its impact on academic performance, offering potential for real-time monitoring systems that support student well-being and academic success.
- Research Article
- 10.52783/jisem.v10i30s.4837
- Mar 29, 2025
- Journal of Information Systems Engineering and Management
Chronic Kidney Disease (CKD) is a degenerative disorder that poses a major worldwide health threat, frequently resulting in severe complications if not recognized early. Traditional diagnostic methods can be time-consuming, resource-intensive, and subject to human error. With the growth of artificial intelligence in healthcare, deep learning algorithms have emerged as strong tools for accurately and efficiently detecting CKD. This study compares two major deep learning models, Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM), for the early detection and categorization of CKD. The models were tested with and without the Synthetic Minority Over-sampling Technique (SMOTE) to resolve data imbalances. Performance criteria such as accuracy, precision, recall, and F1-score were employed for evaluation. CNN with SMOTE had the best performance, with an accuracy of 99% and a precision of 99%. In contrast, LSTM with SMOTE achieved 91% accuracy and 89% precision. A table summarizes overall model performance and shows class-wise accuracy for detecting Normal, Cyst, Stone, and Tumor instances, with CNN with SMOTE outperforming LSTM with SMOTE in all classes. Our data demonstrate the efficacy of CNN, particularly when paired with SMOTE, in reaching high diagnostic accuracy.
- Research Article
- 10.3390/healthcare13202588
- Oct 14, 2025
- Healthcare
Background: Class imbalance and limited interpretability remain major barriers to the clinical adoption of machine learning in diabetes prediction. These challenges often result in poor sensitivity to high-risk cases and reduced trust in AI-based decision support. This study addresses these limitations by integrating SMOTE-based resampling with SHAP-driven explainability, aiming to enhance both predictive performance and clinical transparency for real-world deployment. Objective: To develop and validate an interpretable machine learning framework that addresses class imbalance through advanced resampling techniques while providing clinically meaningful explanations for enhanced decision support. This study serves as a methodologically rigorous proof-of-concept, prioritizing analytical integrity over scale. While based on a computationally feasible subset of 1500 records, future work will extend to the full 100,000-patient dataset to evaluate scalability and external validity. We used the publicly available, de-identified Diabetes Prediction Dataset hosted on Kaggle, which is synthetic/derivative and not a clinically curated cohort. Accordingly, this study is framed as a methodological proof-of-concept rather than a clinically generalizable evaluation. Methods: We implemented a robust seven-stage pipeline integrating the Synthetic Minority Oversampling Technique (SMOTE) with SHapley Additive exPlanations (SHAP) to enhance model interpretability and address class imbalance. Five machine learning algorithms—Random Forest, Gradient Boosting, Support Vector Machine (SVM), Logistic Regression, and XGBoost—were comparatively evaluated on a stratified random sample of 1500 patient records drawn from the publicly available Diabetes Prediction Dataset (n = 100,000) hosted on Kaggle. 
To ensure methodological rigor and prevent data leakage, all preprocessing steps—including SMOTE application—were performed within the training folds of a 5-fold stratified cross-validation framework, preserving the original class distribution in each fold. Model performance was assessed using accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, F1-score, and precision. Statistical significance was determined using McNemar’s test, with p-values adjusted via the Bonferroni correction to control for multiple comparisons. Results: The Random Forest-SMOTE model achieved superior performance with 96.91% accuracy (95% CI: 95.4–98.2%), AUC of 0.998, sensitivity of 99.5%, and specificity of 97.3%, significantly outperforming recent benchmarks (p < 0.001). SHAP analysis identified glucose (SHAP value: 2.34) and BMI (SHAP value: 1.87) as primary predictors, demonstrating strong clinical concordance. Feature interaction analysis revealed synergistic effects between glucose and BMI, providing actionable insights for personalized intervention strategies. Conclusions: Despite promising results, further validation of the proposed framework is required prior to any clinical deployment. At this stage, the study should be regarded as a methodological proof-of-concept rather than a clinically generalizable evaluation. Our framework successfully bridges algorithmic performance and clinical applicability. It achieved high cross-validated performance on a publicly available Kaggle dataset, with Random Forest reaching 96.9% accuracy and 0.998 AUC. These results are dataset-specific and should not be interpreted as clinical performance. External, prospective validation in real-world cohorts is required prior to any consideration of clinical deployment, particularly for personalized risk assessment in healthcare systems.
- Research Article
- 10.3390/cancers17233853
- Nov 30, 2025
- Cancers
Background and Objective: Prostate cancer remains one of the most prevalent and potentially lethal malignancies among men worldwide, and timely and accurate diagnosis, along with the stratification of patients by disease severity, is critical for personalized treatment and improved outcomes. One of the tools used for diagnosis is bioinformatics. However, traditional biomarker discovery methods often lack transparency and interpretability, which means that clinicians find it difficult to trust biomarkers for application in a clinical setting. Methods: This paper introduces a novel approach that leverages Explainable Machine Learning (XML) techniques to identify and prioritize biomarkers associated with different levels of severity of prostate cancer. The proposed XML approach incorporates traditional machine learning (ML) algorithms with transparent models to facilitate understanding of feature importance in bioinformatics analysis, allowing for more informed clinical decisions. The method implements several ML classifiers, such as Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), and Bagging (Bg), followed by Shapley values for the XML pipeline. For pre-processing, missing values were imputed; SMOTE (Synthetic Minority Oversampling Technique) and the Tomek link method were applied to handle the class imbalance problem. Stratified k-fold validation of the ML models and SHAP values (SHapley Additive exPlanations) were used for explainability. Results: This study utilized a novel tissue microarray dataset of 102 patients comprising prostate cancer and healthy cases. The proposed model satisfactorily identifies genes as biomarkers, with the highest accuracy obtained being 81.01% using RF.
The top 10 potential biomarkers identified in this study are DEGS1, HPN, ERG, CFD, TMPRSS2, PDLIM5, XBP1, AJAP1, NPM1 and C7. Conclusions: As XML continues to unravel the complexities within prostate cancer datasets, the identification of severity-specific biomarkers is poised at the forefront of precision oncology. This integration paves the way for targeted interventions, improving patient outcomes, and heralding a new era of individualized care in the fight against prostate cancer.
- Research Article
- 10.1093/humrep/deaf097.389
- Jun 1, 2025
- Human Reproduction
Study question: What is the impact of using the Synthetic Minority Over-sampling Technique with machine learning for predicting sperm retrieval outcomes in non-obstructive azoospermia patients undergoing microTESE? Summary answer: Logistic Regression outperformed other models in predicting sperm retrieval outcomes in non-obstructive azoospermia patients undergoing microTESE, achieving high accuracy, sensitivity, specificity, and AUC. What is known already: Current research suggests that predicting sperm retrieval success in non-obstructive azoospermia (NOA) patients undergoing microTESE remains challenging. Clinical factors such as FSH levels, testicular volume, and histology are often correlated with outcomes, but prediction accuracy is limited by class imbalance, with fewer unsuccessful retrieval cases. Machine learning (ML) models have shown promise in other medical fields, and the Synthetic Minority Over-sampling Technique (SMOTE) has been used to address class imbalance in imbalanced datasets. Study design, size, duration: This analytical study used retrospective data collected from 114 NOA patients who underwent microTESE at the tertiary IVF center of Vietnam Military Medical University, spanning from January 2018 to January 2020. The data included clinical parameters, endocrine profiles, and histopathological findings. The retrospective study was approved by the Institutional Review Board (IRB). Participants/materials, setting, methods: A total of 17 attributes were extracted from the patient data and divided into a 7:3 ratio for training and testing the ML models. The study employed nine ML classification algorithms: Decision Tree, Logistic Regression, Random Forest, XGBoost, AdaBoost, Support Vector Classifier (SVC), Gaussian Naive Bayes, K-Nearest Neighbors (KNN), and Gradient Boosting. Because the overall SRR was 43 in 115 patients, SMOTE was applied to the training dataset to address class imbalance and overfitting.
Main results and the role of chance: Among the models, Logistic Regression exhibited the best performance with an accuracy of 82.35%, high sensitivity (83.33%), specificity (81.82%), precision (71.43%), F1 score (76.92%), balanced accuracy (82.58%), MCC (0.63), and AUC (0.90), making it the most reliable for predicting sperm retrieval outcomes. Random Forest and XGBoost showed perfect precision (1.00) but struggled with recall (16.67% and 41.67%, respectively), reflecting challenges in detecting unsuccessful retrievals. These models also had low F1 scores (28.57% for Random Forest and 58.82% for XGBoost), highlighting the need for better sensitivity. KNN and Gradient Boosting demonstrated moderate performance with balanced metrics and acceptable AUC values (0.77 and 0.66, respectively). In contrast, Gaussian Naive Bayes performed poorly overall with low accuracy (41.18%), precision (36.67%), and AUC (0.53), underscoring its inadequacy for this task. These findings suggest that Logistic Regression is the optimal model for clinical prediction in microTESE, though further refinement of other models is warranted to improve sensitivity, F1 score, and overall performance. Limitations, reasons for caution: The sample size of 114 patients, while providing valuable insights, may not be large enough to generalize the findings to a broader population. While SMOTE was applied to address class imbalance, it does not fully eliminate the risk of overfitting, especially with certain models like Random Forest and XGBoost. Wider implications of the findings: The use of SMOTE to address class imbalance enhances model performance in clinical prediction, particularly with limited or imbalanced datasets like NOA patients undergoing microTESE. This approach can improve predictions in locally sourced specialized clinical datasets, improving decision-making. Trial registration number: No
- Research Article
- 10.17576/jsm-2025-5406-17
- Jun 30, 2025
- Sains Malaysiana
An accurate and trustworthy prediction model is essential for supporting policy decisions in environmental management concerning water quality prediction. Nonetheless, imbalanced datasets are prevalent in this discipline and hinder the accurate identification of crucial ecological factors. This study proposed a novel SMOTE-PCADBSCAN model to enhance the categorisation of water quality data by employing three key components: (i) the synthetic minority over-sampling technique (SMOTE), (ii) principal component analysis (PCA), and (iii) density-based spatial clustering of applications with noise (DBSCAN). The minority class was initially augmented using SMOTE, after which PCA reduced the dimensionality. Subsequently, DBSCAN was utilised to generate superior-quality synthetic data by detecting and eliminating extraneous data points. A Malaysia-based multi-class water quality dataset was employed to evaluate the efficiency of this model. Four versions of the dataset (Original, SMOTE, SMOTE-DBSCAN, and SMOTE-PCADBSCAN) were analysed with five classifier types: (i) decision tree, (ii) random forest, (iii) gradient boosting, (iv) adaptive boosting, and (v) extreme gradient boosting. Although the original datasets exhibited high accuracy, class imbalance occurred when detecting minority classes. Among the datasets, the metric performances of the SMOTE-DBSCAN- and SMOTE-PCADBSCAN-based synthetic datasets were superior. The highest accuracy and optimal F1 scores were also demonstrated by RF using the SMOTE-PCADBSCAN approach, which presented excellent water quality classification and imbalanced-data management. Consequently, the classification accuracy of imbalanced environmental datasets could be enhanced by employing advanced oversampling techniques and ensemble approaches.
- Research Article
- 10.46481/jnsps.2025.2385
- May 1, 2025
- Journal of the Nigerian Society of Physical Sciences
The prevalence of class imbalance is a common challenge in medical datasets, which can adversely affect the performance of machine learning models. This paper explores how several data imbalance mitigation techniques affect the performance of cardiovascular disease prediction. This study applied various data balancing techniques on a real-life cardiovascular disease (CVD) dataset of 1000 patient records with 14 features obtained from the University of Abuja Teaching Hospital Nigeria to address this problem. The data balancing techniques used include random under-sampling, Synthetic Minority Over-sampling Technique (SMOTE), Synthetic Minority Oversampling-Edited Nearest Neighbour (SMOTE-ENN), and the combination of SMOTE and Tomek Links undersampling (SMOTE-TOMEK). After applying these techniques, their performance was evaluated on seven machine learning models: Random Forest, XGBoost, LightGBM, Gradient Boosting, K-Nearest Neighbours, Decision Tree, and Support Vector Machine. The evaluation metrics used are precision, recall, F1-score, accuracy, and receiver operating characteristic-area under the curve (ROC-AUC). Learning curve plots were also used to showcase the impact of the different data balancing techniques on the challenges of overfitting and underfitting. The results showed that the application of data balancing techniques significantly enhances the performance of machine learning models in heart disease prediction and effectively addresses the challenges of overfitting and underfitting, with SMOTE-TOMEK yielding the best-balanced fit as well as the highest precision, recall, F1-score, accuracy of 92%, and ROC-AUC of 96% on the Light Gradient Boosting Machine (LightGBM) model. These results underscore the critical role of data balancing in predictive modelling for heart disease and highlight the effectiveness of specific techniques and models in achieving accurate, more reliable, and generalised predictions.
- Research Article
- 10.5194/nhess-24-1913-2024
- Jun 6, 2024
- Natural Hazards and Earth System Sciences
Abstract. Landslides threaten human life and infrastructure, resulting in fatalities and economic losses. Monitoring stations provide valuable data for predicting soil movement, which is crucial in mitigating this threat. Accurately predicting soil movement from monitoring data is challenging due to its complexity and inherent class imbalance. This study proposes developing machine learning (ML) models with oversampling techniques to address the class imbalance issue and develop a robust soil movement prediction system. The dataset, comprising 2 years (2019–2021) of monitoring data from a landslide in Uttarakhand, has a 70:30 ratio of training and testing data. To tackle the class imbalance problem, various oversampling techniques, including the synthetic minority oversampling technique (SMOTE), K-means SMOTE, borderline-SMOTE, and adaptive SMOTE (ADASYN), were applied to the training dataset. Several ML models, namely random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), adaptive boosting (AdaBoost), category boosting (CatBoost), long short-term memory (LSTM), multilayer perceptron (MLP), and a dynamic ensemble, were trained and compared for soil movement prediction. A 5-fold cross-validation method was applied to optimize the ML models on the training data, and the models were tested on the testing set. Among these ML models, the dynamic ensemble model with K-means SMOTE performed the best in testing, with an accuracy, precision, and recall rate of 0.995, 0.995, and 0.995, respectively, and an F1 score of 0.995. Additionally, models without oversampling exhibited poor performance in training and testing, highlighting the importance of incorporating oversampling techniques to enhance predictive capabilities.
- Research Article
- 10.54254/2755-2721/2025.ld26477
- Aug 26, 2025
- Applied and Computational Engineering
The global cost of credit card fraud continues to rise, driven by increasingly concentrated and sophisticated attacks. This situation underscores the necessity for more effective detection and prevention methods. In response to the growing need for better fraud detection and prevention, machine learning has witnessed significant advancements in recent years. This paper provides an overview and comparison of various models. On one hand, there are traditional supervised learning models, such as Logistic Regression, Decision Trees, and Support Vector Machines (SVM). On the other hand, ensemble methods like Random Forest, Gradient Boosting, and XGBoost are also covered. Given the highly imbalanced nature of credit card fraud datasets, the study also examines the impact of the Synthetic Minority Over-sampling Technique (SMOTE) on classification performance. While SMOTE has been shown to improve a model's performance for weaker classifiers, its benefits for advanced ensemble methods remain less clear. Consequently, this paper identifies which models benefit most from oversampling and assesses whether high-performing classifiers can mitigate the effects of imbalance without the need for data augmentation. When comparing the models' performances, Random Forest and XGBoost demonstrated superior performance both with and without SMOTE. Without SMOTE, two models, Logistic Regression and SVM, yielded high accuracy but near-zero performance on key classification metrics, highlighting their inability to effectively detect minority class instances.
- Research Article
- 10.52783/pmj.v35.i4s.4631
- Mar 29, 2025
- Panamerican Mathematical Journal
This study explores e-commerce revenue prediction by applying cutting-edge machine learning techniques and Explainable AI (XAI) frameworks. The class imbalance in the online_shoppers_intention dataset was treated using the Synthetic Minority Over-sampling Technique (SMOTE). The performance of the various models, such as XGBoost (XGBst), Random Forest (RndF), Logistic Regression (L-Reg), Support Vector Machine (SupVM), Decision Tree (D-Tree), k-Nearest Neighbors (kNeigh), Gradient Boosting (GradBst), and a Voting Classifier (VotClf) ensemble, was extensively investigated using various performance metrics. GridSearchCV hyperparameter tuning was employed along with feature scaling and cross-validation to achieve optimal model performance. The results were compared with and without the application of SMOTE. The RndF classifier with SMOTE gave the best results: an accuracy of 92.45%, precision of 91.02%, recall of 94.22%, and F1-score of 92.59%; an AUC-ROC of 97.88% was noted without SMOTE. An XAI model, SHAP, was employed to make the classification model transparent and identify the features contributing to revenue generation.
- Research Article
- 10.1186/s12889-025-24657-1
- Oct 15, 2025
- BMC Public Health
Background: Physical activity is a key focus in the field of public health, and subjective life expectancy is closely associated with individuals’ physical and psychological well-being. This study aimed to identify the risk factors for subjective life expectancy among middle-aged and older adults with active and inactive physical activity levels, and to provide an evidence base for developing differentiated health intervention strategies. Methods: Based on data from the China Health and Retirement Longitudinal Study (CHARLS) 2018 survey, a total of 10,945 participants were included. Five machine learning models, including Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were separately constructed for the active and inactive groups. To reduce bias caused by class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to generate synthetic samples for the minority class. The dataset was split into a training set (70%) and a testing set (30%), and ten-fold cross-validation combined with grid search was employed to optimize hyperparameters, ensuring both robustness and generalizability of the models. Model performance was evaluated using the Area Under the Receiver Operating Characteristic Curve (AUC), accuracy, sensitivity, specificity, and F1-score. Results: The active group (4,707 men and 4,885 women) had a mean age of 59.76 years, while the inactive group (662 men and 691 women) had a mean age of 63.00 years. The Support Vector Machine (SVM) model achieved the best performance in the inactive group (AUC: 0.797; accuracy: 0.722; sensitivity: 0.747), whereas the Light Gradient Boosting Machine (LightGBM) model achieved the best performance in the active group (AUC: 0.775; accuracy: 0.745; specificity: 0.814). Feature importance analysis indicated that “age” was the most important variable in the SVM model, while “perceived health” was the most important variable in the LightGBM model. Conclusion: Machine learning methods can effectively identify key risk factors influencing subjective life expectancy among middle-aged and older adults, and provide valuable guidance for targeted health management strategies tailored to populations with different levels of physical activity.
- Research Article
- 10.59395/ijadis.v6i3.1465
- Dec 3, 2025
- International Journal of Advances in Data and Information Systems
The rapid expansion of the Internet of Things (IoT) ecosystem has increased its susceptibility to cyberattacks, creating a critical need for reliable Intrusion Detection Systems (IDS). However, IDS performance is often hindered by severe class imbalance, high-dimensional features, and similarities among attack behaviors. This study proposes an optimized XGBoost model enhanced with the Synthetic Minority Over-sampling Technique (SMOTE) and Principal Component Analysis (PCA) to address these challenges. A systematic grid-search procedure was employed to ensure transparency, reproducibility, and optimal hyperparameter selection. The original imbalance ratio of approximately 1:27 was successfully normalized to nearly 1:1 through SMOTE. The Gotham dataset used in this study consists of roughly 350,000 IoT traffic records across eight attack categories. Five data-splitting scenarios (50:50 to 90:10) were evaluated using stratified hold-out validation supported by k-fold cross-validation. The optimized model achieved 99.68% accuracy, while extremely high AUC values approaching 1.0 were carefully validated to eliminate potential data leakage. Naive Bayes, Logistic Regression, Support Vector Machine, and Deep Neural Network were included as baseline comparisons. The results demonstrate that combining SMOTE and PCA significantly improves model stability and generalization on imbalanced IoT traffic, confirming the effectiveness of the proposed XGBSP method.
- Research Article
- 10.1016/j.rineng.2024.103233
- Oct 24, 2024
- Results in Engineering
Data augmentation using SMOTE technique: Application for prediction of burst pressure of hydrocarbons pipeline using supervised machine learning models