Exploring the feature prioritization and data sampling of PCOS diagnosis via densely connected attention based squeeze deep learning detection model.
- Research Article
87
- 10.1016/j.compag.2022.106804
- Mar 1, 2022
- Computers and Electronics in Agriculture
An explainable XGBoost model improved by SMOTE-ENN technique for maize lodging detection based on multi-source unmanned aerial vehicle images
- Research Article
- 10.37365/jti.v11i2.448
- Nov 7, 2025
- Infotech: Journal of Technology Information
Polycystic Ovary Syndrome (PCOS) is one of the most common hormonal disorders experienced by women of reproductive age and can lead to various health problems, including menstrual irregularities, infertility, and an increased risk of metabolic diseases. Early detection of PCOS is essential to minimize long-term impacts and improve the quality of life for patients. This study aims to identify effective data preprocessing strategies to enhance the performance of classification models for PCOS detection. The dataset used is open source, consisting of 541 participants with 45 clinical and laboratory features. The main challenges encountered include the presence of many missing values, an imbalanced target class distribution, and a large number of independent features. To address these issues, a series of preprocessing steps were applied, including missing value imputation, data balancing using the Synthetic Minority Over-sampling Technique (SMOTE), and dimensionality reduction using Principal Component Analysis (PCA). A classification model was built using the Random Forest algorithm, and its performance was compared before and after applying PCA. The evaluation results show that before PCA, the model achieved an accuracy of 87.5%, precision of 86%, recall of 86%, and an F1-score of 86%. After applying PCA, performance improved to an accuracy of 90%, precision of 89%, recall of 89%, and an F1-score of 89%. These findings indicate that the right combination of preprocessing strategies, particularly SMOTE and PCA, can significantly improve the efficiency and effectiveness of models in detecting PCOS, thereby supporting the development of more reliable medical decision support systems.
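The balancing step named in this abstract, SMOTE, creates new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that idea, not the study's code: the function name `smote_sample`, its parameters, and the toy points are all hypothetical, and a maintained library implementation (for example `SMOTE` in imbalanced-learn) would normally be used instead.

```python
import random

def smote_sample(minority, k=3, n_new=4, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation position in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Hypothetical two-feature minority class (illustrative values only)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority points, it always stays inside the region the minority class already occupies.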
- Research Article
94
- 10.3390/diagnostics13081506
- Apr 21, 2023
- Diagnostics
Polycystic ovary syndrome (PCOS) has been classified as a severe health problem common among women globally. Early detection and treatment of PCOS reduce the possibility of long-term complications, such as an increased chance of developing type 2 diabetes and gestational diabetes. Therefore, effective and early PCOS diagnosis will help healthcare systems to reduce the disease’s problems and complications. Machine learning (ML) and ensemble learning have recently shown promising results in medical diagnostics. The main goal of our research is to provide model explanations to ensure efficiency, effectiveness, and trust in the developed model through local and global explanations. Feature selection methods are combined with different types of ML models (logistic regression (LR), random forest (RF), decision tree (DT), naive Bayes (NB), support vector machine (SVM), k-nearest neighbor (KNN), XGBoost, and AdaBoost) to obtain the optimal feature subset and the best model. Stacking ML models that combine the best base ML models with a meta-learner are proposed to improve performance. Bayesian optimization is used to optimize the ML models. Combining SMOTE (Synthetic Minority Oversampling Technique) and ENN (Edited Nearest Neighbour) addresses the class imbalance. The experiments used a benchmark PCOS dataset with two train/test split ratios, 70:30 and 80:20. The results showed that the stacking ML model with REF feature selection recorded the highest accuracy, 100%, compared to the other models.
- Research Article
20
- 10.3390/app14219772
- Oct 25, 2024
- Applied Sciences
Predicting survival outcomes in critical accidents has been a focal point in machine learning research. This study addresses several limitations of existing methods, including insufficient management of data imbalance, lack of emphasis on hyperparameter tuning, and proneness to overfitting. Many existing models struggle to generalize effectively on imbalanced datasets or depend on default hyperparameter settings, resulting in biased predictions. By integrating Principal Component Analysis (PCA), hyperparameter optimization, and resampling methods, as well as combining Edited Nearest Neighbors (ENN) with the Synthetic Minority Oversampling Technique (SMOTE), the model significantly improves predictive accuracy and model generalization. An ensemble model combining seven machine learning algorithms—Logistic Regression, Support Vector Machine, KNN, Random Forest, XGBoost, LightGBM, and CatBoost—was applied to predict survival outcomes. Stochastic Weighted Averaging (SWA) was applied to mitigate overfitting and enhance generalization. The accuracy increased from 91.97% to 94.89% after SWA was applied in this specific scenario. The combination of PCA-based dimensionality reduction, hyperparameter tuning, and resampling techniques (ENN + SMOTE) ensured the model handled data imbalance and optimized predictive accuracy. The final model demonstrated excellent performance, with Area Under the Curve (AUC) and Average Precision (AP) values both reaching 0.98, indicating high accuracy and precision. These improvements were validated using the Titanic dataset in a binary classification problem of predicting passenger survival. The results emphasize that ensemble learning, enhanced by SWA, offers a powerful framework for handling imbalanced and complex datasets, providing significant advancements in predictive modeling accuracy. This study provides insights into how machine learning techniques can be effectively combined to solve classification challenges in real-world scenarios.
- Research Article
- 10.21271/zjpas.37.6.13
- Dec 31, 2025
- Zanco Journal of Pure and Applied Sciences
Securing Internet of Things (IoT) networks is an ongoing challenge. As more devices with limited resources connect to the internet, these systems have become more vulnerable to cyberattacks. Many attacks continually evolve and become more sophisticated. This highlights the need for scalable, efficient anomaly detection deployable close to IoT devices to minimize latency, while maintaining high accuracy with low memory and computational demands. Many solutions have been proposed for this problem, but they are either heavy models unsuitable for edge devices or lack generalizability to recent datasets and current attack traffic patterns. Our research proposes a lightweight anomaly detection model that combines a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) model to recognize patterns across both spatial and temporal dimensions and to identify significant relationships among an interpretable selected set of features, with SHapley Additive exPlanations (SHAP) for feature selection and Synthetic Minority Oversampling Technique - Edited Nearest Neighbors (SMOTE-ENN) for balancing the distribution of classes in the datasets. The model’s performance was evaluated using accuracy, precision, recall, and F1 score. An accuracy of 99.12% for multiclass classification is achieved on the CICIoT2023 dataset, and 99.08% on the TON_IoT dataset. With 10 selected features, the model achieved 99.0% and 98.85% on the CICIoT2023 and TON_IoT datasets, respectively. With just 43,406 trainable parameters and the top 10 selected features, the proposed framework offers a lightweight, explainable model that is effective for edge IoT devices with limited resources.
- Research Article
- 10.1007/s41870-025-03044-4
- Dec 27, 2025
- International Journal of Information Technology
Polycystic Ovary Syndrome (PCOS) is a common endocrine condition that needs accurate diagnosis for effective management. It involves the presence of numerous immature follicles in the ovaries, which can interfere with healthy ovulation and lead to hormonal imbalances and other health issues. Consequently, it is essential to establish a PCOS detection system that is both precise and timely to lower complications. In the current literature, Machine Learning (ML) models have demonstrated their efficacy in detecting PCOS. However, the accurate and early detection of PCOS requires the precise identification of key features. This paper proposes a hybrid framework for PCOS prediction that combines ensemble learning and feature selection. The proposed methodology integrates Genetic Algorithm (GA), Mutual Information (MI), and Boruta feature selection techniques to identify the most informative clinical and hormonal features. In addition, to facilitate a comparative evaluation of prediction performance, a variety of base and ensemble classifiers were trained with selected features. The hybrid feature set improved diagnostic accuracy and generalizability across models, establishing a comprehensible and effective method for PCOS identification that is suitable for clinical decision support. Additionally, SHAP-based feature interpretation is performed to assess the contributions of each feature. The proposed method is evaluated on a publicly available PCOS dataset. It exhibits superior performance compared to several existing approaches, achieving an accuracy of over 94% on all different combinations of feature sets and XGBoost.
- Research Article
1
- 10.36893/iej.2024.v53i12.022
- Jan 1, 2024
- Industrial Engineering Journal
The healthcare fraud detection field is constantly evolving and faces significant challenges, particularly when addressing imbalanced data issues. Previous studies mainly focused on traditional machine learning (ML) techniques, often struggling with imbalanced data. This problem arises in various aspects. It includes the risk of overfitting with Random Oversampling (ROS), noise introduction by the Synthetic Minority Oversampling Technique (SMOTE), and potential crucial information loss with Random Undersampling (RUS). Moreover, improving model performance, exploring hybrid resampling techniques, and enhancing evaluation metrics are crucial for achieving higher accuracy with imbalanced datasets. In this paper, we present a novel approach to tackle the issue of imbalanced datasets in healthcare fraud detection, with a specific focus on the Medicare Part B dataset. First, we carefully extract the categorical feature “Provider Type” from the dataset. This allows us to generate new, synthetic instances by randomly replicating existing types, thereby increasing the diversity within the minority class. Then, we apply a hybrid resampling method named SMOTE-ENN, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN). This method aims to balance the dataset by generating synthetic samples and removing noisy data to improve the accuracy of the models. We use six machine learning (ML) models to categorize the instances. When evaluating performance, we rely on common metrics like accuracy, F1 score, recall, precision, and the AUC-ROC curve. We highlight the significance of the Area Under the Precision-Recall Curve (AUPRC) for assessing performance in imbalanced dataset scenarios. The experiments show that Decision Trees (DT) outperformed all the classifiers, achieving a score of 0.99 across all metrics.
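The cleaning half of SMOTE-ENN, Edited Nearest Neighbours, can be illustrated with a small pure-Python sketch. It is a simplified variant written for illustration, not the paper's implementation: every sample whose label disagrees with the majority vote of its k nearest neighbours is dropped, regardless of class. The function name `edited_nearest_neighbours` and the toy data are assumptions.

```python
def edited_nearest_neighbours(points, labels, k=3):
    """Keep only samples whose label matches the majority label of their
    k nearest neighbours (squared Euclidean distance)."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Rank every other sample by its distance to p
        ranked = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        votes = [labels[j] for j in ranked[:k]]
        majority = max(set(votes), key=votes.count)
        if majority == y:
            kept_points.append(p)
            kept_labels.append(y)
    return kept_points, kept_labels

# Toy data (hypothetical): two clean clusters plus one class-1 sample
# sitting inside the class-0 cluster, as a noisy synthetic point might.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (0.05, 0.05)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

kept_points, kept_labels = edited_nearest_neighbours(points, labels)
print(len(points), "->", len(kept_points))
```

In a SMOTE-ENN pipeline this filter would run after oversampling, so that synthetic samples landing in the other class's region are pruned along with noisy originals.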
- Conference Article
60
- 10.1109/ccwc51732.2021.9375994
- Jan 27, 2021
PolyCystic Ovary Syndrome (PCOS) is one of the most common causes of female infertility, affecting a large number of women of reproductive age, even continuing far beyond the childbearing years. This hormonal disorder may further lead to the risk of other long-term complications. Considering the powerful recognition abilities of ensemble-based gradient boosting algorithms, particularly in the medical domain, we propose the use of Extreme Gradient Boosting, XGBoost, for early detection of PCOS. To support effective classification performance, we have resampled our data using a combination of SMOTE (Synthetic Minority Oversampling Technique) and ENN (Edited Nearest Neighbour) to address class imbalance and data outlier issues. Also, by exploiting popular statistical correlation methods, the ANOVA test and the Chi-Square test, we have identified the 23 most significant metabolic and clinical parameters that best classify PCOS conditions. Finally, we experimented with our model on a benchmark dataset collected from Kaggle to justify the effectiveness of our proposed findings, where the Extreme Gradient Boosting classifier outperformed all other classifiers with an overall 10-fold cross-validation score of 96.03%, along with a 98% recall in the detection of patients not having PCOS, which outperforms all the existing recent methods in which the numerical data-driven diagnosis of PCOS has been studied on this particular dataset.
- Research Article
2
- 10.1002/cem.70029
- Apr 20, 2025
- Journal of Chemometrics
In critical domains including medicinal chemistry, biomedicine, metabolomics, and computational toxicology, class imbalance in datasets and poor recognition accuracy for minority classes remain persistent challenges. While previous studies have employed resampling and feature selection techniques to address data imbalance and enhance classification performance, most approaches have focused on single‐algorithm solutions rather than hybrid methodologies. Hybrid algorithms offer distinct advantages by integrating the strengths of multiple techniques, thereby providing more comprehensive and efficient solutions for handling imbalanced data. This study proposes HiBBKA, a novel hybrid algorithm combining radial‐based under‐sampling with SMOTE (RBU‐SMOTE) and an improved binary black‐winged kite algorithm (iBBKA) for feature selection. The proposed framework operates through two key phases: First, the RBU‐SMOTE resampling method synergistically integrates radial‐based under‐sampling (RBU) with the synthetic minority oversampling technique (SMOTE), effectively addressing class‐imbalance distribution while enhancing the quality of synthesized samples. Second, the enhanced iBBKA feature selection algorithm systematically identifies the most discriminative features critical for classification tasks. We comprehensively evaluate RBU‐SMOTE and HiBBKA using multiple classifiers across 16 imbalanced datasets, including real‐world medical datasets, with particular emphasis on the minority class performance. Experimental results demonstrate that RBU‐SMOTE achieves competitive performance compared to existing resampling methods, while the complete HiBBKA framework significantly outperforms state‐of‐the‐art algorithms in overall classification metrics, particularly in the minority class recognition.
- Research Article
1
- 10.35882/ijeeemi.v7i2.77
- Apr 23, 2025
- Indonesian Journal of Electronics, Electromedical Engineering, and Medical Informatics
Monkeypox is a zoonotic disease with increasing global prevalence, posing a significant challenge in healthcare. Its widespread transmission necessitates more accurate detection systems to assist medical professionals in diagnosing and managing cases effectively. One of the main challenges in developing monkeypox prediction models is class imbalance in datasets, which can cause models to favor the majority class and reduce predictive accuracy for rarer cases. To address this issue, this study evaluates the effectiveness of the SMOTEENN resampling technique in improving the classification performance of monkeypox cases. Three boosting algorithms, Gradient Boosting, XGBoost, and LightGBM, were applied to a monkeypox dataset consisting of 25,000 samples. The data preprocessing steps included handling missing values, feature encoding, and feature scaling. The dataset was then balanced using SMOTEENN, a hybrid technique combining the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbors (ENN). Additionally, hyperparameter tuning with GridSearchCV was performed to optimize model performance by systematically selecting the best parameter combinations. The results indicate that applying SMOTEENN significantly improved classification accuracy, achieving a maximum of 69%, with an F1-score of 67%. Compared to previous studies, the proposed approach demonstrated superior performance in handling class imbalance and enhancing classification robustness. These findings highlight the potential of SMOTEENN and boosting algorithms in medical data classification, particularly for infectious diseases with imbalanced datasets. This study contributes to the development of more reliable machine learning techniques for improving disease detection, classification accuracy, and overall model generalization. Future research should explore additional resampling techniques, deep learning architectures, and feature selection methods to further improve predictive performance in medical diagnostics.
- Research Article
9
- 10.3389/fphys.2025.1435036
- May 6, 2025
- Frontiers in physiology
In the domain of women's health, the intricate conditions of Polycystic Ovary Syndrome (PCOS) demand sophisticated methodologies for accurate identification and intervention. This study introduces an innovative machine learning framework tailored to precisely classify instances of PCOS. The methodology incorporates stacked learning and depends on the Adaptive Synthetic (ADASYN) algorithm, Synthetic Minority Over-sampling Technique (SMOTE), and random oversampling methods for addressing data imbalances. The BORUTA technique is used for feature selection, with the overarching objective of advancing precision and performance metrics in classification tasks. Within the scope of PCOS classification, the proposed framework achieves a commendable 97% accuracy. These results underscore the proficiency of the proposed framework in discriminating PCOS cases with a high degree of precision. Critical to this contribution is the rigorous comparative analysis against existing methodologies, affirming the superior accuracy and performance attributes of the proposed framework. This substantiates its potential as a transformative tool in medical classification. Moreover, beyond immediate applications, this paper explores the generalization of the proposed framework, demonstrating its adaptability and efficacy across different medical classifications. This versatility is exemplified by its successful application to cervical cancer, showcasing the framework's potential as a pioneering force in reshaping the landscape of machine-learning applications in healthcare diagnostics.
- Research Article
- 10.1186/s13048-026-01981-7
- Jan 20, 2026
- Journal of ovarian research
There has been little research on the association between exposure to environmental factors and polycystic ovary syndrome (PCOS), or on the interaction between environmental factors and liver and kidney function. Anti-mullerian hormone (AMH) has been proposed to add significance to the diagnosis of PCOS in cases of ambiguity. We hypothesize that long-term inhalation exposure to environmentally relevant levels of these factors may induce changes in hepatic and renal function, thereby exacerbating the risk of developing PCOS. The study used a cross-sectional design. Cases were newly diagnosed PCOS patients from a tertiary hospital. Controls were age- and BMI-matched healthy women recruited from the same communities. Data on age and various blood test results were collected from medical records. Meteorological factors and air pollutants were obtained from the National Oceanic and Atmospheric Administration (NOAA). After feature selection, we employed logistic regression, weighted quantile sum (WQS) regression, and neural network models to analyze the associations between relevant variables and the risk characteristics and prediction of PCOS, including across different age groups. There were 384 subjects in this retrospective study, randomly including 178 PCOS patients and 206 controls. The levels of most sexual function (FSH, LH, PRL, T, AMH) and liver function indicators (TP, Alb, A/G, ALP, PA, TBA) in PCOS patients were significantly higher than those in the control group. Overall, the AMH level in the PCOS population was 1.133 times that of the non-affected population (95% confidence interval [CI]: 1.077, 1.192). Within the 21-35 years age group, the levels of air pressure and albumin in PCOS patients were 1.060 (95% CI: 1.028, 1.093) and 1.098 (95% CI: 1.002, 1.204) times higher, respectively, than in the non-affected population.
Based on the results obtained from the stratified analysis, we incorporated several variables into the prediction model, namely PM₂.₅, air pressure, FSH, PRL, T, AMH, Alb and PA. The overall population demonstrated good PCOS predictive performance in internal validation using the neural network model (test AUC = 0.864, train AUC = 0.992; test R² = 0.342, train R² = 0.910). Significant elevations in levels of AMH and Alb were detected in women with PCOS. The back-propagation (BP) neural network demonstrated good PCOS predictive performance for the models mediated by environmental factors (PM₂.₅, air pressure). This suggests that these factors may exacerbate the effects of sexual function (FSH, PRL, T, AMH) and liver function indicators (Alb, PA) on the risk of developing PCOS. Our results support a potential association between environmental factor exposure and the consequences of PCOS in women.
- Research Article
- 10.25139/inform.v10i1.9231
- Feb 1, 2025
- Inform : Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi
Polycystic Ovary Syndrome (PCOS) is a hormonal disorder affecting women of reproductive age, with a global prevalence rate of 8–13%. However, approximately 70% of cases remain undiagnosed. This study aimed to develop and compare eight Random Forest classification models for PCOS detection using a publicly available Kaggle dataset. The methodology incorporated three key preprocessing techniques: outlier handling using the Interquartile Range (IQR) method, feature selection through Mutual Information, and class imbalance handling via SMOTE-Tomek. The results revealed that the best-performing model, which applied outlier removal and SMOTE without feature selection, achieved an accuracy of 94.11%. This result significantly outperformed the baseline Random Forest model, which achieved an accuracy of 87.27% without the application of any preprocessing techniques, such as outlier removal, SMOTE, or feature selection. Moreover, the model utilizing only SMOTE for class balancing achieved an accuracy of 93.84%, underscoring the importance of addressing class imbalance in enhancing classification performance. Notably, feature selection did not consistently improve accuracy, as Random Forest inherently handles feature redundancy, capturing complex feature interactions. These findings highlight the importance of tailored preprocessing strategies, particularly outlier handling and class balancing, for optimizing medical data classification. Future research should explore clinically informed feature selection techniques and assess the generalizability of these findings across diverse datasets to enhance the clinical relevance of PCOS detection models.
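The IQR rule mentioned in this abstract can be sketched in a few lines of standard-library Python. This is an illustrative sketch under common assumptions, not the study's code: the 1.5 multiplier is the conventional default rather than a value stated in the abstract, and the function names and sample values are hypothetical.

```python
import statistics

def iqr_bounds(values, factor=1.5):
    """Return the (low, high) fences of the interquartile-range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def remove_outliers(values, factor=1.5):
    """Drop every value lying outside the IQR fences."""
    low, high = iqr_bounds(values, factor)
    return [v for v in values if low <= v <= high]

# Hypothetical measurements for one clinical feature, with one extreme value
feature = [21, 22, 23, 24, 25, 26, 27, 95]
cleaned = remove_outliers(feature)
print(cleaned)
```

On a full dataset the same fences would be computed per feature, and a row would typically be dropped (or its value capped) when any feature falls outside them.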
- Research Article
3
- 10.56919/usci.2123.011
- Mar 30, 2023
- UMYU Scientifica
Stroke is a serious cause of death globally. Early prediction of the disease can save many lives, but most clinical datasets, including the stroke dataset, are imbalanced, which biases predictive algorithms towards the majority class. The objective of this research is to compare different data resampling algorithms on the stroke dataset to improve the prediction performance of machine learning models. This paper considered five (5) resampling algorithms, namely Random Over Sampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic (ADASYN), and the hybrid techniques SMOTE with Edited Nearest Neighbor (SMOTE-ENN) and SMOTE with Tomek Links (SMOTE-TOMEK), and trained six (6) machine learning classifiers, namely Logistic Regression (LR), Decision Tree (DT), K-nearest Neighbor (KNN), Support Vector Machines (SVM), Random Forest (RF), and XGBoost (XGB). The hybrid technique SMOTE-ENN improved the machine learning classifiers the most, followed by the SMOTE technique, while the combination of SMOTE and XGB performed best, with an accuracy of 97.99%, a G-mean score of 0.99, and an auc_roc score of 0.99. Resampling algorithms balanced the dataset and enhanced the predictive power of the machine learning algorithms. Therefore, we recommend resampling the stroke dataset when predicting stroke rather than modeling on the imbalanced dataset.
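The G-mean reported here is the geometric mean of sensitivity and specificity, which penalizes a classifier that does well on only one class. A minimal sketch of the computation follows; the function name `g_mean` and the toy labels are hypothetical and are not taken from the study.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Toy labels (hypothetical): 4 positive and 6 negative cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

score = g_mean(y_true, y_pred)
print(round(score, 4))
```

Here sensitivity is 3/4 and specificity is 5/6, so the G-mean is the square root of their product, while plain accuracy for the same predictions is 8/10.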