Articles published on Oversampling Technique
4258 Search results
- New
- Research Article
- 10.1016/j.jns.2026.125846
- Apr 1, 2026
- Journal of the neurological sciences
- Antonio Ianniello + 11 more
Predictors of short-term, relapse-independent progression in multiple sclerosis: A machine learning approach based on clinical data and conventional MRI-derived features.
- Research Article
- 10.3390/jrfm19030210
- Mar 11, 2026
- Journal of Risk and Financial Management
- Irvine Mapfumo + 1 more
Credit risk prediction is essential for financial institutions to effectively assess the likelihood of borrower defaults and manage associated risks. This study presents a comparative analysis of deep learning architectures and traditional machine learning models on imbalanced credit risk datasets. To address class imbalance, we employ three resampling techniques: Synthetic Minority Over-sampling Technique (SMOTE), Edited Nearest Neighbors (ENN), and the hybrid SMOTE-ENN. We evaluate the performance of various models, including multilayer perceptron (MLP), convolutional neural network (CNN), long short-term memory (LSTM), gated recurrent unit (GRU), logistic regression, decision tree, support vector machine (SVM), random forest, adaptive boosting, and extreme gradient boosting. The analysis reveals that SMOTE-ENN combined with MLP achieves the highest F1-score of 0.928 (accuracy 95.4%) on the German dataset, while SMOTE-ENN with random forest attains the best F1-score of 0.789 (accuracy 82.1%) on the Taiwanese dataset. SHapley Additive exPlanations (SHAP) are employed to enhance model interpretability, identifying key drivers of credit default. These findings provide actionable guidance for developing transparent, high-performing, and robust credit risk assessment systems.
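In the SMOTE-ENN hybrid named in the abstract above, oversampling is followed by an Edited Nearest Neighbours cleaning pass. The cleaning step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the toy points, and `k=3` are hypothetical, and this variant cleans every class rather than only the majority class.

```python
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority label
    of their k nearest neighbours (squared Euclidean distance)."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Indices of the k nearest other samples
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )[:k]
        majority_label = Counter(y[j] for j in others).most_common(1)[0][0]
        if majority_label == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: two clean clusters plus one mislabelled point
X = [(0, 0), (0, 1), (1, 0), (1, 1),      # class 0 cluster
     (5, 5), (5, 6), (6, 5), (6, 6),      # class 1 cluster
     (0.5, 0.5)]                          # labelled 1, sits inside class 0
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

X_clean, y_clean = edited_nearest_neighbours(X, y, k=3)
```

On this toy set the point at `(0.5, 0.5)` is dropped because all of its nearest neighbours belong to the other class, while both clusters survive intact.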
- Research Article
- 10.1080/24751839.2026.2640249
- Mar 10, 2026
- Journal of Information and Telecommunication
- Antonio Villafranca + 3 more
Intrusion detection in Internet of Things (IoT) environments presents challenges due to the diversity of connected devices and their resource limitations. IoT networks generate complex, imbalanced traffic where benign activity predominates over attack instances. This imbalance hampers the performance of traditional intrusion detection systems, which struggle to generalize effectively. In this study, we present a deep neural network-based system that leverages advanced data balancing techniques – such as subsampling, Synthetic Minority Over-sampling Technique (SMOTE), and Tomek Links – combined with cross-validation to enhance the model’s generalization and minimize overfitting. Evaluations on CICIDS2017, UNSW-NB15, and BoT-IoT datasets showed accuracy rates of 99.2%, 99.7%, and 99.8%, respectively. These results demonstrate that our methodology outperforms traditional models, especially in detecting minority attack classes, which were previously challenging due to data imbalance. The use of data balancing and cross-validation significantly improved model stability and sensitivity to diverse attack scenarios. Our findings suggest that incorporating these techniques can substantially enhance the security of IoT environments, providing a robust approach for differentiating between normal and malicious activities, thus contributing to more reliable and scalable intrusion detection systems.
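Tomek Links, one of the balancing techniques listed in the abstract above, are pairs of samples from opposite classes that are each other's nearest neighbour. A minimal sketch of how such pairs might be detected is shown below. The helper names and the toy data are hypothetical, and the abstract does not say how the authors use the detected links.

```python
def nearest(i, X):
    """Index of the sample closest to X[i] (squared Euclidean distance)."""
    return min(
        (j for j in range(len(X)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
    )

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    nn = [nearest(i, X) for i in range(len(X))]
    return [
        (i, j) for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]

# Hypothetical toy data: samples 2 and 3 straddle the class boundary
X = [(0, 0), (0, 1), (2.0, 0), (2.2, 0), (4, 0), (4, 1)]
y = [0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
```

A common cleaning choice is to drop the majority-class member of each link, or both members, which removes points sitting on the class boundary.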
- Research Article
- 10.70917/ijcisim-2026-1254
- Mar 5, 2026
- International Journal of Computer Information Systems and Industrial Management Applications
- Sri Rupin Potula + 1 more
With lung cancer cases rising significantly worldwide among both men and women, effective lung cancer detection techniques are critically important. This paper addresses this urgency by employing deep learning techniques for lung cancer detection. Our study categorizes images into three distinct classes: benign, malignant, and normal cases. To ensure uniformity, images of varying sizes are standardized. To address the challenge of data imbalance, this paper employs the “Synthetic Minority Over-sampling Technique” (SMOTE) and further enhances image quality through Gaussian blur in the preprocessing phase. Subsequently, a “Convolutional Neural Network” (CNN) model named “ImageTriNet” is developed, and its performance is compared with that of transfer learning models. After 13 training epochs, the ImageTriNet model attains an accuracy of 0.98, precision of 0.99, recall of 0.96, and an F1-score of 0.97. This research contributes to ongoing efforts to leverage deep learning techniques for accurate and timely detection of lung cancer, showcasing the efficacy of the ImageTriNet model in this critical domain.
- Research Article
- 10.17780/ksujes.1852101
- Mar 3, 2026
- Kahramanmaraş Sütçü İmam Üniversitesi Mühendislik Bilimleri Dergisi
- Mehmet Reşat Öner + 2 more
This study aims to support reliable ECG signal interpretation by reducing human-dependent variability through computer-aided analysis methods. Machine learning and deep learning methods were employed to examine 2D ECG representations and Synthetic Minority Over-Sampling Technique (SMOTE)-based balancing in ECG classification. Unlike existing ECG classification studies that typically address signal representation and class imbalance separately, this study jointly investigates the interaction between two-dimensional QRS representation and SMOTE-based data balancing within a unified experimental framework, thereby providing a systematic analysis of their combined impact on classification performance. Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and K-Nearest Neighbors (KNN) algorithms were implemented and comparatively analyzed. ECG beats from record 108 of the MIT-BIH Arrhythmia dataset were represented in a vision-based form for classification. To address severe class imbalance, SMOTE was applied only to the training data, and its effect on two-dimensional ECG representations was explicitly examined. Normal and Abnormal heartbeats were classified using a stratified 5-fold cross-validation strategy. Experimental results demonstrated that the CNN model achieved the most successful performance after applying SMOTE, reaching a weighted average F1-score of 99.82% ± 0.002, highlighting the combined effectiveness of two-dimensional QRS representation and data balancing in improving automated ECG classification.
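The core SMOTE step referred to in the abstract above creates each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. A minimal plain-Python sketch follows. The function name, parameters, and toy data are hypothetical; in practice a library implementation would normally be used, and, as in the abstract, only training-fold data should be oversampled.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, other))
        )
    return synthetic

# Hypothetical toy data standing in for the minority class of ONE training fold
train_minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(train_minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. Applying the step inside each cross-validation fold keeps synthetic points derived from test samples out of training.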
- Research Article
- 10.64539/sjcs.v2i1.2026.378
- Mar 3, 2026
- Scientific Journal of Computer Science
- Moshood Abiola Hambali + 3 more
Intrusion Detection Systems (IDS) must cope with the ever-escalating sophistication of cyber threats. However, IDS performance is degraded by class imbalance and excessively high-dimensional features, which bias classifier training towards majority traffic patterns. This research therefore introduces a hybrid clustering IDS approach that uses MiniBatchKMeans clustering and ensemble machine learning strategies to mitigate these challenges. The proposed approach uses the Synthetic Minority Over-sampling Technique to address class imbalance, the Fast Correlation-Based Filter to reduce high-dimensional features, and the Hyperopt Tree-structured Parzen Estimator to optimize the parameters of the clustering step and the classifiers. Four supervised classifiers (Decision Tree, Random Forest, Extra Trees, and XGBoost) were trained and validated on the NSL-KDD IDS dataset. Experimental analysis indicated high detection accuracy for all classifiers, with the best-optimized XGBoost and Random Forest classifiers reaching 99.57% and 99.51% accuracy, respectively. The proposed clustering-optimized IDS approach provided substantial improvements in identifying minority-class attacks, along with sustainability and high generalization capability. These outcomes support the research premise that cluster-aware sampling and ensemble optimization are effective for designing more balanced, accurate, and adaptive IDS that protect against ever-escalating real-life cyber threats.
- Research Article
- 10.1038/s41598-026-39104-3
- Mar 3, 2026
- Scientific reports
- Santosh Kumar + 6 more
The identification of bipolar disorder (BD), a severe psychiatric condition characterized by recurrent mood fluctuations, remains challenging due to substantial inter-individual variability, symptom overlap with other mental disorders, and imbalanced clinical data. Delayed or inaccurate diagnosis often leads to inappropriate treatment strategies and adverse clinical outcomes, highlighting the need for reliable, data-driven decision-support tools. In this study, we propose a robust hybrid machine learning framework that integrates class balancing, latent subgroup discovery, and ensemble learning to improve the accuracy and consistency of BD identification from tabular clinical data. The framework applies the Synthetic Minority Over-sampling Technique (SMOTE) exclusively to the training data to address class imbalance, followed by Gaussian Mixture Model (GMM) based clustering to uncover latent patient subgroups and generate informative probabilistic features. These enriched features are subsequently used to train an optimized Extreme Gradient Boosting (XGBoost) classifier. Experimental evaluation on an independent test set demonstrates that the proposed model achieves 93% accuracy, 97% sensitivity (recall), 93% precision, 95% F1-score, and 79% specificity. When evaluated under identical experimental conditions, the proposed framework consistently outperforms baseline classifiers, including Support Vector Machine, Decision Tree, Logistic Regression, and Random Forest, with performance improvements ranging from 6 to 12%, depending on the comparator. The results indicate that combining SMOTE-based data balancing, GMM-driven latent feature enrichment, and gradient-boosted decision trees yields a scalable, interpretable, and clinically relevant decision-support system. This study supports the adoption of hybrid, data-driven approaches for early BD screening and personalized treatment planning in psychiatric healthcare settings.
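The five test-set figures reported in the abstract above (accuracy, sensitivity, precision, F1-score, specificity) all derive from the four cells of a binary confusion matrix. The sketch below shows that arithmetic. The function name is hypothetical, and the counts are invented for illustration; they are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)      # recall on the positive class
    specificity = tn / (tn + fp)      # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts, NOT the study's data
m = binary_metrics(tp=90, fp=10, tn=40, fn=10)
```

With these invented counts, sensitivity is 0.90 while specificity is 0.80. Reporting both matters on imbalanced data because accuracy alone can hide weak performance on the smaller class.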
- Research Article
- 10.1016/j.jms.2026.112082
- Mar 1, 2026
- Journal of Molecular Spectroscopy
- John T Allen + 3 more
Oversampling and feature selection techniques in binary molecular classification of VUV absorption spectra
- Research Article
- 10.1016/j.foodchem.2026.148061
- Mar 1, 2026
- Food chemistry
- In-Hwan Lee + 2 more
Avocado ripeness classification using handheld Raman spectroscopy: addressing data imbalance with machine learning and resampling techniques.
- Research Article
- 10.1016/j.compag.2026.111414
- Mar 1, 2026
- Computers and Electronics in Agriculture
- Ziheng Feng + 15 more
Utilization of synthetic minority oversampling technique and transfer learning for improving rice and wheat LAI estimation
- Research Article
- 10.1016/j.ress.2026.112583
- Mar 1, 2026
- Reliability Engineering & System Safety
- Fangyuan Tian + 5 more
An investigation of miners’ safety situational awareness across multiple contexts and human reliability assessment using an integrated Extreme Gradient Boosting and Borderline Synthetic Minority Over-sampling Technique
- Research Article
- 10.1016/j.neucom.2026.132643
- Mar 1, 2026
- Neurocomputing
- Xiaoying Liu + 14 more
KHOI-SMOTE: An efficient oversampling technique based on k-means clustering and h-outlyingness index for imbalanced medical data
- Research Article
- 10.3897/jucs.189356
- Feb 28, 2026
- JUCS - Journal of Universal Computer Science
- Christian Gütl
Dear Readers,

It gives me great pleasure to announce the second regular issue of 2026. I would like to thank all the authors for their sound research and the editorial board and guest reviewers for the extremely valuable reviews and suggestions for improvement. These contributions together with the support of the community enable us to run our journal and maintain its quality.

I would still like to expand our editorial board: If you are a tenured associate professor or above with a good publication record, please apply to join our editorial board. We are also interested in high-quality proposals for special issues on new topics and emerging trends.

In this regular issue, I am very pleased to present 6 accepted papers by 20 authors from 6 countries: Brazil, Germany, India, North Macedonia, Saudi Arabia, Türkiye.

Gustavo Lazarotto Schroeder, Wesllei Felipe Heckler, Rosemary Francisco, and Jorge Luis Victória Barbosa from Brazil address in their manuscript the growing problem of problematic smartphone use (PSU) by proposing OntoKratos, an ontology-based approach that models contextual, demographic, and mental health information to identify PSU and recommend personalized interventions through semantic reasoning. The research contributes a formal and reusable ontology with SWRL-based inference mechanisms, demonstrating through simulated data that OntoKratos effectively classifies PSU states, infers risk factors, and generates evidence-based intervention suggestions.

In a collaborative research between colleagues from North Macedonia and Germany, Aleksandar Velinov, Aleksandra Mileva, Simon Volpert, Sebastian Zillien, and Steffen Wendzel look into the steganographic analysis of different network protocols which becomes a necessary part of their security evaluation, to prevent their abuse as carriers of hidden messages. In this manuscript, twenty novel covert channels are identified in QUIC, with an accent on their transmission rate, undetectability, and robustness, suggested countermeasures, and one implemented covert channel as a proof-of-concept.

Hanan Hafiz and Maher Alharby from Saudi Arabia introduce in their work a study that aims to develop efficient machine learning models for detecting DDoS attacks in cloud environments by addressing challenges related to multi-tenant traffic patterns and virtualized infrastructure constraints. The main contributions of this study include binary and multiclass DDoS classification with feature selection, evaluation of model performance and computational efficiency, and mitigation of data imbalance using oversampling techniques.

Kausthav Pratim Kalita, Debojit Boro, and Dhruba Kumar Bhattacharyya from India investigate in their research the issue that big data platforms face limitations in centralized access control despite their distributed architecture and propose integrating blockchain technology using smart contracts to enable secure and controlled access to cluster resources. Through Ethereum-based simulations, the study demonstrates that appropriate indexing and hashing mechanisms can effectively enforce access control while maintaining acceptable execution cost and execution time.

Gamze Cabadag, Ali Degirmenci, and Omer Karal from Türkiye research in their work FFT-based radar frequency estimation errors arising from non-integer FFT bin alignment and evaluate twelve interpolation techniques under Gaussian and Laplace noise over varying SNRs and bandwidths. Monte Carlo analyses combined with FLOPs-based complexity evaluation show that the improved Quinn method achieves the highest estimation accuracy for both noise types, while simpler methods offer lower computational cost with reduced performance.

Last but not least, Mashael M. Alsulami, Kholoud Althobaiti and Haneen Algethami from Saudi Arabia address in their paper the limitation of traditional job recommendation systems by introducing JobMatcher, a multi-layered framework that combines content-based filtering-KNN, and large language model–based evaluation to better capture career context and progression. The findings show that utilizing ChatGPT as a refinement layer improves alignment with expert judgments, resulting in more relevant and realistic job recommendations.

Enjoy Reading!

Best regards,
Christian Gütl, Managing Editor-in-Chief
Graz University of Technology, Graz, Austria
- Research Article
- 10.3897/jucs.140733
- Feb 28, 2026
- JUCS - Journal of Universal Computer Science
- Hanan Hafiz + 1 more
The rapid adoption of cloud computing has revolutionized how businesses and consumers access and utilize resources, offering scalability, flexibility, and cost effectiveness. However, this increased reliance on cloud services has also led to a rise in Distributed Denial of Service (DDoS) attacks, which can severely impact the availability and performance of these services. This study aims to address the critical need for effective detection and classification of DDoS attacks in cloud environments using machine learning techniques. We conducted binary and multiclass classification experiments using the CICDDoS2019 dataset, focusing on three specific types of attacks. Four machine learning models, namely Random Forest, K-Nearest Neighbor, Naïve Bayes, and Logistic Regression, were implemented in a Kaggle notebook using Python. Feature selection techniques, including chi-square and Principal Component Analysis, were employed to identify the most relevant features, while an oversampling technique was used to handle imbalanced data. The experiments yielded impressive results, with Random Forest and K-Nearest Neighbor achieving the highest accuracy rates of 100% and 99.72% in binary classification, and 100% and 99.66% in multiclass classification, respectively. The study also measured training and testing times, along with other performance metrics. These findings highlight the effectiveness of machine learning approaches in tackling cloud-based detection challenges while ensuring computational efficiency tailored for dynamic cloud environments.
- Research Article
- 10.9798/kosham.2026.26.1.167
- Feb 28, 2026
- Journal of the Korean Society of Hazard Mitigation
- Jinmi Lee + 3 more
In the event of a large-scale earthquake, it is difficult to promptly estimate damage, making it challenging to establish an effective response strategy. To address these limitations, this study developed a preliminary earthquake damage assessment model for buildings using machine learning based on empirical damage data from the 2017 Pohang Earthquake. Six machine-learning models were established, and resampling techniques such as SMOTE (Synthetic Minority Oversampling Technique) and random sampling were applied for performance comparison and analysis to mitigate the chronic class imbalance problem of the dataset. The results indicate that the application of random sampling generally improves the model performance, with tree-based ensemble models achieving significantly high recall and AUC (Area Under the Curve) values. These findings suggest that the proposed model has a strong potential as an effective damage assessment tool for reliably detecting damaged buildings with minimal false negatives.
- Research Article
- 10.3390/rs18050729
- Feb 28, 2026
- Remote Sensing
- Sina Jarahizadeh + 1 more
Estimating individual tree Above-Ground Biomass (AGB) is essential for assessing ecological functions and carbon storage in both forest and urban environments. Traditional field-based methods, such as plot measurements, are costly and impractical for large-scale applications. However, satellite- and aerial-based techniques lack the spatial resolution for individual-tree-level analysis. Unmanned Aerial Vehicle (UAV) Light Detection and Ranging (LiDAR) data, combined with machine learning (ML), offers a powerful alternative for detailed tree structure measurement and AGB estimation. Leveraging advances in deep-learning-based individual tree detection and geometric structure estimation, including Height (H), Surface Area (SA), Volume (V), and Crown Width (CW), this study develops ML regression models for estimating individual tree AGB. We explore three objectives: (1) evaluating four regression models, namely Random Forest (RF), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), and Feed-Forward Neural Network (FFNN); (2) assessing the sensitivity of model accuracy to different geometric feature combinations; and (3) improving model robustness using Synthetic Minority Over-sampling Technique (SMOTE) data augmentation to address imbalanced data. Results show that the RF model outperforms the others, achieving the lowest RMSE and the most balanced residual distribution. CW was the strongest single predictor of AGB and, in combination with H, yielded the most accurate results. This combination improved RMSE and R² by 14.2% and 89.3% with respect to single-variable-based models. Integrating SMOTE with RF further improved model performance, lowering RMSE by 225.6 kg (~22.1%) and increasing R² by 0.76 (~49.0%). This was particularly evident in the underrepresented low and high AGB ranges.
The proposed RF-SMOTE approach is a cost-effective and scalable method for generating high-quality ground truth data to enable large-scale satellite-based biomass estimation and to support forest carbon accounting and planning in cities and forests.
- Research Article
- 10.5582/bst.2025.01323
- Feb 28, 2026
- Bioscience trends
- Tzu-Chun Lin + 16 more
Closely associated with metabolic disorders, non-alcoholic fatty liver disease (NAFLD) substantially increases the risk of hepatocellular carcinoma. This study aimed to apply machine learning (ML) algorithms to a community-based cohort in southern Taiwan to identify key risk factors for NAFLD and to develop predictive models with clinical applicability. Data were derived from community health examinations, and eighteen clinical and demographic features were analyzed. Five ML algorithms were evaluated: logistic regression (LR), random forest (RF), K-nearest neighbors (KNN), adaptive boosting (AdaBoost), and extreme gradient boosting (XGBoost). Model performance was assessed using accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUROC). A total of 7,510 participants were included (38.8% male; mean age 50.9 ± 15.0 years). The dataset was randomly divided into training (80%) and testing (20%) subsets, with no significant differences observed between groups in most independent variables. The Synthetic Minority Over-sampling Technique (SMOTE) was employed to balance NAFLD and non-NAFLD groups in the training dataset. Among all models, XGBoost achieved the highest performance, with an accuracy of 83.48%, precision of 84.31%, recall of 81.21%, F1 score of 82.72%, and AUROC of 92.85%. Feature importance analysis identified low-density lipoprotein cholesterol (LDL-C), body mass index (BMI), waist circumference, fasting plasma glucose (FPG), and triglycerides (TG) as the most influential predictors of NAFLD. ML algorithms, particularly XGBoost, demonstrated high accuracy in predicting NAFLD and effectively identified key clinical predictors. These findings may enhance early diagnosis and facilitate the development of targeted intervention strategies in the management of NAFLD.
- Research Article
- 10.3390/sym18030412
- Feb 27, 2026
- Symmetry
- Guoping Liu + 5 more
With the evolution of smart grids, power communication networks are increasingly required to support high-bandwidth and diversified services such as high-definition video, real-time control, and positioning—services that impose dual challenges of communication capacity and spectrum constraints—under severe resource limitations. Conventional orthogonal modulation schemes exhibit significant limitations in spectral efficiency and concurrent access capabilities, particularly in supporting high-density user environments. To address this, we propose a communication system based on non-orthogonal overlapped chirp modulation, in which the intrinsic symmetry properties of chirp waveforms are utilized to enhance system design and performance. We first construct the system architecture with a multi-symbol concurrent transmission scheme and introduce continuous orthogonal phase modulation to improve symbol distinguishability and mitigate inter-symbol interference—an approach that effectively harnesses signal symmetry for interference suppression. At the receiver, a low-complexity demodulation algorithm based on correlation matrix computation is developed, further improved through oversampling techniques that exploit temporal and spectral symmetry in signal design. Monte Carlo simulations confirm that the proposed system outperforms traditional orthogonal chirp and orthogonal frequency division multiplexing systems in bit error rate performance and spectral efficiency across varying signal-to-noise ratios and modulation schemes. The proposed NOOC system achieves spectral efficiency scaling linearly with concurrency level K, reaching up to 16 bits/s/Hz for K = 16 with BPSK, compared to 1 bit/s/Hz in orthogonal systems. The study provides both a theoretical foundation and practical insights for developing symmetry-aware, efficient, and reliable air interface technologies suitable for future power-private networks.
- Research Article
- 10.1021/acsami.5c24910
- Feb 25, 2026
- ACS applied materials & interfaces
- Kevin Dedecker + 3 more
Converting atomic layer deposited (ALD) ZnO thin films into high-quality zeolitic imidazolate framework-8 (ZIF-8) membranes poses significant challenges in identifying optimal synthesis conditions. This study employs a comprehensive machine learning approach to predict conversion outcomes based on 68 experimental conditions with varying solvent systems, temperatures, and reaction durations. We systematically evaluated 7 classification algorithms including k-nearest neighbors (k-NN), random forests, neural networks, and decision trees using stratified 10-fold cross-validation. The optimized k-NN classifier (k = 5) achieved 92.6% accuracy with a Kappa statistic of 0.791, demonstrating excellent discrimination between high- and low-quality membrane layer outcomes. Feature importance analysis identified the primary solvent as the most influential predictor, followed by temperature and reaction duration within specific regimes. Decision tree analysis further revealed a critical temperature threshold of 80 °C for methanol-based systems, below which extended reaction times are required. Application of Synthetic Minority Oversampling Technique (SMOTE) improved minority class detection while maintaining high specificity. The developed predictive framework enables the screening of conversion conditions with over 90% confidence, potentially reducing the number of experimental trials significantly while accelerating the discovery and optimization of ZIF-8 membrane fabrication protocols. This data-driven methodology provides a blueprint for extending machine learning-based optimization to other metal-organic framework systems and complex materials synthesis challenges.
- Research Article
- 10.3389/fmed.2026.1751311
- Feb 25, 2026
- Frontiers in medicine
- Ziyan Gan + 9 more
To investigate the factors associated with in-hospital survival prognosis in participants with malignant tumors complicated by sepsis and to develop a predictive model. A retrospective study was conducted to collect data from 2,152 participants with malignant tumors complicated by sepsis, hospitalized at Guangdong Provincial Hospital of Chinese Medicine between January 2014 and June 2024. Univariate and multivariable logistic regression analyses were performed to identify independent risk factors, and the ADASYN oversampling technique was applied to address class imbalance. The dataset was randomly split into training and testing sets at an 8:2 ratio. Key features were selected using the recursive feature elimination (RFE) method, and eight machine learning models (logistic regression, decision tree, random forest, K-nearest neighbors, support vector machine, naive Bayes, stochastic gradient boosting, and neural network) were evaluated and hyperparameter-optimized. A total of 2,152 participants were included in the study, with an in-hospital mortality rate of 12.6%. Multivariable analysis indicated that age, SOFA score, coagulation dysfunction, and metabolic abnormalities were important prognostic risk factors. The random forest model showed excellent discriminative ability on the validation set, with an AUC of 0.95, sensitivity of 91%, and specificity of 85%. A total of 10 features with the highest predictive value were selected using the RFE method, including troponin T, platelet distribution width, neutrophil count, red blood cell distribution width, fibrinogen, prothrombin time activity, aspartate transaminase, urea, low-density lipoprotein cholesterol, and creatinine. Age, SOFA score, coagulation dysfunction, and metabolic abnormalities are important prognostic risk factors for participants with malignant tumors complicated by sepsis. 
The random forest model constructed based on these key features has good predictive performance and can provide a powerful tool for the prognosis assessment of participants with malignant tumors complicated by sepsis. Future research needs to further validate the applicability and practical value of the model in different populations.