Similarity-Weighted IoU (sIOU): A Comprehensive Metric for Evaluating Model Performance Through Similarity-Weighted Class Overlaps

Abstract

Semantic segmentation is crucial for understanding a 2D image or a 3D point cloud: it assigns each pixel or point to a specific semantic class. A key objective of machine learning models used for segmentation is to improve accuracy and reliability by reducing misclassifications. Existing approaches treat all misclassification errors uniformly and do not consider specific application requirements. Recognizing that some misclassifications may be more acceptable than others, depending on the particular class and application, we propose a novel Similarity-weighted Intersection over Union metric (sIOU). This approach provides a comprehensive evaluation framework, particularly in challenging scenarios with indistinct class boundaries, by incorporating wrongly segmented classes through a similarity score. The metric offers a more refined assessment of a model's segmentation performance in complex, real-world environments. Validation with an in-house indoor dataset demonstrates its superiority over existing metrics and highlights its flexibility for diverse application contexts.
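The abstract does not state the sIOU formula, so the following is only a minimal sketch of one plausible reading, not the authors' definition. It assumes a user-supplied similarity matrix `sim` (hypothetical, values in [0, 1] with ones on the diagonal) and gives each misclassified pixel partial credit equal to the similarity between its true and predicted class. The function names `confusion_matrix`, `iou_per_class`, and `similarity_weighted_iou` are illustrative.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix where entry [i, j] is the number of samples of true class i predicted as j."""
    y_true = np.asarray(y_true, dtype=np.int64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.int64).ravel()
    flat = y_true * num_classes + y_pred
    counts = np.bincount(flat, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def iou_per_class(conf):
    """Standard IoU per class: TP / (TP + FP + FN)."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return np.divide(tp, union, out=np.zeros_like(tp), where=union > 0)


def similarity_weighted_iou(conf, sim):
    """Assumed sIOU-style score: misclassifications earn partial credit.

    For true class c, the numerator is sum_j sim[c, j] * conf[c, j], so a
    correct prediction counts fully (sim[c, c] == 1) and a confusion with a
    similar class counts partially. The denominator is the usual union.
    This weighting scheme is an assumption, not taken from the paper.
    """
    sim = np.asarray(sim, dtype=float)
    tp = np.diag(conf).astype(float)
    credit = (conf * sim).sum(axis=1)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return np.divide(credit, union, out=np.zeros_like(tp), where=union > 0)


if __name__ == "__main__":
    # Toy labels for three classes; classes 0 and 1 are treated as half-similar.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    sim = [[1.0, 0.5, 0.0],
           [0.5, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
    conf = confusion_matrix(y_true, y_pred, num_classes=3)
    print("IoU :", iou_per_class(conf))
    print("sIOU:", similarity_weighted_iou(conf, sim))
```

With an identity similarity matrix this sketch reduces to the standard per-class IoU, which is a convenient sanity check; choosing the similarity values is where application requirements would enter.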

Similar Papers
  • Research Article
  • 10.3390/jcm14248934
Early Prediction of Acute Respiratory Distress Syndrome in Critically Ill Polytrauma Patients Using Balanced Random Forest ML: A Retrospective Cohort Study.
  • Dec 17, 2025
  • Journal of clinical medicine
  • Nesrine Ben El Hadj Hassine + 8 more

Background/Objectives: Acute respiratory distress syndrome (ARDS) represents a critical complication in polytrauma patients, characterized by diffuse lung inflammation and bilateral pulmonary infiltrates with mortality rates reaching 45% in intensive care units (ICU). The heterogeneous nature of ARDS and complex clinical presentation in severely injured patients pose substantial diagnostic challenges, necessitating early prediction tools to guide timely interventions. Machine learning (ML) algorithms have emerged as promising approaches for clinical decision support, demonstrating superior performance compared to traditional scoring systems in capturing complex patterns within high-dimensional medical data. Based on the identified research gaps in early ARDS prediction for polytrauma populations, our study aimed to: (i) develop a balanced random forest (BRF) ML model for early ARDS prediction in critically ill polytrauma patients, (ii) identify the most predictive clinical features using ANOVA-based feature selection, and (iii) evaluate model performance using comprehensive metrics addressing class imbalance challenges. Methods: This retrospective cohort study analyzed 407 polytrauma patients admitted to the ICU of the Center of Traumatology and Major Burns of Ben Arous, Tunisia, between 2017 and 2021. We implemented a comprehensive ML pipeline that incorporates Tomek Links undersampling, ANOVA F-test feature selection for the top 10 predictive variables, and SMOTE oversampling with a conservative sampling rate of 0.3. The BRF classifier was trained with class weighting and evaluated using stratified 5-fold cross-validation. Performance metrics included AUROC, PR-AUC, sensitivity, specificity, F1-score, and Matthews correlation coefficient. Results: Among 407 patients, 43 developed ARDS according to the Berlin definition, representing a 10.57% incidence. The BRF model demonstrated exceptional predictive performance with an AUROC of 0.98, a sensitivity of 0.91, a specificity of 0.80, an F1-score of 0.84, and an MCC of 0.70. Precision-recall AUC reached 0.86, demonstrating robust performance despite class imbalance. During stratified cross-validation, AUROC values ranged from 0.93 to 0.99 across folds, indicating consistent model stability. The top 10 selected features included procalcitonin, PaO2 at ICU admission, 24-h pH, massive transfusion, total fluid resuscitation, presence of pneumothorax, alveolar hemorrhage, pulmonary contusion, hemothorax, and flail chest injury. Conclusions: Our BRF model provides a robust, clinically applicable tool for early prediction of ARDS in polytrauma patients using readily available clinical parameters. The comprehensive two-step resampling approach, combined with ANOVA-based feature selection, successfully addressed class imbalance while maintaining high predictive accuracy. These findings support integrating ML approaches into critical care decision-making to improve patient outcomes and resource allocation. External validation in diverse populations remains essential for confirming generalizability and clinical implementation.

  • Research Article
  • Cite count: 56
  • 10.3390/w10060710
Calibration Parameter Selection and Watershed Hydrology Model Evaluation in Time and Frequency Domains
  • May 31, 2018
  • Water
  • Karthik Kumarasamy + 1 more

Watershed scale models simulating hydrological and water quality processes have advanced rapidly in sophistication, process representation, flexibility in model structure, and input data. With calibration being an inevitable step prior to any model application, there is a need for a simple procedure to assess whether or not a parameter should be adjusted for calibration. We provide a rationale for a hierarchical selection of parameters to adjust during calibration and recommend that modelers progress from parameters that are most uncertain to parameters that are least uncertain, namely starting with pure calibration parameters, followed by derived parameters, and finally measured parameters. We show that different information contained in time and frequency domains can provide useful insight regarding the selection of parameters to adjust in calibration. For example, wavelet coherence analysis shows time periods and scales where a particular parameter is sensitive. The second component of the paper discusses model performance evaluation measures. Given the importance of these models to support decision-making for a wide range of environmental issues, the hydrology community is compelled to improve the metrics used to evaluate model performance. More targeted and comprehensive metrics will facilitate better and more efficient calibration and will help demonstrate that the model is useful for the intended purpose. Here, we introduce a suite of new tools for model evaluation, packaged as an open-source Hydrologic Model Evaluation (HydroME) Toolbox. We apply these tools in the calibration and evaluation of Soil and Water Assessment Tool (SWAT) models of two watersheds, the Le Sueur River Basin (2880 km²) and Root River Basin (4300 km²) in southern Minnesota, USA.

  • Research Article
  • 10.71356/ijaia.v1.i2.66
Performance Evaluation of Classical Machine Learning Models for Emotion Classification
  • Dec 31, 2025
  • International Journal of Artificial Intelligence Applications
  • Motaz Zghoul + 1 more

Emotion detection in textual data represents a critical challenge in natural language processing with applications in mental health monitoring, customer sentiment analysis, and human-computer interaction. This study investigates three classical machine learning algorithms for multi-class emotion classification across eleven emotional categories using a balanced dataset of approximately 106,000 annotated sentences. The research employs Term Frequency-Inverse Document Frequency vectorization with trigram support and 3,000-dimensional feature space. Logistic Regression, Random Forest, and Naive Bayes classifiers were evaluated using comprehensive metrics including accuracy, precision, recall, F1-score, and five-fold cross-validation. Results demonstrate that Logistic Regression achieved superior performance with 79.90% accuracy, 81.18% precision, and 80.27% F1-score, substantially exceeding Random Forest at 75.32% and Naive Bayes at 69.01%. Cross-validation analysis revealed remarkable stability with standard deviations below 0.5%, confirming robust generalization. Per-class analysis identified enthusiasm, love, and neutral as most reliably detected emotions exceeding 83% accuracy, while empty and sadness presented greater challenges. The findings validate that classical machine learning approaches with proper feature engineering achieve competitive performance for fine-grained emotion detection while offering advantages in computational efficiency, interpretability, and deployment simplicity.

  • Preprint Article
  • 10.20944/preprints202409.0760.v1
Designing a Holistic Student Evaluation Model for E-Learning Using a Multi-Task Attention-Based Deep Learning Approach
  • Sep 10, 2024
  • Preprints.org
  • Deborah Olaniyan + 4 more

This study presents the development and evaluation of a Multi-Task Long Short-Term Memory (LSTM) model with an Attention Mechanism designed to predict students' academic performance. The model concurrently addresses two tasks: predicting overall performance (total score) as a regression task and categorizing performance levels (remarks) as a classification task. By processing both tasks simultaneously, the model optimizes computational efficiency and resource use. The dataset includes detailed student performance records across various metrics such as Continuous Assessment, Practical Skills, Demeanor, Presentation Quality, Attendance, and Participation. The model's performance was evaluated using comprehensive metrics. For the regression task, it achieved a Mean Absolute Error (MAE) of 0.0249, Mean Squared Error (MSE) of 0.0012, and Root Mean Squared Error (RMSE) of 0.0346. For the classification task, it attained perfect scores with an accuracy, precision, recall, and F1 score of 1.0. These results highlight the model's high accuracy and robustness in predicting both continuous and categorical outcomes. The Attention Mechanism enhances the model's capabilities by identifying and focusing on the most relevant features. This study demonstrates the effectiveness of the Multi-Task LSTM with Attention Mechanism in educational data analysis, offering a reliable tool for predicting student performance and potential broader applications in similar multi-task learning contexts. Future work will explore further enhancements and wider applications to improve predictive accuracy and efficiency.

  • Research Article
  • Cite count: 1
  • 10.54254/2755-2721/112/20251785
Evaluating Machine Learning Techniques for Credit Risk Management: An Algorithmic Comparison
  • Nov 29, 2024
  • Applied and Computational Engineering
  • Bowen Han

The evaluation of credit risk has become an indispensable element within the financial sector. This research compares the performance of several machine learning models in predicting credit risk. It uses comprehensive metrics to compare six machine learning models, including Random Forests (RF) and Support Vector Machines (SVM). The features used to train these models were screened by a combination of Random Forest feature importance and Recursive Feature Elimination (RFE) to ensure model accuracy. After comparing the model results, the study concluded that the Random Forest model combined with RFE performed the best among all the risk columns with an accuracy of 0.71. KNN was the next best with an accuracy of 0.69. Logistic regression was the worst performer among the six models with an accuracy of only 0.29. In this study, the imbalance of the dataset categories resulted in weak identification of moderate risk categories, which shows that the model is not well adapted to a dataset with imbalanced categories. The paper validates the viability of machine learning in credit risk by offering useful advice on how it may be applied. To further enhance prediction performance, future studies could investigate the combination of more advanced data-balancing strategies and deep learning approaches.

  • Research Article
  • Cite count: 1
  • 10.1038/s41598-024-84716-2
Integrated intraoperative predictive model for malignancy risk assessment of thyroid nodules with atypia of undetermined significance cytology
  • Jan 13, 2025
  • Scientific Reports
  • Cheng Li + 3 more

Management of thyroid nodules with atypia of undetermined significance/follicular lesion of undetermined significance (AUS/FLUS) cytology is challenging because of uncertain malignancy risk. Intraoperative frozen section (IOFS) pathology provides real-time diagnosis for AUS/FLUS nodules undergoing surgery, but its accuracy is limited. This study aimed to develop an integrated predictive model combining clinical, ultrasound and IOFS features to improve intraoperative malignancy risk assessment. A retrospective cohort study was conducted on patients with AUS/FLUS cytology and negative BRAFV600E mutation who underwent thyroid surgery. The cohort was randomly divided into training and validation sets. Clinical, ultrasound, and pathological features were extracted for analysis. Three models were developed: an IOFS model with IOFS results as sole predictor, a clinical model integrating clinical and ultrasound features, and an integrated model combining all features. Model performance was evaluated using comprehensive metrics in both sets. The superior model was visualized as a nomogram. Among 531 included patients, the integrated model demonstrated superior diagnostic ability, predictive performance, calibration, and clinical utility compared to other models. It exhibited AUC values of 0.92 in the training set and 0.95 in the validation set. The nomogram provides a practical tool for estimating malignancy probability intraoperatively. This study developed an innovative integrated predictive model for intraoperative malignancy risk assessment of AUS/FLUS nodules. By combining clinical, ultrasound, and IOFS features, the model enhances IOFS diagnostic sensitivity, providing a reliable decision-support tool for optimizing surgical strategies.

  • Research Article
  • 10.2196/77858
Development and Validation of a Web-Based Machine Learning Model for Predicting Early Neurological Deterioration Following Stroke Thrombolysis: Multicenter Study
  • Dec 10, 2025
  • Journal of Medical Internet Research
  • Juan Li + 18 more

Background: Early neurological deterioration (END) significantly worsens outcomes in patients with acute ischemic stroke (AIS) receiving intravenous thrombolysis, yet clinicians lack reliable tools to identify high-risk patients who need intensified monitoring and preemptive interventions. Objective: This study aimed to develop and validate a high-performance machine learning model for END prediction that enables personalized risk-stratified management of patients with AIS after thrombolysis. Methods: This multicenter study analyzed 1927 patients with AIS who were treated with intravenous thrombolysis in 3 hospitals, comprising a development cohort (n=1361) from Lianyungang Clinical Medical College and an external validation cohort (n=566) from 2 independent hospitals. We systematically evaluated 27 clinical parameters using multiple machine learning algorithms to develop ENDRAS (Early Neurological Deterioration Risk Assessment Score), a prediction model based on 6 readily available clinical variables. Model performance was assessed through comprehensive metrics (area under the receiver operating characteristic curve, accuracy, precision, recall, F1-score) in both internal and external validation cohorts. Results: The XGBoost-based ENDRAS showed promising predictive performance (area under the receiver operating characteristic curve=0.988, 95% CI 0.983‐0.993) using 6 readily available parameters: Trial of ORG 10172 in Acute Stroke Treatment classification, intracranial artery stenosis severity, National Institutes of Health Stroke Scale score, systolic blood pressure, neutrophil count, and red blood cell distribution width. We established a dual-pathway management protocol for stratifying patients into low-risk (<29%) and high-risk (≥29%) groups, where high-risk patients receive intensive monitoring with hourly assessments and expedited imaging, while low-risk patients follow a resource-optimized protocol without compromising safety. Implemented as a web-based calculator with a <0.02-second computation time, ENDRAS enables real-time clinical decision support at the point of care. Conclusions: ENDRAS integrates END prediction into actionable clinical pathways, potentially improving postthrombolysis care through personalized monitoring strategies and targeted interventions. Its robust performance in merged cohorts, efficient computation time, and structured management framework address key challenges in stroke care while enhancing resource utilization. Further prospective validation across diverse populations is needed to fully establish ENDRAS as a standard clinical decision-support system, but its ability to identify high-risk patients early may significantly improve outcomes in AIS.

  • Research Article
  • Cite count: 1
  • 10.3390/forecast6040054
Forecasting Raw Material Yield in the Tanning Industry: A Machine Learning Approach
  • Nov 20, 2024
  • Forecasting
  • Ismael Cristofer Baierle + 4 more

This study presents an innovative machine learning (ML) approach to predicting raw material yield in the leather tanning industry, addressing a critical challenge in production efficiency. Conducted at a tannery in southern Brazil, the research leverages historical production data to develop a predictive model. The methodology encompasses four key stages: data collection, processing, prediction, and evaluation. After rigorous analysis and refinement, the dataset was reduced from 16,046 to 555 high-quality records. Eight ML models were implemented and evaluated using Orange Data Mining software, version 3.38.0, including advanced algorithms such as Random Forest, Gradient Boosting, and neural networks. Model performance was assessed through cross-validation and comprehensive metrics, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Coefficient of Determination (R²). The AdaBoost algorithm emerged as the most accurate predictor, achieving impressive results with an MAE of 0.042, MSE of 0.003, RMSE of 0.057, and R² of 0.331. This research demonstrates the significant potential of ML techniques in enhancing raw material yield forecasting within the tanning industry. The findings contribute to more efficient forecasting processes, aligning with Industry 4.0 principles and paving the way for data-driven decision-making in manufacturing.

  • Research Article
  • 10.3389/fpubh.2025.1609206
Performance comparison of artificial intelligence models in predicting 72-h emergency department unscheduled return visits
  • Dec 19, 2025
  • Frontiers in Public Health
  • Lumin Fan + 5 more

Background: Unscheduled return visits (URVs) to emergency departments (EDs) contribute significantly to healthcare burden through resource utilization and ED overcrowding. While artificial intelligence (AI) methodologies show potential in URV prediction, existing studies have employed limited algorithms with moderate performance, highlighting the need for comprehensive AI architecture comparison within unified cohorts. Objective: This study evaluated the predictive performance of multiple AI models for 72-h ED URVs, aiming to identify optimal risk stratification strategies for improved discharge planning and targeted interventions. Methods: This retrospective study analyzed adult internal medicine visits to the ED at a tertiary hospital. URVs were defined as ED revisits occurring within 72 h after initial ED discharge time. The dataset was partitioned into training (70%) and testing (30%) sets. Four traditional machine learning algorithms (logistic regression, support vector machine, random forest, and extreme gradient boosting) and one deep learning architecture (TabNet) were developed with Bayesian optimization for hyperparameter tuning. Model performance was assessed through comprehensive metrics including discrimination, calibration, clinical utility, and confusion matrices. The optimal model underwent feature importance analysis, systematic ablation studies, sensitivity analyses, and subgroup fairness evaluation. Results: Of 143,192 analyzed visits, 24,117 (16.8%) were classified as URVs. Data were allocated into training (n = 100,235) and testing (n = 42,957) sets with consistent URV proportions. TabNet demonstrated optimal discriminative performance with AUROC 0.867 (95% CI: 0.854–0.880) and sensitivity of 0.809 (95% CI: 0.801–0.816). Decision curve analysis demonstrated sustained clinical utility across threshold probabilities of 10–30%. Feature importance analysis identified initial diagnoses of digestive and respiratory system diseases, patient age, P3 triage classification, and ED visit frequency as key predictive variables. Subgroup analysis confirmed consistent performance across patient demographics and clinical characteristics. Conclusion: TabNet outperformed traditional machine learning approaches in predicting 72-h ED URVs, offering potential for improved risk stratification in emergency care settings.

  • Research Article
  • 10.1186/s12893-025-03154-7
Detecting pancreaticobiliary maljunction in pediatric congenital choledochal malformation patients using machine learning methods
  • Oct 3, 2025
  • BMC Surgery
  • Yifeng Shao + 8 more

Objective: The presence of pancreaticobiliary maljunction (PBM) in pediatric patients with congenital choledochal malformation significantly impacts clinical management and surgical decision-making. Current preoperative evaluation of PBM coexistence remains challenging in children, while intraoperative cholangiography does not consistently provide diagnostic-quality imaging. This study aims to develop machine learning-based algorithm models for detecting pancreaticobiliary maljunction (PBM) in children with congenital choledochal malformation. Methods: We conducted a retrospective study utilizing data from patients with congenital choledochal malformation treated at our center between January 2019 and January 2024. Demographic characteristics, clinical features, and preoperative laboratory parameters were processed through rigorous data curation and feature engineering pipelines. Cases were allocated via random sampling into training (80%) and hold-out test (20%) cohorts, maintaining strict separation between training and test cohorts. Seven machine learning algorithms - Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), Light Gradient Boosting Machine (LightGBM), and K-Nearest Neighbors (KNN) - were implemented with five-fold cross-validation. An ensemble voting classifier was specifically constructed using these models. Model performance was quantified through comprehensive metrics including area under the ROC curve (AUC), sensitivity, specificity, positive/negative predictive values, accuracy, precision, recall, and F1-score. This study employed the nonparametric bootstrap method to estimate the confidence interval for the area under the receiver operating characteristic curve (AUC). SHapley Additive exPlanations (SHAP) was employed for model interpretability, with feature importance rankings determined by absolute SHAP value magnitudes. Results: In a cohort of 803 pediatric patients with congenital choledochal malformation, 628 (78.2%) demonstrated concurrent pancreaticobiliary maljunction. We developed a detection model incorporating 43 clinical features, with Random Forest showing optimal performance. An ensemble voting classifier integrating seven machine learning algorithms achieved enhanced discriminative performance (AUC: 0.87 (0.81, 0.92); Recall: 0.91 (0.85, 0.95); F1-score: 0.91 (0.87, 0.94)). Key features contributing to PBM detection included: laboratory markers and clinical parameters. Conclusion: By integrating preoperative clinical symptoms and laboratory parameters, machine learning algorithms demonstrated significant detection capability in identifying PBM among pediatric congenital choledochal malformation patients, with the RF model achieving superior performance metrics among all base models. The developed ensemble voting classifier provides valuable preoperative guidance for surgical planning and clinical management, enabling detection of PBM comorbidity before surgery in congenital choledochal malformation cases.

  • Research Article
  • Cite count: 77
  • 10.1016/j.jclepro.2020.123231
Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: A case study of Huaihai Economic Zone
  • Aug 2, 2020
  • Journal of Cleaner Production
  • Kefei Zhang + 3 more


  • Research Article
  • Cite count: 152
  • 10.1016/j.knosys.2020.106695
Rolling bearing fault diagnosis using optimal ensemble deep transfer network
  • Dec 17, 2020
  • Knowledge-Based Systems
  • Xingqiu Li + 3 more


  • Research Article
  • Cite count: 3
  • 10.1016/j.cca.2025.120165
Performance and efficiency of machine learning models in analyzing capillary serum protein electrophoresis.
  • Mar 1, 2025
  • Clinica chimica acta; international journal of clinical chemistry
  • Xia Wang + 5 more


  • Research Article
  • Cite count: 23
  • 10.1016/j.health.2024.100362
A novel integrated logistic regression model enhanced with recursive feature elimination and explainable artificial intelligence for dementia prediction
  • Sep 14, 2024
  • Healthcare Analytics
  • Rasel Ahmed + 6 more


  • Research Article
  • Cite count: 9
  • 10.1016/j.clnu.2023.02.005
Sex-specific equations to estimate body composition: Derivation and validation of diagnostic prediction models using UK Biobank
  • Feb 16, 2023
  • Clinical Nutrition
  • Yueqi Lu + 13 more

Body mass index and waist circumference are simple measures of obesity. However, they do not distinguish between visceral and subcutaneous fat, or muscle, potentially leading to biased relationships between individual body composition parameters and adverse health outcomes. The purpose of this study was to develop and validate prediction models for volumetric adipose and muscle. Based on cross-sectional data of 18,457, 18,260, and 17,052 White adults from the UK Biobank, we developed sex-specific equations to estimate visceral adipose tissue (VAT), abdominal subcutaneous adipose tissue (ASAT), and total thigh fat-free muscle (FFM) volumes, respectively. Volumetric magnetic resonance imaging served as the reference. We used the least absolute shrinkage and selection operator and the extreme gradient boosting methods separately to fit three sequential models, the inputs of which included demographics and anthropometrics and, in some, bioelectrical impedance analysis parameters. We applied comprehensive metrics to assess model performance in the temporal validation set. The equations that included more predictors generally performed better. Accuracy of the equations was moderate for VAT (percentage of estimates that differed <30% from the measured values, 70 to 78 in males, 64 to 69 in females) and good for ASAT (85 to 91 in males, 90 to 95 in females) and FFM (99 to 100 in both sexes). All the equations appeared precise (interquartile range of the difference, 0.89 to 1.76 L for VAT, 1.16 to 1.61 L for ASAT, 0.81 to 1.39 L for FFM). Bias of all the equations was negligible (-0.17 to 0.05 L for VAT, -0.10 to 0.12 L for ASAT, -0.07 to 0.09 L for FFM). The equations achieved superior cardiometabolic correlations compared with body mass index and waist circumference. The developed equations to estimate VAT, ASAT, and FFM volumes achieved moderate to good performance. They may be cost-effective tools to revisit the implications of diverse body components.
