Constructing Surrogate Models in Machine Learning Using Combinatorial Testing and Active Learning

Abstract

Machine learning (ML)-based models are often black boxes, making it challenging to understand and interpret their decision-making processes. Surrogate models are constructed to approximate the behavior of a target model and are an essential tool for analyzing black-box models. The construction of a surrogate model typically involves querying the target model with carefully selected data points and using the responses from the target model to infer information about its structure and parameters.
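As a rough illustration of the querying workflow described above, the sketch below (not the paper's algorithm, and using hypothetical scikit-learn stand-ins) fits an interpretable decision tree to the responses of a black-box classifier and grows the query set with uncertainty-based active learning.

```python
# Minimal sketch (not the paper's method): approximate a black-box classifier
# with an interpretable surrogate by querying it on selected points and
# refining the query set with uncertainty-based active learning.
import numpy as np
from sklearn.ensemble import RandomForestClassifier   # stands in for the black-box target
from sklearn.tree import DecisionTreeClassifier       # interpretable surrogate
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
target = RandomForestClassifier(random_state=0).fit(X, y)   # the "black box" we only query

pool = np.random.default_rng(0).uniform(X.min(0), X.max(0), size=(5000, X.shape[1]))
queried = pool[:50]                                   # initial query set
labels = target.predict(queried)                      # responses from the target model

surrogate = DecisionTreeClassifier(max_depth=5)
for _ in range(10):                                   # active-learning rounds
    surrogate.fit(queried, labels)
    proba = surrogate.predict_proba(pool)
    uncertainty = 1 - proba.max(axis=1)               # least-confident sampling
    new = pool[np.argsort(uncertainty)[-25:]]         # most uncertain candidates
    queried = np.vstack([queried, new])
    labels = np.concatenate([labels, target.predict(new)])

surrogate.fit(queried, labels)
agreement = (surrogate.predict(pool) == target.predict(pool)).mean()
print(f"surrogate fidelity on the pool: {agreement:.3f}")
```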

Similar Papers
  • Research Article
  • Cited by 6
  • 10.1371/journal.pone.0318167
Data-driven survival modeling for breast cancer prognostics: A comparative study with machine learning and traditional survival modeling methods.
  • Apr 22, 2025
  • PloS one
  • Theophilus Gyedu Baidoo + 1 more

Background: This investigation delves into the potential application of data-driven survival modeling approaches for prognostic assessments of breast cancer survival. The primary objective is to evaluate and compare the ability of machine learning (ML) models and conventional survival analysis techniques to identify consistent key predictors of breast cancer survival outcomes. Methods: This study employs data-driven survival modeling approaches to predict breast cancer survival, including survival-specific methods such as the Cox Proportional Hazards (CPH) model, Random Survival Forests (RSF), and Cox Proportional Deep Neural Networks (DeepSurv), as well as machine learning models like Random Forests (RF), XGBoost, Support Vector Machines (SVM) with an RBF Kernel, and LightGBM. The dataset, sourced from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) program, comprises 4,024 women diagnosed with infiltrating duct and lobular carcinoma breast cancer between 2006 and 2010. To ensure interpretability across all models, the Shapley Additive Explanation (SHAP) method was applied to RSF, DeepSurv, Random Forests (RF), and XGBoost. This enabled the identification of key predictors influencing breast cancer survival, highlighting consistent factors across models while uncovering unique insights specific to each approach. Results: The performance of survival-specific and ML models was evaluated using the Concordance index (C-index), Integrated Brier Score (IBS), mean accuracy, and mean AUC. The CPH model achieved a C-index of 0.71±0.015 and an IBS of 0.08±0.006, while RSF demonstrated slightly better discriminatory power with a C-index of 0.72±0.0117. DeepSurv performed comparably, with a C-index of 0.71±0.0095 and an IBS of 0.09±0.0008. Both Cox and RSF models achieved the lowest IBS (0.08), indicating accurate survival probability predictions over time. For ML models, RF achieved a mean AUC of 0.74±0.0021 and XGBoost a mean AUC of 0.69±0.0183, reflecting fair discriminatory ability but not accounting for censoring in survival data. SHAP analysis for the top-performing models highlighted the extent of lymph node involvement, Regional Node-Positive (number of affected lymph nodes), tumor grade (cell abnormality and growth rate), progesterone status, and age as key predictors of breast cancer survival outcomes. Conclusions: While ML models like XGBoost and RF can effectively identify important predictors and patterns in breast cancer outcomes, survival-specific methods such as the Cox model, RSF, and DeepSurv provide essential capabilities for handling time-to-event data and censoring, making them more suitable for accurate survival predictions. The primary objective of including ML models in this analysis was to leverage their interpretability in identifying key variables alongside survival-specific models, rather than to directly compare their performance against survival models. By examining both ML and survival models, this research highlights the complementary strengths of each approach. This study contributes to the integration of artificial intelligence in healthcare, emphasizing the value of data-driven survival modeling techniques in supporting healthcare professionals with accurate, personalized, and actionable insights for high-risk patients. Together, these approaches enhance the precision of survival predictions, paving the way for more informed clinical decision-making and improved patient care.
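For readers unfamiliar with the survival tooling mentioned above, the following hedged sketch fits a Cox proportional hazards model and reports Harrell's C-index with the lifelines library; the bundled Rossi dataset and its column names stand in for the SEER data and are not the study's variables.

```python
# Illustrative sketch only (not the study's pipeline): fit a Cox proportional
# hazards model and report the concordance index with lifelines.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi      # bundled example data, stands in for SEER
from lifelines.utils import concordance_index

df = load_rossi()                              # columns: week (duration), arrest (event), covariates
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

# Harrell's C-index: higher predicted risk should align with earlier events,
# so the partial hazard is negated before scoring.
c_index = concordance_index(df["week"], -cph.predict_partial_hazard(df), df["arrest"])
print(f"C-index: {c_index:.3f}")
cph.print_summary()                            # hazard ratios per covariate
```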

  • Research Article
  • Cited by 4
  • 10.1038/s41598-022-20012-1
Perception without preconception: comparison between the human and machine learner in recognition of tissues from histological sections
  • Sep 30, 2022
  • Scientific Reports
  • Sanghita Barui + 4 more

Deep neural networks (DNNs) have shown success in image classification, with high accuracy in recognition of everyday objects. Performance of DNNs has traditionally been measured assuming human accuracy is perfect. In specific problem domains, however, human accuracy is less than perfect and a comparison between humans and machine learning (ML) models can be performed. In recognising everyday objects, humans have the advantage of a lifetime of experience, whereas DNN models are trained only with a limited image dataset. We compared the performance of human learners and two DNN models on an image dataset which is novel to both, i.e. histological images. We thus aim to eliminate the advantage of prior experience that humans have over DNN models in image classification. Ten classes of tissues were randomly selected from the undergraduate first year histology curriculum of a Medical School in North India. Two machine learning (ML) models were developed based on the VGG16 (VML) and Inception V2 (IML) DNNs, using transfer learning, to produce 10-class classifiers. One thousand (1000) images belonging to the ten classes (i.e. 100 images from each class) were split into training (700) and validation (300) sets. After training, the VML and IML models achieved 85.67% and 89% accuracy on the validation set, respectively. The training set was also circulated to medical students (MS) of the college for a week. An online quiz, consisting of a random selection of 100 images from the validation set, was conducted on students (after obtaining informed consent) who volunteered for the study. Sixty-six students participated in the quiz, providing 6557 responses. In addition, we prepared a set of 10 images which belonged to different classes of tissue not present in the training set (i.e. out of training scope or OTS images). A second quiz was conducted on medical students with OTS images, and the ML models were also run on these OTS images. The overall accuracy of MS in the first quiz was 55.14%. The two ML models were also run on the first quiz questionnaire, producing accuracy between 91% and 93%; the ML models thus outscored more than 80% of the medical students. Analysis of confusion matrices of both ML models and all medical students showed dissimilar error profiles. However, when comparing the subset of students who achieved accuracy similar to the ML models, the error profile was also similar. Recognition of ‘stomach’ proved difficult for both humans and ML models. In four images in the first quiz set, both the VML model and the medical students produced highly equivocal responses. Within these images, a pattern of bias was uncovered: the tendency of medical students to misclassify ‘liver’ tissue. The ‘stomach’ class proved most difficult for both MS and VML, producing 34.84% of all errors of MS and 41.17% of all errors of the VML model; however, the IML model committed most errors in recognising the ‘skin’ class (27.5% of all errors). Analysis of the convolution layers of the DNN outlined features in the original image which might have led to misclassification by the VML model. In OTS images, however, the medical students produced a better overall score than both ML models, i.e. they successfully recognised patterns of similarity between tissues and could generalise their training to a novel dataset. Our findings suggest that within the scope of training, ML models perform better than 80% of medical students, with a distinct error profile. However, students whose accuracy approaches that of the ML models tend to replicate the models' error profile. This suggests a degree of similarity between how machines and humans extract features from an image. If asked to recognise images outside the scope of training, humans perform better at recognising patterns and likeness between tissues. This suggests that ‘training’ is not the same as ‘learning’, and humans can extend their pattern-based learning to different domains outside of the training set.
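A minimal transfer-learning sketch in the spirit of the paper's VGG16-based "VML" model is shown below; the directory paths, image size, head architecture, and training settings are assumptions for illustration, not the authors' configuration.

```python
# Sketch of a VGG16 transfer-learning classifier for 10 tissue classes;
# the folder layout and hyperparameters are hypothetical.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                                 # reuse ImageNet features

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.applications.vgg16.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)   # 10 tissue classes
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical folders of histology images split into train/validation subsets.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "histology/train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "histology/val", image_size=(224, 224), batch_size=32)
model.fit(train_ds, validation_data=val_ds, epochs=10)
```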

  • Research Article
  • Cited by 12
  • 10.1093/ehjqcco/qcad028
Comparative analysis of machine learning vs. traditional modeling approaches for predicting in-hospital mortality after cardiac surgery: temporal and spatial external validation based on a nationwide cardiac surgery registry.
  • May 22, 2023
  • European Heart Journal - Quality of Care and Clinical Outcomes
  • Juntong Zeng + 6 more

Preoperative risk assessment is crucial for cardiac surgery. Although previous studies suggested machine learning (ML) may improve in-hospital mortality predictions after cardiac surgery compared to traditional modeling approaches, their validity is uncertain owing to a lack of external validation, limited sample sizes, and inadequate modeling considerations. We aimed to compare the predictive performance of ML and traditional modelling approaches while addressing these major limitations. Adult cardiac surgery cases (n=168565) between 2013 and 2018 in the Chinese Cardiac Surgery Registry were used to develop, validate, and compare various ML vs. logistic regression (LR) models. The dataset was split for temporal (2013-2017 for training, 2018 for testing) and spatial (geographically-stratified random selection of 83 centers for training, 22 for testing) experiments. Model performances were evaluated in testing sets for discrimination and calibration. The overall in-hospital mortality was 1.9%. In the temporal testing set (n=32184), the best-performing ML model demonstrated a similar area under the receiver operating characteristic curve (AUC) of 0.797 (95% CI 0.779-0.815) to the LR model (AUC 0.791 [95% CI 0.775-0.808]; P=0.12). In the spatial experiment (n=28323), the best ML model showed a statistically significant but modest performance improvement (AUC 0.732 [95% CI 0.710-0.754]) over LR (AUC 0.713 [95% CI 0.691-0.737]; P=0.002). Varying feature selection methods had relatively small effects on ML models. Most ML and LR models were significantly miscalibrated. ML provided only marginal improvements over traditional modelling approaches in predicting cardiac surgery mortality with routine preoperative variables, which calls for more judicious use of ML in practice.
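The comparison described above can be sketched roughly as follows: a logistic regression and a gradient-boosting classifier are trained on a temporal split and scored by AUC. The synthetic data, event rate, and models here are placeholders, not the registry data or the authors' pipelines.

```python
# Sketch of the kind of comparison described (not the registry data or the
# authors' models): logistic regression vs. gradient boosting on a temporal
# split, evaluated by AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=20000, n_features=30, weights=[0.98, 0.02],
                           random_state=0)          # ~2% event rate, like 1.9% mortality
years = np.random.default_rng(0).integers(2013, 2019, size=len(y))
train, test = years < 2018, years == 2018           # temporal split: 2013-2017 vs 2018

lr = LogisticRegression(max_iter=1000).fit(X[train], y[train])
gbm = HistGradientBoostingClassifier(random_state=0).fit(X[train], y[train])

for name, m in [("LR", lr), ("GBM", gbm)]:
    auc = roc_auc_score(y[test], m.predict_proba(X[test])[:, 1])
    print(f"{name} AUC on temporal test set: {auc:.3f}")
```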

  • Preprint Article
  • 10.5194/egusphere-egu23-11636
State-of-the-Art Review of Machine Learning Models in Civil Engineering: Based on DAMIE Classification Tree
  • May 15, 2023
  • Jaehyun Kim + 1 more

In recent years, Machine Learning (ML) models have proven useful for solving problems in a wide variety of fields such as medicine, economics, manufacturing, transportation, energy, and education. With increased interest in ML models and advances in sensor technologies, ML models are being widely applied in the civil engineering domain as well. ML models enable the analysis of large amounts of data, support automation, improve decision making, and provide more accurate predictions. While several state-of-the-art reviews have been conducted within individual sub-domains of civil engineering (e.g., geotechnical engineering, structural engineering) or for specific application problems (e.g., structural damage detection, water quality evaluation), little effort has been devoted to a comprehensive review of ML models applied in civil engineering that compares them across sub-domains. A systematic but domain-specific literature review framework should be employed to effectively classify and compare the models. To that end, this study proposes a novel review approach based on the hierarchical classification tree “D-A-M-I-E (Domain-Application problem-ML models-Input data-Example case)”. The “D-A-M-I-E” classification tree classifies ML studies in civil engineering based on (1) the domain of civil engineering, (2) the application problem, (3) the applied ML models, and (4) the data used in the problem. Moreover, the data used for the ML models in each application example are examined based on the specific characteristics of the domain and the application problem. For a comprehensive review, five domains (structural engineering, geotechnical engineering, water engineering, transportation engineering, and energy engineering) are considered, and the ML application problems are divided into five types (prediction, classification, detection, generation, optimization). Based on the “D-A-M-I-E” classification tree, about 300 ML studies in civil engineering are reviewed. For each domain, analysis and comparison of the following questions have been conducted: (1) which problems are mainly solved with ML models, (2) which ML models are mainly applied in each domain and problem, (3) how advanced the ML models are, and (4) what kinds of data are used and what data processing is performed for the application of ML models. This paper also assesses the expansion and applicability of the proposed methodology to other areas (e.g., Earth system modeling, climate science). Furthermore, based on the identification of research gaps for ML models in each domain, this paper provides future directions for ML in civil engineering based on approaches to handling data (e.g., collection, handling, storage, and transmission) and hopes to support the application of ML models in other fields.

  • Conference Article
  • Cited by 7
  • 10.4043/31938-ms
Hybrid Modeling for Multiphase Flow Simulations
  • Apr 25, 2022
  • Johan Henriksson + 2 more

This paper provides an overview of the following three modeling approaches – Physics-based Modeling, Machine Learning, and Hybrid Modeling (Physics-based Modeling & Machine Learning combined), and addresses their applicability for multiphase flow simulations in specific use-cases. Physics-based modeling builds on well understood concepts in, e.g., thermodynamics, fluid dynamics, fluid modeling and optimization techniques. It requires deep domain knowledge as well as accurate fluid data and may incur significant computational cost. Machine Learning systems are based on learning algorithms, which find relationships between sensor data and output variables in a training dataset. The approach requires a good understanding of the learning algorithms and statistics. Small datasets or changes in operational conditions limit the suitability of this approach. Hybrid Models combine Physics with Machine Learning, and these models are on a sliding scale between pure Physics-based Models and pure Machine Learning Models. The individual use-case defines how the Hybrid Model is configured and where it sits on the sliding scale. Here, we will investigate three different use-cases: Physics supporting Machine Learning: Use physics to generate data to support training of machine learning models when data is sparse. Machine Learning supporting Physics: Use machine learning to provide additional input to physics modeling when available data does not suffice to achieve accurate numerical solutions. True Hybrid: Close the circle and use physics for feature engineering to improve the machine learning models, which then provide synthetic data as input to physics-based simulations. We observe that there is not one universal solution. On the one hand, flow simulations may address situations which are stochastic in nature and not deterministic. Here, a machine learning model based on physics-based simulations fits the purpose well. On the other hand, there are situations when machine learning can deduce relations between parameters such that you can provide additional input into physics-based models. One such example is to address the oil-water split at very high water cuts. By adopting the appropriate combination, the hybrid approach results in superior accuracy and offers the ability to address a broader range of applications than physics or machine learning alone.
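A toy sketch of the first use-case ("physics supporting machine learning") is given below: a simplified, assumed physics relation generates synthetic training data, and an ML regressor learns the mapping; the pressure-drop formula and parameter ranges are illustrative only.

```python
# Toy sketch of "physics supporting machine learning": a simplified physics
# relation generates synthetic training data where field data are sparse,
# and an ML regressor learns the mapping. All quantities are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

def physics_pressure_drop(velocity, density, diameter, friction=0.02, length=100.0):
    """Darcy-Weisbach-style pressure drop as a stand-in physics model."""
    return friction * (length / diameter) * 0.5 * density * velocity**2

# Physics-based simulation fills in the sparse regions of the sensor data.
velocity = rng.uniform(0.5, 10.0, 5000)
density = rng.uniform(700.0, 1000.0, 5000)
diameter = rng.uniform(0.1, 0.5, 5000)
X = np.column_stack([velocity, density, diameter])
y = physics_pressure_drop(velocity, density, diameter)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:4000], y[:4000])
print("MAE on held-out simulated data:",
      mean_absolute_error(y[4000:], model.predict(X[4000:])))
```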

  • Preprint Article
  • 10.5194/ems2025-562
Hydrological modelling using machine and deep learning models across multiple case studies
  • Jul 16, 2025
  • Majid Niazkar + 3 more

Machine learning (ML) and deep learning (DL) models can play an important role when it comes to modelling complicated processes. Such capability is necessary for hydrological and climate-related applications. Generally, ML models utilize precipitation and temperature time series of a basin as input to develop a lumped rainfall-runoff model to simulate streamflow at the basin outlet. However, when a basin is divided into several sub-basins, Graph Neural Networks (GNN) can consider each sub-basin as a node and link them together using a connectivity matrix to account for spatial variations of hydroclimatic variables. In this study, GNN and various ML models with different types of architecture, including neural networks, tree-based structures, and gradient boosting, were exploited for daily streamflow simulation over different case studies. For each case study, the basin was divided into a few sub-basins for which daily precipitation and temperature data were aggregated and used as input. For training the GNN, the connectivity matrix of sub-basins was also used as input. Overall, 75% of historical records were utilized to train the GNN and different ML models, e.g., artificial neural networks, support vector machine, decision tree, random forest, eXtreme Gradient Boosting (XGBoost), Light Gradient-Boosting Machine (LightGBM), and Category Boosting (CatBoost), while the rest was used for testing. Streamflow simulation was conducted with and without considering seasonality impact and lag times. The obtained results clearly demonstrate that considering seasonality and time lags can enhance the accuracy of streamflow predictions, as measured by the Kling–Gupta efficiency (KGE). Furthermore, GNN with seasonality impact and time lags achieved promising results across different case studies, with KGE>0.85 for training data and KGE>0.59 for testing data. Among ML models, boosting models, e.g., LightGBM and XGBoost, performed slightly better than the other ML models. Finally, this comparative analysis provides valuable insights for ML/DL applications in climate change impact assessments. Acknowledgements: This research work was carried out as part of the TRANSCEND project with funding received from the European Union Horizon Europe Research and Innovation Programme under Grant Agreement No. 10108411.
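The lag-time and seasonality feature construction and the Kling-Gupta efficiency scoring can be sketched as below; the synthetic daily series, lag choices, and gradient-boosting regressor are assumptions, not the study's basins or models.

```python
# Sketch of lagged/seasonal feature construction and Kling-Gupta efficiency
# scoring for a rainfall-runoff regressor; data, lags, and model choice here
# are illustrative assumptions, not the study's setup.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def kge(sim, obs):
    """Kling-Gupta efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Hypothetical daily basin data: precipitation, temperature, and streamflow.
rng = np.random.default_rng(0)
df = pd.DataFrame({"precip": rng.gamma(1.5, 3.0, 3650),
                   "temp": 10 + 10 * np.sin(np.arange(3650) * 2 * np.pi / 365),
                   "flow": rng.gamma(2.0, 5.0, 3650)})
for lag in (1, 2, 3):                                   # lag-time features
    df[f"precip_lag{lag}"] = df["precip"].shift(lag)
df["doy_sin"] = np.sin(2 * np.pi * np.arange(len(df)) / 365)   # seasonality features
df["doy_cos"] = np.cos(2 * np.pi * np.arange(len(df)) / 365)
df = df.dropna()

split = int(0.75 * len(df))                             # 75% train / 25% test
X, y = df.drop(columns="flow"), df["flow"]
model = GradientBoostingRegressor(random_state=0).fit(X[:split], y[:split])
print("KGE (test):", round(kge(model.predict(X[split:]), y[split:].to_numpy()), 3))
```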

  • Conference Article
  • 10.37308/dfi49.2024970301
Develop Machine Learning Models to Establish the Load-Settlement Curve of Piles from Cone Penetration Test Data
  • Oct 6, 2024
  • Murad Abu-Farsakh

The evaluation of load-settlement behavior of piles is crucial for meeting the serviceability criteria for pile analysis and design. The most reliable estimates of this behavior are obtained by conducting pile load tests. However, due to the considerable expense and time requirement of such in-situ testing, load-transfer methods have been used routinely in practice. In this paper, an alternative tree-based machine learning (ML) modeling approach is explored to predict the load-settlement behavior of axially loaded single piles from cone penetration test (CPT) data. Two variants of tree-based ML models, the random forest (RF) and gradient boosted tree (GBT), are developed in this study to estimate the load-settlement behavior of piles from CPT data (corrected cone tip resistance, qt, and sleeve friction, fs). A database of load-settlement curves of 64 static pile load tests and the corresponding CPT test data was compiled and used for the development of these ML models. The developed RF and GBT models are evaluated based on several statistical criteria. The load-settlement curves for six pile load tests (PLTs) predicted using the developed RF and GBT models were compared with the measured data and the load-settlement curves predicted using the conventional load-transfer methods. The results demonstrated the great potential of tree-based ML (RF, GBT) models for predicting the load-settlement behavior of axially loaded piles from CPT data. The comparison clearly shows that the ML models outperformed the conventional load-transfer methods. Between the two ML models, the results show that the GBT model outperformed the RF model.
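A rough sketch of the RF-vs-GBT comparison is shown below; the CPT features, the toy settlement response, and the cross-validation setup are invented for illustration and do not reflect the compiled 64-pile database.

```python
# Illustrative sketch (not the compiled pile database): compare random forest
# and gradient-boosted tree regressors for settlement prediction from
# CPT-derived features such as cone resistance (qt) and sleeve friction (fs).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
qt = rng.uniform(1.0, 30.0, 500)              # MPa, hypothetical corrected tip resistance
fs = rng.uniform(10.0, 300.0, 500)            # kPa, hypothetical sleeve friction
load = rng.uniform(100.0, 3000.0, 500)        # kN, applied pile-head load
X = np.column_stack([qt, fs, load])
settlement = load / (50.0 * qt + 0.2 * fs) + rng.normal(0, 0.5, 500)   # toy response, mm

for name, model in [("RF", RandomForestRegressor(random_state=0)),
                    ("GBT", GradientBoostingRegressor(random_state=0))]:
    r2 = cross_val_score(model, X, settlement, cv=5, scoring="r2").mean()
    print(f"{name} mean cross-validated R^2: {r2:.3f}")
```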

  • Research Article
  • Cited by 2
  • 10.1016/j.ailsci.2023.100089
Yoked learning in molecular data science
  • Dec 2, 2023
  • Artificial Intelligence in the Life Sciences
  • Zhixiong Li + 3 more

  • Research Article
  • Cited by 13
  • 10.3390/pr12061262
Learning More with Less Data in Manufacturing: The Case of Turning Tool Wear Assessment through Active and Transfer Learning
  • Jun 19, 2024
  • Processes
  • Alexios Papacharalampopoulos + 4 more

Monitoring tool wear is key for the optimization of manufacturing processes. To achieve this, machine learning (ML) has provided mechanisms that work adequately on setups that measure the cutting force of a tool through the use of force sensors. However, the increased focus on sustainability, i.e., on reducing the complexity, time, and energy consumption required to train ML algorithms on large datasets, dictates the use of smaller samples for training. Herein, the concepts of active learning (AL) and transfer learning (TL) are simultaneously studied concerning their ability to meet the aforementioned objective. A method is presented that uses AL to train ML models with less data and then uses TL to further reduce the need for training data when ML models are transferred from one industrial case to another. The method is tested and verified on an industrially relevant scenario to estimate the tool wear during the turning process of two manufacturing companies. The results indicated that through the application of the AL and TL methodologies, in both companies, it was possible to achieve high accuracy during the training of the final model (1 and 0.93 for manufacturing companies B and A, respectively). Additionally, reproducibility of the results has been tested to strengthen the outcomes of this study, resulting in a small standard deviation of 0.031 in the performance metrics used to evaluate the models. Thus, the novelty of this paper is a straightforward approach to apply AL and TL in the context of tool wear classification to reduce the dependency on large amounts of high-quality data. The results show that the synergistic combination of AL with TL can reduce the need for data required for training ML models for tool wear prediction.
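A minimal pool-based active-learning loop with least-confidence sampling, in the spirit of the AL step described above, might look like the following; the sensor features, labels, classifier, and labeling budget are hypothetical.

```python
# Minimal pool-based active-learning loop with uncertainty sampling; the
# stand-in force-sensor features, wear classes, and budget are invented.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_pool, y_pool = make_classification(n_samples=3000, n_features=12, n_classes=3,
                                     n_informative=6, random_state=0)
labeled = list(range(30))                      # small initial labeled set
budget = 150                                   # total labels we are willing to acquire

clf = LogisticRegression(max_iter=1000)
while len(labeled) < budget:
    clf.fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool)
    uncertainty = 1 - proba.max(axis=1)        # least-confidence criterion
    uncertainty[labeled] = -1                  # never re-query labeled points
    labeled.append(int(np.argmax(uncertainty)))

clf.fit(X_pool[labeled], y_pool[labeled])
print(f"trained with only {len(labeled)} labels out of {len(y_pool)}")
```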

  • Book Chapter
  • 10.1016/b978-0-443-13242-1.00015-1
Chapter 24 - Overview and comparison of reliability analysis techniques based on multifidelity Gaussian processes
  • Jan 1, 2024
  • Developments in Reliability Engineering
  • Romain Espoeys + 5 more

  • Research Article
  • Cited by 36
  • 10.2196/47833
Machine Learning Models for Blood Glucose Level Prediction in Patients With Diabetes Mellitus: Systematic Review and Network Meta-Analysis.
  • Nov 20, 2023
  • JMIR Medical Informatics
  • Kui Liu + 9 more

Machine learning (ML) models provide more choices to patients with diabetes mellitus (DM) to more properly manage blood glucose (BG) levels. However, because of numerous types of ML algorithms, choosing an appropriate model is vitally important. In a systematic review and network meta-analysis, this study aimed to comprehensively assess the performance of ML models in predicting BG levels. In addition, we assessed ML models used to detect and predict adverse BG (hypoglycemia) events by calculating pooled estimates of sensitivity and specificity. PubMed, Embase, Web of Science, and Institute of Electrical and Electronics Engineers Explore databases were systematically searched for studies on predicting BG levels and predicting or detecting adverse BG events using ML models, from inception to November 2022. Studies that assessed the performance of different ML models in predicting or detecting BG levels or adverse BG events of patients with DM were included. Studies with no derivation or performance metrics of ML models were excluded. The Quality Assessment of Diagnostic Accuracy Studies tool was applied to assess the quality of included studies. Primary outcomes were the relative ranking of ML models for predicting BG levels in different prediction horizons (PHs) and pooled estimates of the sensitivity and specificity of ML models in detecting or predicting adverse BG events. In total, 46 eligible studies were included for meta-analysis. Regarding ML models for predicting BG levels, the means of the absolute root mean square error (RMSE) in a PH of 15, 30, 45, and 60 minutes were 18.88 (SD 19.71), 21.40 (SD 12.56), 21.27 (SD 5.17), and 30.01 (SD 7.23) mg/dL, respectively. The neural network model (NNM) showed the highest relative performance in different PHs. Furthermore, the pooled estimates of the positive likelihood ratio and the negative likelihood ratio of ML models were 8.3 (95% CI 5.7-12.0) and 0.31 (95% CI 0.22-0.44), respectively, for predicting hypoglycemia and 2.4 (95% CI 1.6-3.7) and 0.37 (95% CI 0.29-0.46), respectively, for detecting hypoglycemia. Statistically significant high heterogeneity was detected in all subgroups, with different sources of heterogeneity. For predicting precise BG levels, the RMSE increases with a rise in the PH, and the NNM shows the highest relative performance among all the ML models. Meanwhile, current ML models have sufficient ability to predict adverse BG events, while their ability to detect adverse BG events needs to be enhanced. PROSPERO CRD42022375250; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=375250.

  • Preprint Article
  • 10.5194/egusphere-egu22-8321
Partitioning of green-blue water fluxes around the world: ML model explainability and predictability
  • Mar 28, 2022
  • Daniel Althoff + 1 more

The consequences of ever-increasing human interference with freshwater systems, e.g., through land-use and climate changes, are already felt in many regions of the world, e.g., by shifts in freshwater availability and partitioning between green (evapotranspiration) and blue (runoff) water fluxes around the world. In this study, we have developed a machine learning (ML) model for the possible prediction of green-blue water flux partitioning (WFP) under different climate, land-use, and other landscape and hydrological catchment conditions around the world. ML models have shown relatively high predictive performance compared to more traditional modelling methods for several tasks in geosciences. However, ML is also rightly criticized for providing theory-free “black-box” models that may fail in predictions under forthcoming non-stationary conditions. We here address the ML model interpretability gap using Shapley values, an explainable artificial intelligence technique. We also assess ML model predictability using a dissimilarity index (DI). For ML model training and testing, we use different parts of a total database compiled for 3482 hydrological catchments with available data for daily runoff over at least 25 years. The target variable of the ML model is the blue-water partitioning ratio between average runoff and average precipitation (and the complementary, water-balance determined green water partitioning ratio) for each catchment. The predictor variables are hydro-climatic, land-cover/use, and other catchment indices derived from precipitation and temperature time series, land cover maps, and topography data. As a basis for the ML modelling, we also investigate and quantify (through data averaging over moving sub-periods of different time lengths) a minimum temporal aggregation scale for water flux averaging (referred to as the flux equilibration time, T_eq) required to reach a stable temporal average runoff (and evapotranspiration) fraction of precipitation in each catchment; for 99% of catchments, T_eq is found to be ≤2 years, with longer T_eq emerging for catchments estimated to have higher ratio R_gw/R_avg, i.e., higher groundwater flow contribution (R_gw) to total average runoff (R_avg). The cubist model used for the ML modelling yields a Kling-Gupta efficiency of 0.86, while the Shapley values analysis indicates mean annual precipitation and temperature as the most important variables in determining the WFP, followed by average slope in each catchment. A DI threshold is further used to label new data points as inside or outside the ML model area of applicability (AoA). Comparison between test data points outside and inside the AoA reveals which catchment characteristics are mostly responsible for ML model loss of predictability. Predictability is lower for catchments with: larger T_eq and R_gw/R_avg; higher phase lag between peak precipitation and peak temperature over the year; lower forest and agricultural land fractions; and aridity index much higher or much lower than 1 (implying major water or energy limitation, respectively). Identifying such predictability limits is crucial for understanding, and facilitating user awareness of, the applicability and forecasting ability of such data-driven ML modelling under different prevailing and changing future hydro-climatic, land-use, and groundwater conditions.
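A small sketch of the SHAP-style attribution step is given below using a generic tree ensemble and invented catchment-like features; it is not the study's cubist model, its predictor set, or its 3482-catchment database.

```python
# Sketch of SHAP-based feature attribution for a tree regressor; the model
# and synthetic catchment-like features are placeholders, not the study's.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({"mean_annual_precip": rng.uniform(300, 2500, 500),
                  "mean_annual_temp": rng.uniform(-5, 25, 500),
                  "avg_slope": rng.uniform(0, 30, 500)})
# Toy blue-water partitioning ratio (runoff / precipitation) as the target.
y = np.clip(0.6 - 0.01 * X["mean_annual_temp"] + 0.0001 * X["mean_annual_precip"]
            + rng.normal(0, 0.05, 500), 0, 1)

model = RandomForestRegressor(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)         # per-sample, per-feature attributions
print("mean |SHAP| per feature:",
      dict(zip(X.columns, np.abs(shap_values).mean(axis=0).round(4))))
```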

  • Dissertation
  • 10.14264/uql.2014.263
Machine Learning as an Adjunct to Clinical Decision Making in Alcohol Dependence Treatment
  • Jan 1, 2014
  • Martyn Symons

This thesis investigated the potential for a machine learning (ML) model, instantiated as a Decision Support System (DSS), to assist psychologists when making prospective predictions for alcohol dependence treatment outcomes. Predictions were made for patients undertaking a 12 week, abstinence-based, Cognitive Behavioural Therapy (CBT) program assisted by voluntary adjunctive medication (acamprosate and/or naltrexone). Success was defined as attending all treatment sessions while maintaining abstinence. A series of studies examined: the ability of clinical staff (N = 10) to predict treatment outcome on the basis of clinical data alone (Study 1, 50 consecutively treated patients); ML models to predict outcome for the same 50 patients, trained on 780 previously treated patients (Study 2); prospective intuitive psychologist predictions for 220 consecutive patients (Study 3); prospective ML model predictions for the same 220 patients, trained on 1016 previously treated patients (Study 4); and the clinically integrated application and efficacy of a DSS based on a naive Bayesian model to assist psychologists when predicting patient outcome (Study 5). Initially, the mean aggregate accuracy of clinicians when predicting with patient data alone (Study 1, 56.10%) was not significantly (p > .05) different from that expected by chance, nor was the mean aggregate accuracy (Study 2, 58.57%) of the ML models on the same 50 patients. However, two clinicians and six ML models were significantly (p < .05) more accurate than expected by chance alone. The maximum accuracy achieved by a psychologist was 66%, and 78% for an ML model. When prospective predictions were made for 220 patients, the mean accuracy of psychologists (Study 3, 56.36%) was not significantly different (p > .05) from chance alone, whereas the mean accuracy of the ML models (Study 4, 63.95%) was significantly different (p < .05) for all but the two least accurate models. The 10.59% higher accuracy of the ML models meant that there was a significant (p < .05) difference in accuracy. The best ML model achieved an accuracy of 70.91%. This suggested that ML models were accurate enough to be suitable for a DSS to assist psychologists when making predictions. Psychologist probability estimates for the percentage chance of a successful patient outcome were significantly correlated with outcomes when making intuitive predictions, although the correlation (r_pb = .163) was low. The findings for the ML models were also encouraging, with a significant relationship between probability estimates and outcome at a weak to moderate strength (r_pb = .133-.251). Furthermore, there was little evidence of overconfidence in psychologists' predictions as evinced in previously published studies. These results taken together suggest that psychologists could potentially decide when to accept DSS advice in a principled fashion and would be less likely to reject the advice due to high confidence levels. A naive Bayesian model was integrated as a DSS into the normal clinical practice workflow using an intuitive and easy to use graphical interface developed by the author (Study 5). Psychologists initially made an intuitive prediction for their patients after the first session of treatment and had the option to request a DSS prediction. Predictions were requested for 57 of the 106 patients treated during this study. After viewing the DSS prediction, psychologists were offered a chance to review their initial choice. The initial accuracy of the DSS (49.12%) was hindered by clinical challenges that arose out of the ‘in vivo’ evaluation in the context of a busy public hospital. When tested using an ‘ideal’ cleaned data-set post study, the potential accuracy of the DSS was 59.65%. However, it performed statistically no worse than psychologists (64.91%). A combined voting system, choosing the prediction with the highest estimated probability, would have been the most accurate (66.67%). However, psychologists did not alter any predictions in the final study after viewing the DSS prognosis. Given that the DSS fulfilled the requirements found for successful implementation in medical settings, the unique requirements of psychological therapy must be further examined before the successful future deployment of a DSS into a behavioural treatment environment. The variables identified as most important for prediction were significantly different for psychologists and ML feature selection approaches. Furthermore, categories of variables previously unmeasured at the clinical site were identified as important by psychologists, including social support, commitment/motivation, and short-term drinking history before treatment. Capturing these variables could potentially improve ML accuracy. This thesis demonstrated proof of concept and provided early efficacy data for improving prediction of treatment outcome for alcohol dependence, using novel ML approaches combined with a DSS.
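The kind of naive Bayesian model behind such a DSS can be sketched as follows; the patient features, outcome definition, and probabilities here are invented placeholders rather than the clinic's variables.

```python
# Sketch of a naive Bayes model returning a probability of treatment success,
# the kind of model the DSS wraps; all features and the outcome rule below
# are invented for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.integers(18, 70, n),        # age
                     rng.integers(0, 2, n),          # adjunctive medication (0/1)
                     rng.uniform(0, 40, n)])         # drinks per week at intake
y = (0.4 * (X[:, 1] == 1) + 0.3 * (X[:, 2] < 15) + rng.uniform(0, 0.6, n)) > 0.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
nb = GaussianNB().fit(X_tr, y_tr)

p_success = nb.predict_proba(X_te[:1])[0, 1]         # probability shown to the clinician
print(f"estimated chance of completing treatment abstinent: {p_success:.0%}")
print(f"hold-out accuracy: {nb.score(X_te, y_te):.2f}")
```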

  • Conference Article
  • Cited by 4
  • 10.2118/215220-ms
Machine Learning Sweet Spot Identification and Performance Validation Utilising Reservoir and Completion Data from Unconventional Reservoir in British Columbia, Canada
  • Oct 6, 2023
  • Junghun Leem + 6 more

Production in an unconventional reservoir varies widely depending on reservoir characteristics (e.g., thickness, permeability, brittleness, natural fracturing) and completion design (e.g., well spacing, frac spacing, proppant volume). A comprehensive method of data analytics and predictive Machine Learning (ML) modeling was developed and deployed in the Montney unconventional siltstone gas reservoir, British Columbia, Canada, to identify production zone "sweet spots" from reservoir quality data (i.e., geological, geophysical, and geomechanical) and completion quality data (e.g., frac spacing, fluid volume, and proppant intensity), which were utilized to enhance and optimize production performance of this unconventional reservoir. Typical data analytics and predictive ML modeling utilizes all the reservoir quality data and completion quality data together. The completion quality data tends to dominate over the reservoir quality data, because of a higher statistical correlation (i.e., weight) of the completion data to observed production. Hence, resulting predictive ML models commonly underestimate the effects of the reservoir quality on production, and exaggerate the influence of the completion quality data. To overcome these shortcomings, the reservoir quality data and the completion quality data are separated and normalized independently. The normalized reservoir and completion quality data are utilized to identify sweet spots and optimize completion design, respectively, through predictive ML modelling. This novel methodology of predictive ML modeling has identified sweet spots from key controlling reservoir quality data as well as prescribed optimal completion designs from key controlling completion quality data. The trained predictive ML model was tested by a blind test (R2=79.0%) from 1 year of cumulative production from 6 Montney wells in the Town Pool, which was also validated by recent completions from 6 other Town Montney Pool wells (R2=78.7%).
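The separate-normalization idea can be sketched as below, scaling reservoir-quality and completion-quality feature groups independently before modeling; the column names, toy data, and regressor are hypothetical.

```python
# Sketch of the separate-normalization idea: reservoir-quality and
# completion-quality feature groups are scaled by their own transformers, so
# each group can be treated independently before modeling. Names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor

reservoir_cols = ["thickness", "permeability", "brittleness"]
completion_cols = ["frac_spacing", "fluid_volume", "proppant_intensity"]

preprocess = ColumnTransformer([
    ("reservoir_quality", StandardScaler(), reservoir_cols),
    ("completion_quality", StandardScaler(), completion_cols),
])
model = Pipeline([("scale", preprocess),
                  ("gbm", GradientBoostingRegressor(random_state=0))])

# Toy well data standing in for the field dataset.
rng = np.random.default_rng(0)
well_df = pd.DataFrame(rng.uniform(0, 1, size=(200, 6)),
                       columns=reservoir_cols + completion_cols)
well_df["cum_production_1yr"] = (well_df["thickness"] + well_df["proppant_intensity"]
                                 + rng.normal(0, 0.1, 200))
model.fit(well_df[reservoir_cols + completion_cols], well_df["cum_production_1yr"])
print("in-sample R^2:",
      round(model.score(well_df[reservoir_cols + completion_cols],
                        well_df["cum_production_1yr"]), 3))
```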

  • Research Article
  • Cited by 2
  • 10.1186/s12874-025-02694-z
Comparison of machine learning methods versus traditional Cox regression for survival prediction in cancer using real-world data: a systematic literature review and meta-analysis
  • Oct 28, 2025
  • BMC Medical Research Methodology
  • Yinan Huang + 6 more

Background: Accurate prediction of survival in oncology can guide targeted interventions. The traditional regression-based Cox proportional hazards (CPH) model has statistical assumptions and may have limited predictive accuracy. With the capability to model large datasets, machine learning (ML) holds the potential to improve the prediction of time-to-event outcomes, such as cancer survival outcomes. The present study aimed to systematically summarize the use of ML models for cancer survival outcomes in observational studies and to compare the performance of ML models with CPH models. Methods: We systematically searched PubMed, MEDLINE (via EBSCO), and Embase for studies that evaluated ML models vs. CPH models for cancer survival outcomes. The use of ML algorithms was summarized, and either the area under the curve (AUC) or the concordance index (C-index) for the ML and CPH models was presented descriptively. Only studies that provided a measure of discrimination, i.e., AUC or C-index, and a 95% confidence interval (CI) were included in the final meta-analysis. A random-effects model was used to compare the predictive performance in the pooled AUC or C-index estimates between ML and CPH models using R. The quality of the studies was evaluated using available checklists. Multiple sensitivity analyses were performed. Results: A total of 21 studies were included for systematic review and 7 for meta-analysis. Across the 21 articles, diverse ML models were used, including random survival forest (N=16, 76.19%), gradient boosting (N=5, 23.81%), and deep learning (N=8, 38.09%). In predicting cancer survival outcomes, ML models showed no superior performance over CPH regression. The standardized mean difference in AUC or C-index was 0.01 (95% CI: -0.01 to 0.03). Results from the sensitivity analyses confirmed the robustness of the main findings. Conclusions: ML models had similar performance compared with CPH models in predicting cancer survival outcomes. Although this systematic review highlights the promising use of ML to improve the quality of care in oncology, findings from this review also suggest opportunities to improve ML reporting transparency. Future systematic reviews should focus on the comparative performance between specific ML models and CPH regression in time-to-event outcomes in specific types of cancer or other disease areas. Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-025-02694-z.
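A hedged sketch of the random-effects pooling step (DerSimonian-Laird, implemented directly in Python rather than R) is shown below; the per-study effects and standard errors are invented and are not the review's extracted data.

```python
# Sketch of DerSimonian-Laird random-effects pooling of per-study effect
# estimates (e.g., ML-minus-CPH differences in C-index); the effects and
# standard errors below are invented for illustration.
import numpy as np

effects = np.array([0.02, -0.01, 0.03, 0.00, 0.01, -0.02, 0.04])   # per-study differences
se = np.array([0.015, 0.020, 0.018, 0.010, 0.025, 0.022, 0.030])   # their standard errors

w_fixed = 1 / se**2                                 # fixed-effect (inverse-variance) weights
q = np.sum(w_fixed * (effects - np.average(effects, weights=w_fixed)) ** 2)
dof = len(effects) - 1
c = w_fixed.sum() - (w_fixed**2).sum() / w_fixed.sum()
tau2 = max(0.0, (q - dof) / c)                      # between-study variance estimate

w_random = 1 / (se**2 + tau2)                       # random-effects weights
pooled = np.average(effects, weights=w_random)
pooled_se = np.sqrt(1 / w_random.sum())
lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled difference: {pooled:.3f} (95% CI {lo:.3f} to {hi:.3f}), tau^2 = {tau2:.5f}")
```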
