Automated data preparation for in vivo tumor characterization with machine learning.

Denis Krajnc, Laszlo Papp, Tatjana Traub-Weidinger, Alexander R Haug, Clemens P Spielvogel, Marko Grahovac, Hussain Alizadeh, Zsombor Ritter, Boglarka Ecsedi, Sazan Rasul, Thomas Beyer, Nina Poetsch, Marcus Hacker

doi:10.3389/fonc.2022.1017911

Abstract

Background: This study proposes machine learning-driven data preparation (MLDP) for optimal data preparation (DP) prior to building prediction models for cancer cohorts.

Methods: A collection of well-established DP methods was incorporated to build DP pipelines for various clinical cohorts prior to machine learning. Evolutionary algorithm principles combined with hyperparameter optimization were employed to iteratively select the best-fitting subset of data preparation algorithms for the given dataset. The proposed method was validated for glioma and prostate single-center cohorts by a 100-fold Monte Carlo (MC) cross-validation scheme with an 80-20% training-validation split ratio. In addition, a dual-center diffuse large B-cell lymphoma (DLBCL) cohort was utilized, with Center 1 as the training dataset and Center 2 as the independent validation dataset, to predict cohort-specific clinical endpoints. Five machine learning (ML) classifiers were employed for building prediction models across all analyzed cohorts. Predictive performance was estimated by confusion matrix analytics over the validation sets of each cohort. The performance of each model with MLDP, without MLDP, and with manually defined DP was compared in each of the four cohorts.

Results: Sixteen of twenty established predictive models demonstrated an increase in area under the receiver operating characteristic curve (AUC) when utilizing MLDP. MLDP resulted in the highest performance increase for the random forest (RF) (+0.16 AUC) and support vector machine (SVM) (+0.13 AUC) model schemes for predicting 36-month survival in the glioma cohort. Single-center cohorts yielded complex DP pipelines (6-7 DP steps), with a high occurrence of outlier detection, feature selection and synthetic minority oversampling technique (SMOTE). In contrast, the optimal DP pipeline for the dual-center DLBCL cohort included only the outlier detection and SMOTE DP steps.

Conclusions: This study demonstrates that data preparation prior to ML prediction model building in cancer cohorts should itself be ML-driven, yielding optimal prediction models in both single and multi-centric settings.
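
The validation scheme named in the Methods (repeated random 80-20% training-validation splits, scored by AUC) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the toy one-feature scorer standing in for a real classifier, and the synthetic data are all assumptions introduced here, and no DP steps or MLDP search are included.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive sample gets the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_difference_scorer(x_train, y_train, x_val):
    """Hypothetical stand-in for a classifier: orient the single feature so
    that higher scores point toward the positive class seen in training."""
    pos = [x for x, y in zip(x_train, y_train) if y == 1]
    neg = [x for x, y in zip(x_train, y_train) if y == 0]
    sign = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return [sign * x for x in x_val]


def monte_carlo_cv_auc(x, y, fit_predict, n_folds=100, train_fraction=0.8, seed=0):
    """Repeat random train/validation splits and average the validation AUC.

    Returns (mean_auc, n_valid_folds). Splits in which either part lacks one
    of the two classes are skipped, because AUC is undefined there.
    """
    rng = random.Random(seed)
    n_train = round(train_fraction * len(y))
    aucs = []
    for _ in range(n_folds):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        train, val = idx[:n_train], idx[n_train:]
        y_train = [y[i] for i in train]
        y_val = [y[i] for i in val]
        if len(set(y_train)) < 2 or len(set(y_val)) < 2:
            continue
        scores = fit_predict([x[i] for i in train], y_train, [x[i] for i in val])
        aucs.append(auc(scores, y_val))
    if not aucs:
        raise ValueError("no split contained both classes")
    return sum(aucs) / len(aucs), len(aucs)


# Synthetic, perfectly separable toy data (an assumption, not study data).
toy_x = [float(i) for i in range(50)]
toy_y = [1 if i >= 25 else 0 for i in range(50)]
mean_auc, n_valid = monte_carlo_cv_auc(toy_x, toy_y, mean_difference_scorer)
print(f"mean validation AUC over {n_valid} folds: {mean_auc:.3f}")
```

To compare models with and without a DP pipeline under this scheme, one would wrap the DP steps inside `fit_predict` so that they are fitted on the training part of each split only; fitting them on the full dataset would leak validation information into training.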

Journal: Frontiers in oncology	Publication Date: Oct 11, 2022
Citations: 4	License type: CC BY 4.0
