Synergistic effects between data corpora properties and machine learning performance in data pipelines

Roberto Bertolini, Stephen J Finch

doi:10.1504/ijdmmm.2022.125261

Abstract

To analyse data, a computationally feasible pipeline must be developed for data modelling. Properties of the data corpus affect the performance variability of machine learning (ML) techniques in pipelines; however, this relationship has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n-small-p corpora, examining: 1) the choice of ML algorithm; 2) the size of the training database; 3) measurement error; 4) class imbalance magnitude; and 5) missing data pattern. Our simulations are consistent with established results on the conditions under which these algorithms perform best, while providing insights into the synergistic effects of algorithm choice and corpora properties. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance magnitudes, and missing data patterns. We discuss the implications of these findings for designing pipelines to enhance prediction performance.
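
To make the study design concrete, the following is a minimal sketch of a Monte Carlo comparison of AUC across three of the listed factors (training size, measurement error, class imbalance). It is not the authors' pipeline: the abstract does not give the algorithms, effect sizes, or missing-data mechanisms used, so the data-generating model (Gaussian features shifted by 0.5 for the positive class), the toy mean-difference scorer, the factor levels, and every function name below are assumptions chosen for illustration. Missing data patterns are left out entirely.

```python
import random
import statistics
from itertools import product


def simulate(n, p, pos_rate, noise_sd, rng):
    """Return (X, y): n rows, p features, a fixed share of positive labels."""
    n_pos = min(n - 1, max(1, round(n * pos_rate)))  # keep both classes non-empty
    labels = [1] * n_pos + [0] * (n - n_pos)
    rows = []
    for label in labels:
        # Assumed signal: positives are shifted by 0.5 on every feature.
        # Assumed measurement error: zero-mean Gaussian noise added to each value.
        rows.append([rng.gauss(0.5 * label, 1.0) + rng.gauss(0.0, noise_sd)
                     for _ in range(p)])
    return rows, labels


def fit_mean_difference(X, y):
    """Toy linear scorer: weights are the difference of class-wise feature means."""
    p = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return [statistics.fmean(r[j] for r in pos) - statistics.fmean(r[j] for r in neg)
            for j in range(p)]


def score(weights, X):
    """Linear score for each row."""
    return [sum(w * v for w, v in zip(weights, row)) for row in X]


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def run_study(train_sizes=(100, 1000), noise_sds=(0.0, 2.0), pos_rates=(0.5, 0.1),
              p=5, n_test=300, reps=20, seed=0):
    """Mean test AUC for every combination of the (hypothetical) factor levels."""
    rng = random.Random(seed)
    results = {}
    for n_train, noise_sd, pos_rate in product(train_sizes, noise_sds, pos_rates):
        aucs = []
        for _ in range(reps):
            # Measurement error is applied to training and test data alike (assumption).
            X_tr, y_tr = simulate(n_train, p, pos_rate, noise_sd, rng)
            X_te, y_te = simulate(n_test, p, pos_rate, noise_sd, rng)
            weights = fit_mean_difference(X_tr, y_tr)
            aucs.append(auc(score(weights, X_te), y_te))
        results[(n_train, noise_sd, pos_rate)] = statistics.fmean(aucs)
    return results


if __name__ == "__main__":
    for cell, mean_auc in run_study().items():
        print(cell, round(mean_auc, 3))
```

Each key of the returned dictionary is one cell of the factorial design, so interactions can be read off by comparing cells, for example how much of the AUC lost to measurement error is recovered when the training size grows. A fuller study would add further algorithms and missing-data mechanisms as extra factors in the same loop.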

Published in: International Journal of Data Mining, Modelling and Management

Similar Papers

Can we explain machine learning-based prediction for rupture status assessments of intracranial aneurysms?
N Mu ... J Tang
Biomedical Physics & Engineering Express | VOL. 9
10 Mar 2023

Machine learning to predict in-hospital mortality risk among heterogenous STEMI patients with diabetes
S Kasim ... K S Ibrahim
European Heart Journal | VOL. 43
04 Feb 2022

Racial and Ethnic Disparities in Predictive Accuracy of Machine Learning Algorithms Developed Using a National Database for 30-Day Complications Following Total Joint Arthroplasty
Christian A Pean ... Young-Min Kwon
The Journal of Arthroplasty | VOL. -
01 Oct 2024

An External-Validated Prediction Model to Predict Lung Metastasis among Osteosarcoma: A Multicenter Analysis Based on Machine Learning.
Wenle Li ... Fida Hussain Memon
Computational Intelligence and Neuroscience | VOL. 2022
06 May 2022
