Abstract

This study uses an empirical analysis to quantify the downstream effects of data pre-processing choices. Bootstrap data simulation is used to measure the bias-variance decomposition of an empirical risk function, mean squared error (MSE). Results of the risk decomposition are used to measure the effects of model development choices on model bias, variance, and irreducible error. Measurements of bias and variance are then applied as diagnostic procedures for model pre-processing and development. The best-performing combinations of model, normalization, and data structure were identified to illustrate the downstream effects of these development choices. In addition, results from the simulations were verified and extended to additional data characteristics (imbalanced, sparse) by testing on benchmark datasets available from the UCI Machine Learning Repository. Normalization results on the benchmark data were consistent with those found in the simulations, while also showing that more complex and/or non-linear models perform better on datasets with these additional complexities. Finally, applying the findings from the simulation experiments to previously tested applications yielded equivalent or improved results with less model development overhead and processing time.
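For reference, the standard pointwise bias-variance decomposition of MSE assumed here (the abstract does not reproduce the paper's exact estimator) is

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}.
\]

A minimal sketch of the kind of bootstrap-based measurement the abstract describes is shown below; the toy data, the linear model, and all variable names are illustrative assumptions rather than the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: y = f(x) + noise, with f linear here.
n = 200
X = rng.uniform(-2, 2, size=(n, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Fixed evaluation grid where bias and variance are measured.
X_eval = np.linspace(-2, 2, 50).reshape(-1, 1)
f_eval = 1.5 * X_eval[:, 0]          # known true function on the grid

# Bootstrap: refit the model on resampled training sets and
# collect its predictions on the evaluation grid.
B = 500
preds = np.empty((B, len(X_eval)))
for b in range(B):
    idx = rng.integers(0, n, size=n)  # sample with replacement
    model = LinearRegression().fit(X[idx], y[idx])
    preds[b] = model.predict(X_eval)

# Pointwise decomposition estimates, averaged over the grid.
# The irreducible error (sigma^2 = 0.25 in this toy setup) is not
# estimated by the bootstrap itself.
bias_sq = np.mean((preds.mean(axis=0) - f_eval) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```

In practice, repeating this procedure across different pre-processing (e.g., normalization) choices is what allows the resulting bias and variance estimates to be used as diagnostics, as described above.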

Highlights

  • Results found from simulations were verified and extended to additional data characteristics by testing on benchmark datasets available from the UCI Machine Learning Repository.

  • Popularized in the work of David Wolpert and William Macready, the No Free Lunch (NFL) theorem states that no single machine learning algorithm is better than all others on all problems [1].

  • Benchmark datasets were selected from the UCI Machine Learning Repository to cover data types similar to those covered in the simulations.


Introduction

Popularized in the work of David Wolpert and William Macready, the No Free Lunch (NFL) theorem states that no single machine learning algorithm is better than all others on all problems [1]. A study published in May 2020 expands on Carp's findings, noting that fMRI analyses conducted on the same data by seventy different laboratories produced a wide range of results [5]. That study highlighted the fact that fMRI analysis requires several stages of pre-processing and analysis to determine which areas of the brain show activity, and it found that the choice of pre-processing pipeline led to widely varied results.

