Abstract

Biomedicine is rich in massive data that are heterogeneous, evolving, complex, and unstructured, and that come from autonomous sources (the characteristics described by the HACE theorem). Big data mining has become one of the most fascinating and fastest-growing areas; it enables the selection, exploration, and modeling of vast amounts of medical data to support clinical decision making, prevent medication errors, and improve patient outcomes. Given the complexity and unstructured nature of biomedical data, it is acknowledged that no single data mining method is best for all applications. An appropriate process and algorithm for big data mining is therefore essential for obtaining trustworthy results. To date, however, there is no guideline for making this choice, particularly regarding an adequate training-set sample size for reliable results. Sample size is of central importance because biomedical data do not come cheap: acquiring them takes time and human effort and is usually very expensive. On the other hand, a small sample size may lead to overestimates of predictive accuracy because of overfitting. The purpose of this paper is to provide a guideline for determining a sample size that yields robust accuracy. Because growth in data volume adds complexity and has a significant impact on accuracy, we examined the relationship among sample size, data variation, and the performance of different data mining methods, including SVM, Naive Bayes, Logistic Regression, and J48, using simulation and two sets of biomedical data. The simulation results revealed that sample size can dramatically affect the performance of data mining methods under a given data variation, and that this effect is most pronounced in the nonlinear case. For experimental biomedical data, it is essential to examine the impact of sample size and data variation on performance in order to determine the sample size.
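
To make the kind of simulation described above concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a toy one-feature, two-class Gaussian data set in which the standard deviation `sigma` stands in for "data variation", and it uses a simple nearest-centroid classifier as a stand-in for the methods studied (SVM, Naive Bayes, Logistic Regression, J48). All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def make_data(n, sigma, rng):
    """Draw n labelled points: class 0 centred at -1, class 1 at +1 (toy assumption)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        data.append((rng.gauss(center, sigma), label))
    return data


def fit_centroids(train):
    """Nearest-centroid 'training': store the mean feature value of each class."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        # A very small sample may contain only one class; fall back to 0.0.
        means[label] = statistics.fmean(values) if values else 0.0
    return means


def accuracy(means, data):
    """Fraction of points whose nearest class centroid matches the true label."""
    correct = 0
    for x, y in data:
        predicted = min(means, key=lambda c: abs(x - means[c]))
        correct += predicted == y
    return correct / len(data)


def learning_curve(sizes, sigma, repeats=100, test_size=1000, seed=0):
    """For each training size, return (mean, std) of accuracy on a fixed test set."""
    rng = random.Random(seed)
    test = make_data(test_size, sigma, rng)
    curve = {}
    for n in sizes:
        scores = []
        for _ in range(repeats):
            model = fit_centroids(make_data(n, sigma, rng))
            scores.append(accuracy(model, test))
        curve[n] = (statistics.fmean(scores), statistics.pstdev(scores))
    return curve


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0):
        print(f"sigma = {sigma}")
        for n, (mean, std) in learning_curve([4, 16, 64, 256], sigma).items():
            print(f"  n = {n:4d}  mean accuracy = {mean:.3f}  std = {std:.3f}")
```

Running the script prints, for each level of variation, how the mean and spread of held-out accuracy change as the training set grows. A sample size could then be judged adequate once the mean stops improving and the spread becomes small, though where that happens depends on the data and on the classifier.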
