Validation Guideline for Small Scale Dataset Classification Result in Medical Domain

Ee Kim Hwe, Zeratul Izzah Mohd Yusoh

doi:10.1007/978-3-319-76351-4_28

Abstract

Datasets are essential for analysis in any field, but their availability varies widely. In the medical domain, dataset sizes range from millions of records to only tens. Analysis tends to emphasise large-scale datasets because they give a broad overview of the situation. However, in some cases only small datasets are available, owing to constraints such as the high cost and long hours of data collection. This research provides a clustering validation guideline for small-scale datasets so that future users can verify the usability of clustering results obtained from them. The domain focus is fetal cardiotocography, and small-scale datasets of four different sizes are used. These four datasets are compared with a large-scale dataset of 2126 records. K-Means is chosen as the clustering technique because it is widely used, especially in the medical field. Six validation indexes are selected to validate the K-Means clustering for all datasets. The results are then subjected to the Anderson-Darling test to assess normality. The guideline continues with a choice of statistical test: either the non-parametric Wilcoxon Signed Rank test or the parametric Paired-Sample T-Test. Lastly, the statistical test result is checked against a threshold value to determine the validity of a small-scale dataset.
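
The test-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not state them: that normality is checked on the paired differences between the validation index values of a small-scale dataset and those of the large-scale dataset, that the significance level is 5%, and that the small-sample-adjusted Anderson-Darling statistic with Stephens' critical value of 0.752 (mean and variance estimated from the data) is the decision rule. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

# Assumed 5% critical value for the adjusted statistic when the mean and
# variance are estimated from the sample (Stephens' case 3).
CRITICAL_5_PERCENT = 0.752


def anderson_darling_normal(sample):
    """Return the small-sample-adjusted Anderson-Darling statistic for normality."""
    n = len(sample)
    if n < 3:
        raise ValueError("need at least three observations")
    m, s = mean(sample), stdev(sample)
    if s == 0:
        raise ValueError("sample is constant")
    std_normal = NormalDist()
    eps = 1e-12  # keep the CDF away from 0 and 1 so the logarithms stay finite
    cdf = [min(max(std_normal.cdf((x - m) / s), eps), 1 - eps)
           for x in sorted(sample)]
    # A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(y_i) + ln(1 - F(y_{n+1-i}))]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


def choose_paired_test(small_scores, large_scores):
    """Pick the paired test to apply, based on normality of the differences."""
    if len(small_scores) != len(large_scores):
        raise ValueError("paired samples must have equal length")
    # Assumption: normality is assessed on the paired differences.
    diffs = [a - b for a, b in zip(small_scores, large_scores)]
    if anderson_darling_normal(diffs) < CRITICAL_5_PERCENT:
        return "paired_t_test"        # normality not rejected: parametric test
    return "wilcoxon_signed_rank"     # normality rejected: non-parametric test
```

In practice, a statistics library would normally be used for both the normality check and the chosen test; in SciPy, for example, the corresponding functions are `scipy.stats.anderson`, `scipy.stats.ttest_rel`, and `scipy.stats.wilcoxon`. The final threshold comparison mentioned in the abstract is omitted here because the threshold value is not given.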

Full Text