Abstract

The size of today's biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data sets by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this step can be optimized to obtain samples that reflect the entire data set better than those obtained using the current standard method. By repeating the random sampling and comparing the distribution of each drawn sample with the distribution of the original data, it was possible to establish a method for obtaining data subsets that reflect the entire data set better than taking only the first randomly selected subsample, as is the current standard. Experiments on artificial and real biomedical data sets showed that the reconstruction, from the downsampled data, of the remaining data in the original data set improved significantly. This was observed with both principal component analysis and autoencoding neural networks. The fidelity depended on both the number of cases drawn from the original data set and the number of repeated samples drawn. Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data set better than those obtained with the standard method. Because distributional similarity is the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
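The selection procedure described above can be sketched in a few lines of code. The following is a minimal illustration only, assuming a per-feature two-sample Kolmogorov–Smirnov statistic as the distributional similarity measure; the function name, parameters, and choice of statistic are illustrative assumptions and not the authors' implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

def distribution_preserving_downsample(X, y, frac=0.1, n_trials=100, seed=0):
    """Draw n_trials class-proportional random subsamples and keep the one whose
    per-feature distributions are closest to those of the full data set
    (smallest mean two-sample Kolmogorov-Smirnov statistic)."""
    rng = np.random.default_rng(seed)
    best_idx, best_score = None, np.inf
    for _ in range(n_trials):
        # Class-proportional (stratified) random draw without replacement.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c),
                       size=max(1, int(round(frac * np.sum(y == c)))),
                       replace=False)
            for c in np.unique(y)
        ])
        # Distributional similarity: mean KS statistic over all features.
        score = np.mean([ks_2samp(X[idx, j], X[:, j]).statistic
                         for j in range(X.shape[1])])
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx

# Example: keep the best 10% subsample according to 100 candidate draws.
# subset = X[distribution_preserving_downsample(X, y, frac=0.1, n_trials=100)]
```

Setting n_trials = 1 reduces the procedure to the standard single random class-proportional draw; larger values trade computation time for a subsample whose distribution lies closer to that of the full data.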

Highlights

  • With the development of biomedical research over the past two decades, data sets have become increasingly large

  • Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data set better than those obtained with the standard method

  • Sources for optimal distribution-preserving data downsampling are indicated, along with their description, in this report


Introduction

With the development of biomedical research over the past two decades, data sets have become increasingly large. The number of operations performed in the analysis of large biomedical data sets, and the size of the data stored, can exceed the capacity of today's computers. This can happen even with seemingly simple tasks such as data projection for visualization. The number of unique distances, n_dist, between data points that need to be calculated for this task grows with the square of the number of instances n; precisely, n_dist = (n² − n) / 2 = O(n²). For … data points, this gives … values; for …
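To illustrate this quadratic growth (the figures below are an illustration, not taken from the article), the following snippet computes the number of unique pairwise distances and the memory needed to store them as double-precision values:

```python
# Illustration: unique pairwise distances and the memory needed to hold them
# as 8-byte (double-precision) floating-point values.
for n in (1_000, 100_000, 1_000_000):
    n_dist = n * (n - 1) // 2        # (n^2 - n) / 2 unique distances
    gib = n_dist * 8 / 2**30         # bytes -> GiB
    print(f"n = {n:>9,d}  ->  {n_dist:>15,d} distances  (~{gib:,.1f} GiB)")
```

Already at one million data points, the distance values alone would occupy a few terabytes, which motivates downsampling before such analyses.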

