Addressing the Links Between Dimensionality and Data Characteristics in Gene-Expression Microarrays

J Salvador Sánchez, Vicente García

doi:10.1145/3230905.3230909

Abstract

In gene-expression microarray data sets, each sample is defined by hundreds or thousands of measurements. High-dimensional data spaces have been reported as a significant obstacle to applying machine learning algorithms, owing to the associated phenomenon known as the 'curse of dimensionality'. Consequently, the analysis (and interpretation) of these data sets has become a challenging problem. The hypothesis set out in this paper is that the curse of dimensionality is directly linked to other intrinsic data characteristics, such as class overlapping and class separability. To examine this hypothesis, we carried out a series of experiments over four gene-expression microarray databases, because these data are a typical example of the curse of dimensionality. The results show that there are meaningful relationships between dimensionality and some specific complexities inherent to the data (especially class separability and the geometry of manifolds). The behavior of three classifiers as a function of dimensionality and data complexities is also discussed.

Full Text