Recently, the world-wide human genome-related projects have been vigorously launched and implemented. Gene-sequencing techniques play a critical role in disease diagnosis, prediction, and population stratification relying on efficiently mining genetic features in the gene pool. Exploring the association between the sites of the genetic mutation and the disease-based population classification becomes a hot topic, which beneficially supports disease diagnosis and treatment on the molecular level. However, there are numerous variable sites even on a single chromosome in the human gene pool, and hence, the traditional classifiers are not able to dig out all single nucleotide polymorphism (SNP) sites without clearly excavating the characteristic SNP sites, termed tagSNPs, in SNP clusters. By applying big data mining techniques, in this paper, we, first of all, propose a principal component analysis-based algorithm for reducing the gene data dimension in order to cluster SNP sites in the low-dimensional space. Moreover, an oriented graph theory-based tagSNPs selection algorithm is designed. Finally, relying on the real-world 1000 Genomes Project dataset, we can achieve fewer tagSNPs than the traditional methods by invoking the complete process of our designed SNP classifier.
Read full abstract