Abstract

BackgroundSince the single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, the SNPs can be used to identify the correct source population of an individual. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible are required from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify the three populations, i.e., Utah residents with ancestry from Northern and Western Europe (CEU), Yoruba in Ibadan, Nigeria in West Africa (YRI), and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT), from which 100,736 SNPs were obtained and the top 82 SNPs could completely classify the three populations.ResultsIn this paper, we propose to first rank each feature (SNP) using a ranking measure, i.e., a modified t-test or F-statistics. Then from the ranking list, we form different feature subsets by sequentially choosing different numbers of features (e.g., 1, 2, 3, ..., 100.) with top ranking values, train and test them by a classifier, e.g., the support vector machine (SVM), thereby finding one subset which has the highest classification accuracy. Compared to the classification method of Park et al., we obtain a better result, i.e., good classification of the 3 populations using on average 64 SNPs.ConclusionExperimental results show that the both of the modified t-test and F-statistics method are very effective in ranking SNPs about their classification capabilities. Combined with the SVM classifier, a desirable feature subset (with the minimum size and most informativeness) can be quickly found in the greedy manner after ranking all SNPs. Our method is able to identify a very small number of important SNPs that can determine the populations of individuals.

Highlights

  • Since the single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, the SNPs can be used to identify the correct source population of an individual

  • Bafna et al [2] and Halldrsson et al [3] proposed to select a subset of tag SNPs with the minimum size and highest informativeness value calculated from a self-defined informativeness measure, which evaluates how well a single SNP or a set of SNPs predict another single SNP or another set of SNPs within the neighborhoods

  • Eran et al [4] proposed to select the informative SNPs with the maximum prediction accuracy, which is obtained from a prediction accuracy measure evaluating how well the value of an SNP is predicted by the values of only two closest tag SNPs

Read more

Summary

Introduction

Since the single nucleotide polymorphisms (SNPs) are genetic variations which determine the difference between any two unrelated individuals, the SNPs can be used to identify the correct source population of an individual. When any one single nucleotide of A, T, C and G in the genome sequence is replace by one of any other 3 nucleotide, e.g., from AAATCCGG to AAATTCGG, we call this single base variation (C ⇒ T) as a single nucleotide polymorphism (SNP) It has the following three characteristics [1]: 1) very common in the human genome (a SNP occurs every 100 to 300 bases along the 3-billion-base human genome); 2)among the SNPs, two of every three SNPs are the variations from cytosine (C) to thymine (T); 3) very stable from generation to generation. Redundancy was measured by feature similarity between two features, i.e., the linkage disequilibrium (LD) measure γ2 [5]

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.