Over the past few years, single nucleotide polymorphisms (SNPs) have been proposed as the next generation of markers for the identification of loci associated with complex diseases and for pharmacogenetic applications (Lander and Schork 1994; Lander 1996; Risch and Merikangas 1996; Kruglyak 1997; Schafer and Hawkins 1998). SNPs are abundant in the genome, with a density of at least one common (>20% allele frequency) SNP per kilobase pair (Lai et al. 1998; Sachidanandam et al. 2001). They are mostly biallelic ( 1.6 million SNPs in the public databases (Sachidanandam et al. 2001). In this article, I will attempt to summarize what we know about SNPs and identify some of the challenges that await us in the application of SNPs in research and medicine.

The first questions most people would ask are: how many SNPs are there in the human genome, and have we identified most of them? The frequently cited rate of 1 SNP/kb suggests that there are 3 million common SNPs in the human genome. However, recent data indicate that the number is potentially much greater than 3 million. The first indication came from the comparison of the Celera SNP database with the public data. Celera Genomics claimed that their database contained over 3.5 million putative SNPs. However, only 400,000 of these were redundant with the publicly available 1.6 million. The second line of evidence came from our own experiments. We have isolated >1,000 SNPs in a 20-megabase region by resequencing eight individuals (not the same DNA source as the TSC SNPs). The overlap between our SNPs (∼1,000) and the TSC SNPs in this region is ∼5% (instead of the ∼50% expected if the total number of common SNPs were around 3 million). These results suggest that there are potentially 10 million or more common SNPs in the human population.
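The overlap argument above can be illustrated with a rough calculation. The sketch below is not the method used in the article; it applies a naive two-sample capture-recapture (Lincoln-Petersen) estimate to the database sizes quoted in the text, and it assumes that each collection is an independent random sample of the same pool of SNPs. Real SNP discovery violates that assumption (different DNA sources, allele-frequency bias), so the output is only an order-of-magnitude guide. The function names are hypothetical, and using the 1.6 million public SNPs as the reference set for the expected overlap is an assumption, since the text does not give the size of the TSC set.

```python
def expected_overlap_fraction(reference_size, total_snps):
    """Fraction of an independent SNP sample expected to already be in a
    reference set, assuming random sampling from `total_snps` SNPs."""
    return reference_size / total_snps


def lincoln_petersen(n_first, n_second, n_shared):
    """Naive capture-recapture estimate of the total number of SNPs from
    two overlapping collections (an illustrative assumption, not the
    article's method)."""
    if n_shared <= 0:
        raise ValueError("need at least one shared SNP")
    return n_first * n_second / n_shared


# Figures quoted in the text.
PUBLIC = 1_600_000   # SNPs in the public databases
CELERA = 3_500_000   # putative SNPs claimed by Celera
SHARED = 400_000     # Celera SNPs also present in the public set

# Estimated total implied by the Celera/public overlap.
celera_estimate = lincoln_petersen(PUBLIC, CELERA, SHARED)

# Expected overlap of a fresh SNP sample with the public set
# under two hypothetical totals.
overlap_if_3m = expected_overlap_fraction(PUBLIC, 3_000_000)
overlap_if_10m = expected_overlap_fraction(PUBLIC, 10_000_000)

print(f"Celera/public estimate: {celera_estimate:,.0f} SNPs")
print(f"Expected overlap if 3 million total:  {overlap_if_3m:.0%}")
print(f"Expected overlap if 10 million total: {overlap_if_10m:.0%}")
```

Under these assumptions the Celera comparison implies a total on the order of 14 million SNPs, and a total of 3 million would predict an overlap of roughly one half, in line with the ∼50% expectation quoted above. The observed ∼5% overlap is far lower, which is why the data point to a total well above 3 million.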
A theoretical modeling experiment has also predicted that there are more than 10 million SNPs in the genome (Kruglyak and Nickerson 2001).

If there are indeed over 10 million SNPs in the human genome, there are two important implications for the use of SNPs as a genetic tool. The first is that the SNP(s) you are looking for might not have been discovered yet. The second is the need to select a representative set of SNPs out of the 1.6 million to cover the genome. The first problem is a difficult one, since it is impossible to know whether the SNP(s) of interest are present in the current databases. There are two potential solutions. The first is to design experiments that combine SNP discovery and genotyping (Brenner et al. 2000). However, this approach has not been demonstrated for a whole-genome SNP scan and could be costly even if it is technically feasible. The second solution, which addresses both implications mentioned above, is the development of a comprehensive whole-genome SNP marker set that has a high likelihood of detecting the SNP(s) of interest by linkage disequilibrium or association (see section below on marker set development) (Jorde 2000).

So how do we design a marker set that covers the genome as completely as possible? Many proposals and computer models use linkage disequilibrium (LD) as a guide, striking a balance between the number of markers and information content (Kruglyak 1999; Jorde 2000). A number of recent studies have indicated that an average spacing of 30 kb provides a good balance (i.e., 100,000 SNPs for the whole genome) (Collins 1999; Huttley et al. 1999; Goddard et al. 2000; Jorde 2000). In addi

E-MAIL ehl21107@GlaxoWellcome.com; FAX (919) 315-0113. Article and publication are at www.genome.org/cgi/doi/10.1101/gr.192301.
Insight/Outlook