Recent technological advances may soon enable the study of hundreds of thousands of human single-nucleotide polymorphisms (SNPs) at the population level. Because strategies for analyzing these data have not kept pace with the laboratory methods that generate the data, however, it is unlikely that these advances will immediately lead to an improved understanding of the genetic contribution to common human diseases. In addition, the underlying genetics of common diseases such as sporadic breast cancer or essential hypertension is far more complex than that of rare mendelian diseases such as cystic fibrosis and sickle cell anemia. As a result, several important technical challenges will need to be overcome to identify susceptibility genes that can be used to improve the prevention, diagnosis, and treatment of common diseases. These challenges include developing statistical methods to analyze genetic data, selecting appropriate genetic variables, and interpreting interactions between individual genes.
Although specific DNA sequence variations have been linked to a variety of rare diseases, they have not been as informative for predicting the onset of more common conditions. This difference is illustrated by comparing familial (rare) and sporadic (common) forms of breast cancer. Women with a strong family history of breast cancer, for example, can be tested for specific mutations in the BRCA1 and BRCA2 genes, which result in a 50% chance of developing the disease. The risk for sporadic breast cancer, however, cannot adequately be predicted by DNA sequence variations alone. Similar to other common diseases, the underlying genetic etiology of sporadic breast cancer probably involves many genes, each of which influences susceptibility primarily through nonadditive interactions with other genes (termed “epistasis”) and with environmental factors.
It is possible that interactions between genes are ubiquitous in the underlying etiology of most common diseases, given the complex molecular interactions that occur during biological processes such as transcription, translation, and signal transduction. Knowledge about DNA sequence variations from many different genes, in the context of environmental exposure, may thus be necessary to apply genetic information to human health. This problem suggests several challenges in identifying susceptibility genes from the entire human genome.
First, powerful statistical and computational methods will need to be developed to model the relationship between combinations of SNPs and disease susceptibility. Due to the large number of potential genotypes, analyzing SNP combinations is a far more difficult task than assessing each SNP individually. The difficulty increases exponentially with the number of SNPs under consideration. For example, while a single SNP with 3 genotypes has only 3 categories, 2 SNPs with 3 genotypes will have 9 possible 2-locus genotype combinations. With 3 SNPs, the number of combinations increases to 27. As the number of possible combinations increases, it may become impossible to recruit enough subjects into epidemiological studies to represent every possible genotypic combination. This problem has been referred to as the “curse of dimensionality.”
This limitation may be partially addressed with statistical and modeling approaches. Traditional parametric statistical approaches (ie, methods that compute estimates of population parameters) such as logistic regression do not deal with the dimensionality problem very effectively and are thus not well suited to detecting and characterizing gene-gene interactions. This is due to the inaccuracy of parameter estimates when there are too many variables in relation to the amount of data.
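The growth in genotype combinations described above can be illustrated with a short Python sketch. It assumes 3 genotypes per SNP, as in the text; the function names and the study size of 1000 subjects are hypothetical and chosen only for illustration.

```python
def genotype_combinations(n_snps: int, genotypes_per_snp: int = 3) -> int:
    """Number of multilocus genotype combinations for n_snps SNPs."""
    return genotypes_per_snp ** n_snps


def mean_subjects_per_combination(n_subjects: int, n_snps: int,
                                  genotypes_per_snp: int = 3) -> float:
    """Average number of subjects per combination if subjects were
    spread evenly (an optimistic simplification; real genotype
    frequencies are unequal, so many cells would be sparser still)."""
    return n_subjects / genotype_combinations(n_snps, genotypes_per_snp)


if __name__ == "__main__":
    n_subjects = 1000  # hypothetical study size, not from the text
    for n_snps in (1, 2, 3, 5, 10):
        cells = genotype_combinations(n_snps)
        per_cell = mean_subjects_per_combination(n_subjects, n_snps)
        print(f"{n_snps:2d} SNPs: {cells:6d} combinations, "
              f"{per_cell:8.3f} subjects per combination on average")
```

Under these assumptions, the average number of subjects per combination falls below 1 well before 10 SNPs are considered jointly, which is one way to see why every genotypic combination cannot be represented in a study of realistic size.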
By contrast, nonparametric “data-mining” methods usually do not require a prespecified statistical hypothesis and thus are better suited to search for trends or patterns in high-dimensional data sets. Although data-mining techniques such as multifactor dimensionality reduction (MDR) and neural networks may be more powerful than parametric statistical approaches, they have their own limitations. Neural network models, for example, can be very difficult to interpret and their results may not be intuitive. Furthermore, data-mining approaches may be influenced by chance patterns in data, which can result in false-positive results.
A second challenge is the selection of genetic variables that should be included for analysis. If complex interactions between genes explain most of the heritability of common diseases, then combinations of SNPs will need to be evaluated from a list of hundreds of thousands of candidates. When single, functional polymorphisms each have a statistically detectable independent effect, each polymorphism can be evaluated individually for an association with disease, followed by an analysis of gene-gene interactions that considers only those polymorphisms. This greatly reduces the number of potential combinations of variables that must be examined. When SNPs do not have independent effects, however, it is impossible for most current computer technologies to analyze the resulting astronomical number of possible combinations. For instance, if 300000 SNPs have been measured at a density of 1 SNP every 10 kilobases (kb), and if 10 statistical evaluations can be computed each second, then evaluation of each individual SNP would require 30000 seconds (ie, 8.3 hours) of computer time. Exhaustive evaluation of the approximately 4 10
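As a rough check on the computing-time arithmetic, the following Python sketch uses the figures given in the text (300000 SNPs, 10 evaluations per second) and counts SNP combinations with the binomial coefficient. The function names are hypothetical, and the pairwise figure is a calculation made here under those assumptions rather than a number quoted from the text.

```python
from math import comb

SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, for a rough estimate


def n_evaluations(n_snps: int, order: int) -> int:
    """Number of distinct SNP combinations of the given order
    (order=1: single SNPs, order=2: pairs, and so on)."""
    return comb(n_snps, order)


def seconds_needed(n_snps: int, order: int,
                   evals_per_second: float = 10.0) -> float:
    """Computer time for an exhaustive search at a fixed evaluation rate."""
    return n_evaluations(n_snps, order) / evals_per_second


if __name__ == "__main__":
    n_snps = 300_000  # figure from the text
    single = seconds_needed(n_snps, 1)
    pairs = seconds_needed(n_snps, 2)
    print(f"single SNPs: {n_evaluations(n_snps, 1):,} tests, "
          f"{single / SECONDS_PER_HOUR:.1f} hours")
    print(f"SNP pairs:   {n_evaluations(n_snps, 2):,} tests, "
          f"{pairs / SECONDS_PER_YEAR:.1f} years")
```

Under these assumptions, single-SNP testing takes about 8.3 hours, whereas the roughly 4.5 × 10^10 pairwise tests would take on the order of 140 years at the same rate, and higher-order combinations grow far faster.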