Abstract

Heuristic Methods for Finding Pathogenic Variants in Gene Coding Sequences

Monique Ohanian, Robyn Otway, and Diane Fatkin

Monique Ohanian: Molecular Cardiology Division, Victor Chang Cardiac Research Institute, Sydney, New South Wales, Australia
Robyn Otway: Molecular Cardiology Division, Victor Chang Cardiac Research Institute, Sydney, New South Wales, Australia
Diane Fatkin: Molecular Cardiology Division, Victor Chang Cardiac Research Institute, Sydney, New South Wales, Australia; Cardiology Department, St Vincent's Hospital, Sydney, New South Wales, Australia; Faculty of Medicine, University of New South Wales, Sydney, New South Wales, Australia

Journal of the American Heart Association, Vol. 1, No. 5. 2012;1:e002642. Open Access, Research Article. Originally published 26 Sep 2012. https://doi.org/10.1161/JAHA.112.002642

Introduction

These are exciting times, with a plethora of new technologies that are expediting discovery of the genetic underpinnings of human disease. Comprehensive resequencing of the human genome is now feasible and affordable, allowing each person's entire genetic makeup to be revealed.
The major focus of attention in genetics studies has been the small portion (1%) of the human genome that comprises the protein-coding sequences in genes (the “exome”), and the majority of causal disease-associated variants identified to date have been located in these regions [1]. A remarkable extent of genetic variation in the protein-coding regions has been found, with at least 20 000 single-nucleotide polymorphisms (SNPs) present even in normal healthy subjects [2, 3]. Half these SNPs are nonsynonymous changes that result in an amino acid substitution that could potentially affect protein function. The greatest challenge now facing investigators is data interpretation and the development of strategies to identify the minority of gene-coding variants that actually cause or confer susceptibility to disease. To address this problem, bioinformatics tools have been developed to predict the likelihood of pathogenicity. A bewildering array of options is available, and users need to be aware of the programs most suited to their needs as well as the strengths and weaknesses of the various methods employed.

Here, we provide an introductory overview of some commonly used pathogenicity prediction programs as well as a set of illustrative cardiac examples. This article is tailored for readers who are not bioinformatics experts and is relevant to cardiovascular researchers undertaking human genetics studies as well as to clinicians performing genetic testing. For comprehensive reviews of available methods [4–8], detailed technical explanations of the bioinformatics and validation of individual programs [9–21], and comparative analyses in large variant data sets [22–28], we refer the reader to excellent articles published elsewhere.
The important “take-home” message is that although bioinformatics prediction programs are extremely useful, their results cannot be taken at face value: all programs have inherent limitations, and additional supporting evidence is required to confirm that predicted deleterious variants have a role in disease processes.

Importance of Gene Coding Sequence Variants in Human Disease

The Human Gene Mutation Database (HGMD) [1] currently lists more than 120 000 variants in more than 4400 genes that have been associated with human diseases. Disease-associated variants include nonsense variants (changes that replace an amino acid codon with a stop codon), variants that create or abolish splice donor or acceptor sites, and insertions or deletions (indels) that shift the protein reading frame. All these types of variants have a high probability of altering protein function. Missense SNPs (which change an amino acid but do not create a stop codon) are far harder to interpret because of the range of effects they can have. Missense SNPs in critical residues can have disastrous consequences on protein function or structure. However, missense SNPs may be benign when the amino acid is replaced by another with similar biochemical properties, when the substitution occurs in an evolutionarily nonconserved position, or when the residue is not in a critical structural or functional domain of the protein. The average white individual has ≈10 000 missense SNPs in their exome, of which ≈200 are novel [3]. Experimentally elucidating the consequences of each variant using in vitro studies and animal models is the best way to demonstrate functional effects, but this is impractical on a large scale.
Reliable and high-throughput methods for evaluating missense SNPs are clearly required.

Steps in Sequence Analysis

A number of different strategies may be used in genetics studies, and the choice of method depends on the population under investigation and the specific questions being addressed. Studies of Mendelian traits in large family kindreds have traditionally involved linkage analysis to define a chromosomal disease locus, followed by resequencing of candidate genes that are located within the interval. In cohorts of small families in which linkage analysis is not feasible, resequencing of selected candidate genes is often performed. These approaches have led to the discovery of numerous disease genes for a wide range of cardiac (and extracardiac) disorders and have provided a basis for commercial genetic testing (discussed in a later section). Whole-genome and whole-exome massively parallel sequencing platforms are now rapidly gaining popularity for discovery of new disease genes and for identification of variants in known disease genes in families. In cohorts of unrelated patients, resequencing of single genes and genome-wide association studies with SNP arrays have been used to look for rare and common variants that affect disease risk. Although cost is still a factor in large cohort studies, next-generation sequencing will undoubtedly be used increasingly in this setting.

Irrespective of the sequencing method used, the principles of sequence analysis are essentially the same (Figure 1). First, the sequencing output needs to be aligned to a human reference assembly to determine whether there are any differences from the “normal” sequence and to determine the location of variations (gene exon, gene intron, intergenic). Second, the potential effects of variants on the encoded protein need to be determined (eg, nonsynonymous or synonymous amino acid substitution, splice variant, indel, etc).
Third, publicly available databases, such as dbSNP, 1000 Genomes, and the Exome Sequencing Project, are searched to determine whether variants are novel or have been previously reported and to establish the prevalence of the variant allele; in some cases, a cohort of healthy control DNA samples may also be genotyped. Some inferences then need to be made about potential functional effects. For cardiovascular diseases, variants in genes that are expressed in the heart or vasculature and that have relevant functions for the trait under study can be prioritized. However, it is important not to disregard the possibility that cardiac expression or function of some genes may not be recognized. Even after these filtering methods are employed, a long list of “suspicious” variants is likely to remain, and prediction tools have a key role in short-listing these for further analysis. Bioinformatics tools are heuristic, that is, they combine various types of parameters from multiple sources to infer likely pathogenicity when detailed experimental evaluation of individual variants is unavailable.

Figure 1. Flow chart showing steps for DNA sequence analysis. ESP indicates Exome Sequencing Project; 1000G, 1000 Genomes project.

Prediction Methods Available

In this review, we have looked at 8 of the currently available prediction tools for nonsynonymous variants to highlight aspects of how these types of programs work and their relative performance. The methods used and parameters assessed in these 8 programs are summarized in Table 1, with some useful notes about inputs and outputs in Table 2.

Table 1. Characteristics of 8 Commonly Used Gene Variant Functional Prediction Programs

| Program | Web Site | Method | Parameters Used | Training Data | Reference |
|---|---|---|---|---|---|
| PANTHER | http://www.pantherdb.org/ | Hidden Markov Model | Evolutionary conservation across multiple protein families | Disease-associated mutations from HGMD; presumed neutral variants in dbSNP | 9, 10 |
| SIFT | http://sift.jcvi.org/ | Conservation of protein homologues | Evolutionary conservation | 1 retroviral + 2 bacterial mutagenesis data sets; 5218 human disease-associated SNPs in Swiss-Prot; 3084 SNPs in dbSNP | 11, 12, 13 |
| Align-GVGD | http://agvgd.iarc.fr/ | GV, GD | Evolutionary conservation + biochemical properties (amino acid composition, polarity, volume) | Concurrence of unclassified variants with deleterious mutations in BRCA1; 1514 nonsynonymous SNPs in TP53 gene | 14, 15 |
| PMut | http://mmb2.pcb.ub.es:8080/PMut/ | Neural network | Evolutionary conservation + structural effects (secondary structure and solvent accessibility) | 9334 human disease-associated mutations in 811 proteins from Swiss-Prot; 11 372 neutral variants from Escherichia coli mutagenesis data set + 811 mutation-associated proteins | 16 |
| SNPs3D | http://www.snps3d.org/ | Support vector machine | Evolutionary conservation + structural effects (protein folding) | Monogenic disease data from HGMD; 10 263 disease SNPs in 731 genes; 16 682 control SNPs | 17, 18 |
| PolyPhen-2 | http://genetics.bwh.harvard.edu/pph2/ | Naive Bayes classifier | Evolutionary conservation + structural effects (a) | 2 training models: Hum Div (3155 Mendelian disease-causing variants in UniProt; 6321 presumed nondamaging SNPs) and Hum Var (13 032 human disease-causing mutations from UniProt; 8946 common human nsSNPs with no link to disease) | 19 |
| MutPred | http://mutpred.mutdb.org | Random forest | Evolutionary conservation (b) + structural effects (c) + predicted functions | 26 655 disease-associated mutations in HGMD; 23 426 presumed neutral SNPs in Swiss-Prot | 20 |
| SNPs&GO | http://snps-and-go.biocomp.unibo.it/snps-and-go/ | Support vector machine | Evolutionary conservation + local sequence + gene ontology score | 16 330 disease-associated SNPs from Swiss-Prot; 17 432 presumed neutral SNPs from Swiss-Prot | 21 |

GD indicates Grantham deviation; GV, Grantham variation; HGMD, Human Gene Mutation Database [1]; MSA, multiple sequence alignment; SNP, single-nucleotide polymorphism.
(a) PolyPhen-2 uses 8 sequence-based and 3 structure-based features, including position-specific independent count score of wild-type allele, differences in this score between the wild-type and variant alleles, number of residues observed at the position in the MSA, residue side-chain volume change, variant position with respect to a protein domain defined by Pfam, variant allele congruency to MSA, sequence identity with closest homologue deviating from wild-type allele, normalized accessible surface area of amino acid residue, crystallographic β-factor, and change in accessible surface area propensity for buried residues.
(b) SIFT score, Pfam profile score, and transition frequency (likelihood of observing a given SNP in the UniRef80 database and Protein Data Bank).
(c) Predicted secondary structure, solvent accessibility, transmembrane helices, coiled-coil structure, stability, B-factor, and intrinsic disorder.

Table 2. Input and Output Characteristics for 8 Common Prediction Algorithms

| Program | Input | Access to Intermediate Information | Output | Program-Recommended Pathogenicity Criteria |
|---|---|---|---|---|
| PANTHER | WT protein sequence (FASTA or plain format), variant/s of interest; MSA is program generated | MSA (and phylogenetic tree) | subPSEC score: 0 (benign) to −10 (most deleterious); Pdel: 0 (0%) to 1.0 (100%) | subPSEC score: <−3 (50% likelihood of deleterious effects); Pdel >0.5 |
| SIFT | WT protein sequence (FASTA format) or Clustal-formatted MSA (WT query sequence must appear first in MSA), variant/s of interest; MSA is program or user generated | MSA (if single query sequence inputted) | Scaled probability score: 0 (most deleterious) to 1 (benign); no. sequences at position; median sequence conservation | Scaled probability score: <0.05 |
| Align-GVGD | FASTA-formatted MSA (a) (WT query sequence must appear first in MSA), variant/s of interest | No | Combined GV+GD risk estimate: C0 (lowest risk) to C65 (highest risk); individual GV and GD scores | Incremental risk estimates: 1.0-fold (C0) to >4.0-fold (C65) |
| PMut | WT protein sequence (FASTA or plain format), or FASTA-formatted MSA (WT query sequence must appear first in MSA), variant/s of interest | PSI-BLAST raw output (protein family analysis), MSA (FASTA format), PHD raw output (secondary structure and accessibility predictions) | Qualitative prediction: neutral or pathogenic; pathogenicity index: 0 (low) to 1.0 (high); reliability: 0 (low) to 9 (high) | Pathogenicity index: >0.5; reliability: >5 |
| SNPs3D | dbSNP, RefSNP or sequence accession number (if variant not present in results list, select protein accession and enter mutation manually); MSA is program generated | MSA | SVM score: positive (nondeleterious) or negative (deleterious) | Negative SVM score |
| PolyPhen-2 | WT protein sequence (FASTA format) or protein identifier, variant position, WT and variant amino acids; MSA is program generated unless downloaded stand-alone version used to input user-generated MSA | MSA, 3D visualization (if protein structure information available) | Qualitative prediction: benign, possibly damaging, probably damaging; Hum Div/Hum Var scores: 0 (benign) to 1.0 (most deleterious); sensitivity: 0 (low) to 1.0 (high); specificity: 0 (low) to 1.0 (high) | Probably damaging prediction; HD/HV scores: closer to 1 |
| MutPred | WT protein sequence in FASTA format, variant/s of interest; MSA is program generated | No | “g” score: 0 (low) to 1 (high); “p” score: 0 (low) to 1 (high) | Possibly deleterious (g>0.5), probably deleterious (g>0.75) |
| SNPs&GO | UNIPROT accession number, variant position, WT and variant amino acids; MSA is program generated | No | Qualitative prediction: neutral or disease related; reliability index: 0 (unreliable) to 10 (reliable) | Disease prediction; reliability index: >5 |

General (“g”) score indicates probability that an amino acid substitution is deleterious; MSA, multiple sequence alignment; property (“p”) score, statistical likelihood (P value) that structural and functional properties will be altered; Pdel, deleterious probability; PHD, Profile fed neural network systems from Heidelberg; PSI-BLAST, Position-Specific Iterated Basic Local Alignment Search Tool; subPSEC, substitution position-specific evolutionary conservation score, estimated from the negative logarithm of the probability ratio of wild-type and mutant amino acids at a specific position; WT, wild type.
(a) Except for 7 tumor-related genes in program library.

Genome sequences that are highly conserved during evolution are thought to be important for protein function, and disease-associated mutations tend to be abundant at these sites [4, 5]. Many programs, including PANTHER (Protein Analysis Through Evolutionary Relationships) [9, 10] and SIFT (Sorts Intolerant From Tolerant amino acid substitutions) [11, 12, 13], rely primarily on the extent of sequence conservation of a specific residue, which is assessed by looking at an alignment of the sequences of this region of the protein across a wide range of different species, that is, a multiple sequence alignment (MSA). Many programs take factors in addition to evolutionary conservation into consideration. Align-GVGD [14, 15] also looks at the effects that an amino acid substitution would have on the biochemical properties of a residue, such as changes in volume, polarity, and charge. The Grantham Variation (GV) score component of Align-GVGD reflects the extent of biochemical variation among amino acids at a given position within an MSA, whereas the Grantham Deviation (GD) score reflects the biochemical distance between variant and wild-type amino acids at a given residue.
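The GV and GD components described above can be made concrete with a small sketch. It is not the Align-GVGD implementation: it assumes one common formulation in which GV is a Grantham-style distance spanning the range of properties observed in an alignment column, and GD is the distance from the variant residue to the nearest edge of that range (for an invariant column this reduces to the distance between variant and wild-type residues). The property values and weights are approximate figures after Grantham's amino acid difference formula and should be checked against the original before reuse; the alignment columns are hypothetical.

```python
import math

# Approximate (composition, polarity, volume) values after Grantham (1974).
# Only a few residues are listed; this is an illustrative subset.
GRANTHAM_PROPS = {
    "R": (0.65, 10.5, 124.0),
    "K": (0.33, 11.3, 119.0),
    "Q": (0.89, 10.5, 85.0),
    "N": (1.33, 11.6, 56.0),
    "D": (1.38, 13.0, 54.0),
    "E": (0.92, 12.3, 83.0),
}
WEIGHTS = (1.833, 0.1018, 0.000399)  # weights for composition, polarity, volume
SCALE = 50.723                       # scaling constant of the Grantham distance


def _distance(deltas):
    """Weighted Euclidean distance over the three property differences."""
    return SCALE * math.sqrt(sum(w * d * d for w, d in zip(WEIGHTS, deltas)))


def _ranges(column):
    """(min, max) of each property over the residues seen in one MSA column."""
    props = [GRANTHAM_PROPS[aa] for aa in column]
    return [(min(p[i] for p in props), max(p[i] for p in props)) for i in range(3)]


def grantham_variation(column):
    """GV: biochemical spread among residues observed at the position."""
    return _distance([hi - lo for lo, hi in _ranges(column)])


def grantham_deviation(column, variant):
    """GD: distance from the variant residue to the observed property range."""
    v = GRANTHAM_PROPS[variant]
    gaps = [max(0.0, v[i] - hi, lo - v[i]) for i, (lo, hi) in enumerate(_ranges(column))]
    return _distance(gaps)


# Hypothetical columns: an invariant arginine, and a position that tolerates R, K, and Q.
invariant = ["R", "R", "R", "R"]
variable = ["R", "K", "Q", "R"]
print(grantham_variation(invariant), grantham_deviation(invariant, "Q"))
print(grantham_variation(variable), grantham_deviation(variable, "Q"))
```

In this toy setting, an arginine-to-glutamine change scores GV = 0 with a positive GD at the invariant position, but GD = 0 at the variable position because glutamine already lies within the observed range.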
Several programs, including PMut [16], SNPs3D [17, 18], and PolyPhen-2 [19], use varying combinations of sequence-based and protein structure-based features, such as the effect of a variant on protein folding and accessible surface area of the amino acid residue. MutPred [20] is an extension of SIFT that differs most significantly from other programs by its incorporation of predicted functional sites, including DNA-binding residues, catalytic residues, calmodulin-binding targets, and predicted posttranslational modification (phosphorylation, methylation, ubiquitination, glycosylation) sites. A broad range of additional parameters is also included in SNPs&GO [21], with evaluation of evolutionary data from PANTHER, the sequence environment of a residue (including 18 residues on either side of the variant residue), and a gene ontology (GO) score that derives information about the biological processes, cellular components, and molecular functions of gene products in different species from the GO database. These prediction tools have been benchmarked on large mutation data sets, and although developed for use in classifying human mutations, some of these programs can be applied to bacteria, plants, and other organisms [29].

Example Variants

To further illustrate some of the features of these programs, we used them to make predictions about 18 missense variants that we selected as examples, including 9 rare variants that have robust genetic or functional evidence to implicate them as disease causing in various cardiomyopathies and arrhythmias [30–37] and 9 common variants implicated in disease susceptibility [38–46] (Table 3). The results of these predictions are shown in Table 4. For the 9 rare variants, the number of variants that were accurately predicted as likely to be deleterious ranged from 2 using PANTHER (22%, although predictions could be made for only 4 variants) to 8 (89%) with SIFT, PolyPhen-2, MutPred, and SNPs&GO.
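As a rough illustration of how such tallies arise, the numeric program-recommended criteria listed in Table 2 can be written as threshold rules and the resulting calls counted. This is a sketch only: the dictionary keys and function names are made up for illustration, only criteria with a single numeric cutoff are included (the qualitative PolyPhen-2 and SNPs&GO calls and the PMut reliability requirement are left out), and any scores passed in are hypothetical.

```python
# Numeric cutoffs taken from the "Program-Recommended Pathogenicity Criteria"
# column of Table 2. Keys are illustrative labels, not program identifiers.
CRITERIA = {
    "PANTHER subPSEC": lambda s: s < -3,
    "SIFT": lambda s: s < 0.05,
    "PMut index": lambda s: s > 0.5,
    "SNPs3D SVM": lambda s: s < 0,
    "MutPred g": lambda s: s > 0.5,
}


def call(program, score):
    """Turn a raw score into a call; None means no prediction was returned."""
    if score is None:
        return "no prediction"
    return "deleterious" if CRITERIA[program](score) else "neutral"


def tally(calls):
    """Count deleterious calls and express them as a percentage of all variants."""
    n_deleterious = sum(1 for c in calls if c == "deleterious")
    return n_deleterious, round(100 * n_deleterious / len(calls))


# Hypothetical SIFT scores for three variants.
print([call("SIFT", s) for s in (0.01, 0.30, None)])

# 2 of 9 and 8 of 9 deleterious calls give the 22% and 89% quoted in the text.
print(tally(["deleterious"] * 2 + ["neutral"] * 2 + ["no prediction"] * 5))
print(tally(["deleterious"] * 8 + ["neutral"]))
```

Keeping variants without a prediction in the denominator matches the way the PANTHER figure is reported here (2 of 9, not 2 of 4).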
The greatest variability was seen with 2 programs, PANTHER and Align-GVGD, and 3 variants, R403Q MYH7, R92Q TNNT2, and D175N TPM1. For the 9 common variants, with a few exceptions, predictions were overwhelmingly neutral. A closer examination of the factors on which the predictions are based helps to explain these results.

Table 3. Nonsynonymous Variants Associated With Cardiac Disorders

| Gene | Protein | Variant | Location | Clinical Association | Genetic Evidence | Functional Evidence | Reference |
|---|---|---|---|---|---|---|---|
| Rare variants | | | | | | | |
| LMNA | Lamin A/C | N195K | Coiled-coil rod domain | DCM | Family | Yes | 30 |
| MYH7 | β-Myosin heavy chain | R403Q | Myosin head, interacts with actin | HCM | Family | Yes | 31 |
| MYH7 | β-Myosin heavy chain | S532P | Actin-binding domain | DCM | Family | Yes | 32 |
| TNNT2 | Cardiac troponin T | R92Q | α-Tropomyosin-binding domain | HCM | Family | Yes | 33 |
| TNNT2 | Cardiac troponin T | R141W | α-Tropomyosin-binding domain | DCM | Family | Yes | 34 |
| TPM1 | α-Tropomyosin | D175N | Troponin T–binding domain | HCM | Family | Yes | 35 |
| KCNQ1 | KCNQ1 | S140G | S1 transmembrane domain | AF | Family | Yes | 36 |
| KCNQ1 | KCNQ1 | Y315S | Pore-forming domain | LQTS | Family | Yes | 37 |
| KCNH2 | HERG | G628S | Pore-forming domain | LQTS | Sporadic | Yes | 37 |
| Common variants | | | | | | | |
| MYH6 | α-Myosin heavy chain | A1101V | Coiled-coil rod domain | HR, PR | Case–control | No | 38 |
| AGT | Angiotensinogen | M235T | Polypeptide chain | HT | Case–control | Yes | 39 |
| NOS3 | Endothelial NO synthase | E298D | NOSIP interaction region | AF, CAD | Case–control | Yes | 40 |
| KCNH2 | HERG | K897T | Intracellular C-terminal domain | LQTS, AF | Case–control | Yes | 41, 42 |
| KCNE1 | KCNE1 | S38G | Extracellular N-terminal domain | AF | Case–control | Yes | 43 |
| SCN5A | Cardiac sodium channel | H558R | Intracellular repeat I/II linker | AF | Case–control | Yes | 44 |
| ADRB1 | β1-adrenergic receptor | S49G | Extracellular N-terminal domain | HR, DCM | Case–control | Yes | 45 |
| ADRB1 | β1-adrenergic receptor | G389R | Intracellular C-terminal domain | HF, AF | Case–control | Yes | 45 |
| CYP2C9 | Cytochrome P450 2C9 | I359L | Substrate recognition site 5 | Warfarin dose | Case–control | Yes | 46 |

AF indicates atrial fibrillation; CAD, coronary artery disease; DCM, dilated cardiomyopathy; HCM, hypertrophic cardiomyopathy; HF, heart failure; HR, heart rate; HT, hypertension; LQTS, long QT syndrome; NO, nitric oxide; NOSIP, eNOS interacting protein; PR, PR interval.

Table 4. Predicted Effects* of Rare and Common Nonsynonymous Variants

Key Role of Amino Acid Conservation in Predicting Pathogenicity

As noted above, sequences that are highly conserved across species are often functionally important, and high prediction success has been achieved for algorithms that predominantly use evolutionary-based information [9–13]. Sequence-based methods do have their limitations [47], and this is demonstrated by the predictions generated by PANTHER and Align-GVGD. Although PANTHER is generally reliable when predictions are obtained [26], it failed to generate predictions for 6 of the 18 variants in our example data set. This may occur if the sequence alignment is poor or when a variant is located at a residue that is not present in a majority of species and hence cannot be modeled in a Hidden Markov Model. In Align-GVGD, we found wide discordance between sequence conservation (GV) and biochemical change (GD) components for several variants that resulted in a neutral prediction. Sequence conservation appeared to have relatively less weighting than biochemical change because neutral predictions were more likely to be obtained when the GV scores were high and the GD scores were zero (eg, R403Q MYH7, S532P MYH7), rather than the converse situation with low GV and high GD scores (eg, N195K LMNA, Y315S KCNQ1). As a general concept, adding protein structural or functional parameters should provide greater predictive accuracy than consideration of sequence conservation alone [27], but this only applies when protein structure or function is known and the relevant databases are up to date.
Quite commonly, this information is incomplete or lacking, and the predictions have to rely predominantly on the evolutionary conservation component.

The Importance of MSAs in Predictions

The number of species in an MSA and the evolutionary distance between them heavily influence algorithm accuracy. Evolutionary depth in MSAs is recommended because this potentially provides more information about the extent of conservation. If sequences in the MSA are too similar (eg, dog, pig, human), then variants that have no functional consequence for the protein will tend to be classified as pathogenic. On the other hand, comparing a broader range of species, such as small rodents (rat, mouse), zebrafish, fly, and worm, may strengthen the case for a variant in a highly conserved residue being pathogenic, but may also produce false negatives if there is divergence in the protein sequences and biological functions of more distantly related species [7]. Similarly, there are no clear indications about whether inclusion of different protein isoforms and different members of the same protein family will strengthen or weaken predictions. In 1 comparative study, PolyPhen-2 appeared to be least susceptible to differences in the MSAs, whereas Align-GVGD was highly susceptible and had a propensity to call variants as neutral when large numbers of sequences were utilized [27]. It has been noted that programs do not always perform best with their own program-generated MSA and can have more accurate results with gene-specific MSAs that have been optimized by the user.

PANTHER, SNPs3D, MutPred, and SNPs&GO generate MSAs internally and do not allow users to create and submit their own MSAs. SIFT and PMut internally generate an alignment but also permit user-generated alignments.
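The dependence of apparent conservation on alignment depth, noted at the start of this section, can be illustrated with a toy calculation. The sketch below scores each alignment column as 1 minus its normalized Shannon entropy; this is a generic conservation measure assumed for illustration, not the scoring used by any of the 8 programs, and the sequences are invented (gaps are not handled).

```python
import math
from collections import Counter


def column_conservation(column):
    """Return 1.0 for an invariant column, lower values for variable ones."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    max_entropy = math.log2(min(n, 20))  # at most 20 amino acids or n sequences
    if max_entropy == 0:
        return 1.0
    return 1.0 - entropy / max_entropy


def conservation_profile(msa):
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation(col) for col in zip(*msa)]


# Hypothetical 5-residue alignments: three closely related species, then the
# same three plus two distant ones.
shallow = ["MKRLS", "MKRLS", "MKRLS"]
deep = shallow + ["MKRIT", "MRRVA"]

print(conservation_profile(shallow))
print(conservation_profile(deep))
```

In the shallow alignment every position looks fully conserved, so any substitution would appear to hit a conserved residue; only the deeper alignment separates the invariant positions from those that tolerate change.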
The Web-server version of PolyPhen-2 has its own alignment pipeline, but user-generated alignments can be submitted to the stand-alone software version, which can be downloaded onto a local computer. Align-GVGD has a very limited set of alignments, so users mostly need to supply their own. This enables greater control of user-defined sequences in the alignment and flexibility of adding or removing sequences in the MSA, but entails considerable additional work to obtain and align the relevant protein sequences. There is also the real possibility of skewing the results by variations in the numbers and types of species selected to be included in the MSA.

MSAs can be obtained from the Pfam (protein families) database [48], or sequences can be manually curated and then aligned using freely available online alignment tools such as the more widely used programs ClustalW2 [49], MAFFT [50], MUSCLE [51], PROMALS [52], and T-Coffee [53]. Alignments produced by the different programs for specific regions can differ, however, and it has been suggested that more than 1 MSA program may be required, particularly for sequences that contain deletions or insertions. A number of scoring systems have been devised to assess the quality of MSAs, with the overall conclusion that, as with the prediction programs, no single flawless method is available [54, 55, 56].

Location, Location, Location

Significant discrepancies between bioinformatics predictions and experimentally validated effects often arise because the functional characteristics of the region in which a variant is located are inadequately taken into account. Amino acid changes that have modest pathogenicity predictions may nevertheless have a substantial impact if they occur in critical regions of a protein, such as those involved in protein–protein interactions or posttranslational modification.
Conversely, variants predicted to be pathogenic because of extensive biophysical modification of a residue may have no effects if this occurs in a relatively unimportant region. Although these issues are addressed in part by MutPred and SNPs&GO, which incorporate some functional parameters, lack of consideration of gene-specific functional effects is a universal limitation.

Examples of the importance of the protein “neighborhood” are provided by the R403Q MYH7, R92Q TNNT2, and D175N TPM1 variants. The Arg403Gln mutation in the gene encoding myosin heavy chain (MYH7) causes hypertrophic cardiomyopathy in humans and in mice [31]. The R403 residue is located in the myosin head adjacent to the actin-binding site and is invariant in myosin heavy chains in the heart and other tissues across a range of species from human to amoeba [31]. Although this high degree of sequence conservation and the biophysical effects of loss of an arginine can be assessed by the prediction algorithms, none of the programs would have considered the key role of the 403 residue in actin–myosin interaction, calcium sensitivity, and energy utilization. A similar argument can be made for the R92Q TNNT2 variant, which is in the elongated tail domain of cardiac troponin T at the site where the tropomyosin monomers overlap. This variant has been shown to have distinct effects on calcium sensitivity and thin filament sliding speed in vitro and results in a hypertrophic cardiomyopathy phenotype in mice [32], yet only 4 of the 8 programs used predicted it to be probably (n=3) or possibly (n=1) deleterious.
The D175N TPM1 variant, located in the troponin T–binding site in tropomyosin, was also identified by only 5 of the 8 programs as probably (n=4) or possibly (n=1) deleterious despite robust genetic and in vivo functional evidence of pathogenicity [35].

Rare Versus Common Variants

Genetic variation is being recognized increasingly to play a role in many cardiovascular disorders [57, 58]. At one end of the spectrum, single-gene variants that have a large functional effect have been considered sufficient to cause disease in families with Mendelian patterns of inheritance. These variants are typically rare in the general population, and many are “private” mutations seen only in 1 family. Single rare variants have been associated with numerous heritable cardiomyopathies and arrhythmias, including familial hypertrophic cardiomyopathy, familial dilated cardiomyopathy, arrhythmogenic right ventricular cardiomyopathy, and long QT syndrome. In contrast, commonly occurring genetic variants have been associated with complex traits such as hypertension, coronary artery disease, diabetes, and atrial fibrillation (the common disease, common variant hypothesis). Common SNPs can be identified by genome-wide association studies in large cohorts of affected and unaffected individuals. These types of variants are potentially important because of their relatively high population frequencies, although the risks associated with each variant may only be modest. Recently, human genome sequencing studies have heightened interest in the potential role of rare variants in common diseases [3, 59–63]. A new paradigm has been proposed in which the cumulative burden of unique personal combinations of rare variants may contribute substantially to the heritable component of complex disease.

These perspectives on the role of genetics need to be kept in mind when considering the performance of gene variant functional predictions.
A striking finding in our example variants was the difference between predictions for rare and common variants. Whereas the known functional rare variants were correctly predicted by a majority of programs as deleterious, the common variants were mostly predicted as nondeleterious. There are several factors that might explain this discrepancy. First, it is important to note that common SNPs that show significant associations with disease in genome-wide association studies are almost never the causal variants themselves but are markers for a pathogenic SNP that is coinherited in the same haplotype. For example, A1101V MYH6 was significantly associated with
