Abstract

Genome-wide genotyping and DNA sequencing has led to the identification of large numbers of genetic variants that are associated with many clinical phenotypes. The functional impacts of most of the variants are unknown. In this article, we review high-throughput assays that have been developed to assess a variety of the functional impacts of the variants. A better understanding of their functions should facilitate the implementation of many more variants in genomic-driven medicine. A cornerstone of precision medicine is the incorporation of genetic information into healthcare decisions. This approach relies on understanding the genome complexity, the genetic differences that exist between individuals, and the functional consequences of the genetic variants. In the personal genome era, improvements in sequencing technologies are leading to continuous identification of new variants and further illustrating the complexity of the human genome and the genetic diversity between populations. Large-scale high-throughput sequencing studies, such as the 1000 Genomes and NHLBI GO Exome Sequencing Projects, have already identified millions of genetic variants among individuals from different populations and have established a comprehensive resource on human genetic variation.1, 2 The genetic variants are cataloged in public databases, such as dbSNP (https://www.ncbi.nlm.nih.gov/snp/) and dbVAR (https://www.ncbi.nlm.nih.gov/dbvar/) (Table 1). The current build of dbSNP (build 147, updated on 14 April 2016) contains ∼154 million single nucleotide variants (SNVs) of which about 101 million have been validated and nearly 89 million are within genes. dbVAR (updated on 28 September 2016) contains ∼5 million structural variants and ∼2.3 million, 1.3 million, and 1.2 million of these variants are contributed by copy number variants, short tandem repeats, and insertions, respectively. In the 1000 Genomes Project, sequencing was carried out on 2,504 individuals from 26 populations in Africa, East Asia, Europe, South Asia, and the Americas. More than 88 million variants were identified, of which 84.7 million were single nucleotide polymorphisms (SNPs), 3.6 million were short indels, and 60,000 were structural variants. Only 8 million of the identified autosomal variants were observed in more than 5% of individuals, while 64 million rare variants (frequency of <0.5%) were identified. Substantial differences exist in the distribution of variants between populations, with 762,000 variants being rare (frequency of <0.5%) in the global population, but more common (>5%) in at least one population group. Eighty-six percent of variants were only present in a single continental group. Sequencing of individuals from South Asian and African populations contributed to 24% and 28%, respectively, of novel variants discovered.1 Sudmant et al.3 reported the identification of 68,818 structural variants when analyzing sequencing data from the 1000 Genomes Project. The majority of these structural variants are deletions (42,279) with a median site size of 2,455 bp and median alleles per individual of 2,788. The nucleotide substitution rate is an important factor underlying the degree of genetic variation between individuals. Scally4 reported a present-day germline mutation rate of 0.5 × 10−9bp−1year−1. This mutation rate translates into ∼30 de novo variants in each offspring that are absent in the parents. The introduction of 30 new DNA variants with every meiosis event over a period of 3.7–6.6 million years (evolution of the human species) and rapid expansion of the human population during the last 10,000 years resulted in the observed enormous diversity of the human genome. For most of the genetic variants, the impact on gene function and the effect on disease susceptibility remains unknown. High-throughput sequencing continues to produce a more accurate estimation of how much genetic variation exists within and between genomes of individuals of different ethnicities. Typically, each genome has 4–5 million sites that differ from the reference human genome; the greatest number of variant sites were observed among individuals of African ancestry. Although SNPs and indels account for >99.9% of variants, the typical genome contains 2,100–2,500 structural variants that affect about 20 million bases of sequence. Deep sequencing allows for the identification of rare variants and an estimated 1–4% of variants (40,000–200,000) observed in a genome are rare (frequency of <0.5%). A typical genome reportedly contained 149–182 sites with protein truncation variants, 10,000–12,000 sites with nonsynonymous variants and 459,000–565,000 variant sites within regulatory regions (untranslated regions, promoters, insulators, enhancers, and transcription factor binding sites). The number of ClinVar variants (those associated with clinical phenotypes) within a typical genome range from 24–30.1 Tennessen et al.2 suggested that 2.3% of SNVs per individual exome are thought to disrupt protein function of about 313 of the 23,500 protein-coding genes and nearly 96% of SNPs predicted to affect gene function are rare. (Figure 1) is a representation of functionally important regulatory and gene regions with the number of variants within these regions for a typical genome. Genome-wide association studies (GWAS) have been used to determine which of the identified variants are associated with diseases. To date, more than 3,200 GWA studies have been conducted (http://www.ebi.ac.uk/gwas) and ∼10,000 common SNPs have been associated with human traits and diseases through GWA studies.5 Gusev et al.6 estimated that ∼80% of phenotypic heritability of common diseases and traits are explained by variants in noncoding regulatory regions. Approximately 2,000 variants per genome have been associated with complex traits in GWA studies.1 However, testing of such a large number of SNPs in a GWA study requires correction for multiple testing to decrease the number of false-positive associations by using very stringent significance thresholds. Bonferroni correction for multiple testing (0.05/number of tests) is often used, but it can result in overcorrection and, thus, miss SNPs that really are associated with the phenotype.7 A large number of study participants are also needed to identify rare causal variants with the use of GWAS.8 Genetic variants impact drug metabolism, efficacy, and adverse event risk and are especially relevant to precision medicine. Fujikura et al.9 analyzed sequencing data from the 1000 Genomes and the NHLBI GO Exome Sequencing Projects; they reported a total of 6,165 SNVs in the 57 cytochrome P450 (CYP) genes. Eighty-three percent of the 4,025 SNPs within the coding regions were very rare (frequency of <0.1%) and 65% were nonsynonymous substitutions. The calculated total number of genetic variations in CYP genes of 1 million Europeans and Africans was 3.4 × 104 and 4.8 × 104, respectively.9 Furthermore, every individual of European descent carries on average 94.6 SNVs in CYP genes, of which 24.6 are nonsynonymous, within splice sites, or affect stop codons.9 In the recent PGRN-seq study, 82 genes of pharmacogenomics relevance were sequenced among 5,639 individuals and 40,549 SNVs identified. Of the identified variants, 8,126 were in coding regions (4,858 missense, 3,169 synonymous, and 99 stop gain variants) and 19,923 were in noncoding regions (5,231 intronic, 5,981 upstream, 3,444 downstream, 4,165 3′UTR, 903 5′UTR, and 199 other variants). The majority (∼96%) of individuals had one or more Clinical Pharmacogenetics Implementation Consortium Level A actionable variants, while ∼23% (n = 1,273) of individuals have a single Level A actionable variant.10 The Human Gene Mutation Database (HGMD) is a repository of mutations associated with diseases; they are based on published literature, including GWA studies, and as of June 2013, had 141,161 germline mutation entries in 5,700 unique genes. Missense substitutions, nonsense substitutions, splicing substitutions, and substitutions within regulatory elements account for 44%, 11%, 9%, and 2% of the total disease-associated mutations, respectively.11 Exome sequencing of healthy individuals revealed that each of these individuals carried 40–110 disease-causing mutations as classified by HGMD.12 The large amount of genomic variation data now available has created a clear need for the functional characterization of many genetic variants; this should help to distinguish disease-causing variants vs. passenger mutations. In most cases, genotype–phenotype association studies are impractical to assess the role of rare genetic variants, as very large patient cohorts are needed to include enough patients with the rare variants to achieve statistical significance. An alternative is to determine the functional impact of the rare variants and then combine variants with similar functional effects into one group. This has been done with previous studies focusing on CYP2D6 by assigning an “activity score” to each patient that is derived based on the patients’ genotype. Many variants, including rare variants, are classified as functional, partially functional, or nonfunctional. The activity score is then calculated and translated into a predicted CYP2D6 phenotype.13, 14 For any other gene, if the functional impact of the variants are known, this approach could be used to simplify the genotype interpretation and facilitate genotype–phenotype association studies. The remainder of this review will discuss the current status of many high-throughput functional screening assays (Table 2). These assays should help distinguish the functional from passenger variants, which will provide valuable information for the successful implementation of genomic-driven medicine. The exome is ∼1% of the human genome and contains 23,500 protein-coding genes with roughly 180,000 exons. Large-scale sequencing studies have focused on identifying genetic variants within the exome, as these variants could alter protein function.1, 2 The number of identified genetic variants differs significantly between populations, with individuals of African descent being more genetically diverse.1 African individuals typically carry 12,200 nonsynonymous and 13,800 synonymous variants. The number of nonsynonymous (10,200) and synonymous (11,200) variants identified among individuals of East Asian or European ancestry were fewer (Figure 1).1 Despite the identification of numerous variants in regions coding for proteins, only a portion of these variants disrupt protein function and are likely disease-causing. The average number of loss-of-function variants per genome ranged from 149 in Europeans to 182 among Africans.1 Missense or nonsense substitutions in protein-coding genes contribute ∼55% of variants implicated in disease.11 Missense and nonsense nucleotide substitutions and frameshift indels alter the amino acid sequences of the proteins, which can lead to altered secondary, tertiary, and quaternary structures of proteins. These alterations can change many characteristics of the proteins, such as thermodynamic stability and cellular localization and, consequently, cellular functions of the protein, such as enzymatic activity, cell signaling, and ligand binding.15 Therefore, it is necessary to distinguish loss-of-function variants that could be disease-causing from neutral indels or nucleotide substitutions. Multiple computational tools have been developed to predict if variants affect protein structure and stability and whether variants in conserved regions are neutral, deleterious, or hyperactivating.16, 17 These currently available tools lack accuracy and, thus, cannot be used in a clinical setting.18 The models used for these prediction tools are limited by the accuracy of annotated variant effects and evolutionary measures in the training data sets. The complex relationship between evolution and phenotypic effect could also result in high false-positive and false-negative prediction rates by these tools.17, 19, 20 The challenges with computational prediction tools can be improved by combining this approach with experimental functional characterization of variants. Large data sets with experimental measures of the phenotypic effects of variants can also be used to confirm predictions or act as training data sets to improve prediction algorithms.21 Mutagenesis studies are commonly used to assess the effect of genetic variants on protein function. A forward genetics approach is time-consuming, because random mutations are created and genes are then identified based on the phenotype that develops. A reverse genetics approach involves the mutagenesis of defined genes, which is followed by functional assays to characterize protein sequence–function relationships; however, these are low-throughput, as the effect of only a small number of variants are assessed. These traditional approaches are also especially laborious as they require the use of a wide variety of techniques and personnel expertise, depending on the function of a specific protein being tested.22 To improve the rate of functional testing, deep mutational scanning has been developed as a high-throughput assay to characterize the function of thousands of variants simultaneously.21-24 This method can be used to test the functional impact of multiple variant types, including SNPs, indels, and larger structural variants. Design of a deep mutational scanning experiment depends on the type of protein functional assay that is used. For example, for genes that encode enzymes, an enzyme reporter assay may be used. For proteins without known functions or if the measurement of interest is protein stability, then quantifying the protein levels may be desirable. The deep mutational scanning process can be divided into several steps roughly defined as mutagenesis, protein expression, mutant selection, high-throughput sequencing, and statistical analysis. A detailed deep mutational scanning protocol has recently been published by Fowler and Fields.23 Several studies have used deep mutational scanning with different mutagenesis methods, expression systems, and selection approaches to create sequence-function maps. The first step involves the synthesis of a systematic or random library of mutants that target a specific site in a protein. This step can be performed by creating oligos either designed with defined mutations or mutations introduced randomly through polymerase chain reaction (PCR) amplification. Mingo et al.25 recently developed a one-tube-only standardized site-directed mutagenesis approach. Oligo synthesis is followed by the introduction of the mutant oligo pool into an expression system. Forsyth et al.26 demonstrated the use of mammalian cell-based assays where a protein is expressed from a plasmid or viral vector. Alternatively, M13 or T7 phage-display systems can be used to display up to 1012 clones, about 1010 clones in bacteria, 106 in yeast expression systems, or more than 1012 proteins in ribosome display systems.27-30 The choice of protein expression system to use depends on features of the system, such as how well the expressed protein variant represent the phenotype, ability of the system to do appropriate post-translational modifications, and not only on the number of possible clones. A moderate selection pressure is applied that is appropriate to the protein function assessed. Variant effects have been assessed by testing their impact on protein structure, mechanism of action, catalytic or enzymatic activity (for example, phosphorylation or ubiquitination), thermodynamic stability, protein interaction, peptide binding, DNA or RNA binding, ligand binding, epitope binding, protein aggregation, or expression of a fluorescent protein.18, 21-23 Multiple studies have suggested that additional selection rounds improved accuracy of estimating the fitness for each variant.31, 32 High-throughput sequencing is then used to identify the variants with the altered phenotypic activity. During the creation of the pool of plasmids a barcoding strategy, where ∼20 bp barcodes specific to each mutation is added outside of the open reading frame, is useful for massive parallelization and correction for library amplification biases and the sequencing error rate.33 A library of 100,000 variants requires about 107 sequence reads for adequate coverage (i.e., at least 100 reads per variant).22, 23 It is important to determine the initial frequency of each mutant within the pool before selection. Enrichment of beneficial variants or depletion of deleterious variants are calculated by comparing the frequency (again determined by sequencing) of each mutant after one or multiple rounds of selection to the initial frequency. Statistical analysis is used to identify mutants that are significantly increased or decreased in frequency during selection.18, 21-24, 34 Data analysis is straightforward when direct selection is used for a protein property; for example, assessment of thermodynamic stability by thermal denaturation. Fowler et al.35 developed a freely available software package, Enrich, to convert high-throughput sequencing data into a functional score for each variant and create a sequence-function map. Recently, another software package called dms_tools was developed to infer mutation impact from mutation count data by using a likelihood-based analysis.36 However, standards for deep mutational scanning data analysis have not yet been developed.22, 23 The context in which mutations occur further complicates the prediction of phenotype from genotype. Deep mutational scanning offers an unbiased technique to test the effect of a combination of mutants at once. Mutants can display epistasis when the observed effect is different from the expected additive effect of the mutants. Mutants might together either cause an unexpected large change in activity or one variant might rescue the destabilizing effect of another.18 Hemani et al.37 estimated that pairwise epistasis explains approximately one-tenth the amount of phenotypic variance that additive effects do. Wu et al.38 developed a method to calculate the estimated mutational stability effect from double-substitution functional fitness profiles to account for the effect of variant combinations. Several challenges and potential improvements to the current deep mutational scanning approach have been suggested. Assessment of variant function is difficult for proteins for which the function is unknown; selection in those experiments may be limited to assays assessing thermostability or degradation rate of the gene product.22, 23, 34 Selection assays used during deep mutational scanning are often specific to a protein and its function being tested. Designing these assays frequently remains a challenge. For example, coupling of cell-based properties such as protein localization with high-throughput sequencing, and might not reflect the complexity of human disease. Paired-end sequencing and inclusion of replicate samples can be used to correct for the average per-base sequencing error rate of ∼1%. Furthermore, inclusion of known completely nonfunctional variants is useful for estimating error rates and can improve the accuracy of fitness estimates.21 Kowalsky et al.39 developed a standardized protocol to resolve sequence–function relationships for full-length proteins by using a gene tiling technique to divide long gene sequences into different sequencing libraries to overcome the disadvantages of short sequencing reads. Functional characterization of genetic variants with the use of deep mutational scanning, in addition to genotype–phenotype association studies, is valuable in diagnosing, treating, and understanding disease risk or prognosis.18 For example, variants of unknown significance are continuously identified in BRCA1 in cancer patients by DNA sequencing. Recently, deep mutational scanning was used in a prospective manner to measure the effects of nearly 2,000 missense substitutions in the BARD1 RING domain of BRCA1 on its E3 ubiquitin ligase activity and binding to this domain. The resulting variant functional scores were used to create a prediction model of variant effect on homology-directed DNA repair. This model will likely improve the interpretation of variants observed in the clinical sequencing of the BRCA1 gene.40 In addition to understanding the function of variants of uncertain significance, it is also important to be able to discriminate driver mutations from passenger mutations in protein domains or entire cancer-related proteins; furthermore, it is also useful to understand the impact of the mutation on the protein's function, its effect on cellular function, and its drugability. The study by Wagenaar et al.41 is another example of how deep mutational scanning can be used in precision medicine. Mutant selection with vemurafenib exposure in mammalian cells and mouse xenografts was used to identify variants in the kinase active site of BRAF that are involved in resistance to treatment of BRAFV600E-positive melanomas. The kinase activity of the BRAFV600E/L505H mutation combination was higher than that of the well-characterized V600E mutation alone. The increased kinase activity of the BRAFV600E/L505H mutation combination could result in this mutation combination being moderately resistant to mitogen-activated extracellular signal-regulated kinase (MEK) inhibition. Additional crystal structure comparisons suggest that other BRAF inhibitors will be more effective than a MEK inhibitor in eliciting a response in BRAFV600E/L505H-containing melanomas. This method could also be applied to other proteins for evaluating resistance to inhibitors.41 A deep mutational scanning method was also developed for antibody complementarity-determining regions to simultaneously determine the effect of every possible single amino acid substitution on antigen binding. This method was then applied to a humanized version of the anti-epidermal growth factor receptor antibody cetuximab. Although the majority of complementarity-determining region substitutions are neutral or deleterious to antibody interaction, 67 of the 1,060 tested point mutations increased its affinity. This approach will likely be useful in the future for the development of additional antibody therapies that target cells with specific genetic mutations or variants.26 Cystic fibrosis is an autosomal recessive genetic disorder that affects chloride transport. A genetic variant, G551D, exists in the coding region of the CF transmembrane conductance regulator (CFTR) gene. The functional consequence of this variant was discovered with the use of high-throughput assays. Although the variant does not impact transport of the CFTR protein to the cell surface, it impairs the ability of the membrane channel to open. This phenotypic effect is associated with abnormalities in the respiratory, endocrine, gastrointestinal, and reproductive systems. In cystic fibrosis patients, this phenotype can be improved with the therapeutic agent ivacaftor. Ivacaftor is a CFTR potentiator because it can alter the activity of the channel by increasing the opening probability and flow of ions. The development of ivacaftor for use in G551D variant carriers provides a further example of how high-throughput functional assays can facilitate identification of actionable drug targets and development of targeted therapeutics.42, 43 Endogenous gene expression is controlled by a variety of regulatory regions in the DNA. These regions serve as binding sites for several activator and repressor proteins and RNAs that alter gene expression. Genetic variations in these enhancer elements, transcription factor binding sites, promoter regions, and untranslated regions (UTRs) can alter the binding of these proteins and RNAs leading to changes in gene expression.44-46 The ultimate level of gene expression is the combination of the effects of all of these binding sites together. According to the 1000 Genomes Project Consortium, the median number of variants among continental population groupings range from 288,000–354,000 variants in enhancers, 748–927 in transcription factor binding sites, 82,000–102,000 in promoters, and 30,000–37,200 in the UTRs per typical human genome (see Figure 1). Estimates indicate that ∼500,000 of these variant sites are likely to be functional.1 Traditionally, the gold standard for studying this process has been to use individual reporter assays that have enhancer elements inserted upstream of a minimal promoter. The strength and effects of the enhancer elements are measured by determining the expression of a reporter gene (e.g., LacZ, luciferase, and green fluorescent protein (GFP)) that is driven by the minimal promoters and enhancers.47, 48 In addition, these assays are also standard assays for assessing the effect of genetic alterations in miRNA binding sites.45, 49 Given the large number of regulatory SNPs that need to be tested, this has led to advances towards more high-throughput functional testing. Massively parallel reporter assays are one of the high-throughput functional assays that have been used to assess genetic variants in regulatory regions. For example, using this technique, enhancer sequences containing variations at many positions in the cis core regulatory elements were synthesized using programmable microarrays and inserted into a promoter. Tagging barcodes were also inserted into the expressed sequence. These regulatory variants were then transcribed in vitro and the expression of each barcode was measured by RNA-sequencing; the expression of each barcode reflected the relative activity of the promoter variants that each barcode was tagging.50 This method was then modified by synthesizing over 100,000 enhancer variants that were cloned upstream of a minimal promoter luciferase plasmid with the barcode precloned in the 3ꞌUTR to allow for random barcoding. The library of plasmids was injected into mice via tail vein injection. The plasmids are taken up and expressed in the liver and the reporter expression was measured by RNA-seq.51 Variations of massively parallel reporter assays have been developed and utilized by several groups involving different cell lines and animal models, as well as adaptations to increase the throughput of the assay.52-60 CRISPR-Cas9-mediated in situ saturating mutagenesis has been used to assess the effects of genetic variants in the BCL11A gene enhancer. This gene is a repressor of fetal hemoglobin levels and a therapeutic target for β-hemoglobin disorders. Using the CRISPR-Cas9 nuclease system, they deleted the 12-kb enhancer of BCL11A gene using a pair of guide RNAs (gRNAs) to create paired double-strand breaks.61 To further assess single nucleotide changes and a complete knockout of the enhancer in BCL11A, they then synthesized a saturating gRNA library tiling the enhancer region. They disrupted almost all the sequences within the enhancer with Cas9 cleavage and nonhomologous end joining repair. The library was cloned into a lentiviral vector and transduced into HUDEP-2 cells at low multiplicity to achieve a single gRNA per cell. After expansion and differentiation, cells were sorted by fetal hemoglobin levels, which has been previously validated to be regulated by BCL11A. DNA was isolated, sequenced, and mapped back to the genome to assess variations in the enhancer region associated with the high- and low-fetal hemoglobin pools.61 Multiplexed editing regulatory assay (MERA) is another high-throughput functional screen for genomic variant effects on gene expression. This technique is a CRISPR-Cas9-based mutational screening tool in which adaptations have allowed for one regulatory element to be targeted per cell. This is performed through integration of a single gRNA expression construct into a universally accessible ROSA locus of mouse embryonic stem cells. This gRNA expression construct was driven by a U6 promoter driving the expression of a dummy gRNA inserted into stem cells using CRISPR-Cas9-mediated homologous recombination. A library of over 3,900 gRNAs tiling the cis-regulatory region was created for each of the four genes of interest, Nanog, Rpp25, Tdgf1, and Zfp42. Homologous recombination was used to replace the dummy gRNA with a gRNA from the library which occurred in ∼30% of cells to create a functional gRNA expression construct. GFP knock-in lines that were generated for these four stem cell-specific genes were sorted based on GFP expression and deep-sequencing of gRNA-induced mutations were analyzed to assess which mutations induced loss of GFP expression. A linear regression model was developed in order to detect statistically significant gRNAs that are expressed in the different GFP populations using the GFP targeting gRNAs as the positive control and dummy gRNAs as a negative control.62 Another high-throughput functional assay that can be adapted to study SNP effects in enhancers is STARR-seq. Self-transcribing active regulatory region sequencing (STARR-seq) involves cloning enhancer elements downstream of a minimal promoter and into the 3ꞌUTR of reporter genes. This ectopic, plasmid-based assay allows for these active enhancers to self-transcribe and become part of the reporter transcripts when transfected into cells. Expression of the transcripts, which include the inserted enhancer sequences, are measured by RNA-seq. This method was first developed and assessed using the Drosophila melanogaster genome and has the capacity to identify and quantify enhancer activity in humans.63 This method has been applied to several enhancer elements, such as hormone responsive enhancers, as well as a modified capture approach (CapSTARR-seq) in which DNA fragments are captured on a custom-designed microarray and cloned into STARR-seq vector.64, 65 Collectively, there are multiple high-throughput technologies for asses

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call