Abstract

Short Read (Next-Generation) Sequencing: A Tutorial With Cardiomyopathy Diagnostics as an Exemplar

Jaya Punetha, MS, and Eric P. Hoffman, PhD

From the Department of Integrative Systems Biology, The George Washington University School of Medicine (J.P., E.P.H.); and Research Center for Genetic Medicine, Children’s National Medical Center, Washington, DC (J.P., E.P.H.).

Circulation: Cardiovascular Genetics. 2013;6:427–434 (Vol. 6, No. 4). Research Article. Originally published 14 Jul 2013. https://doi.org/10.1161/CIRCGENETICS.113.000085. Other version of this article: Previous Version 1, August 20, 2013.

Introduction

Rapid advances in DNA sequencing technologies have made it increasingly cost-effective to obtain accurate and timely large-scale genomic sequence data on individuals (short read massively parallel or next generation [next-gen]). A next-gen molecular diagnostic approach that has seen rapid deployment in the clinic over the last year is exome sequencing. Whole exome sequencing covers all protein-coding genes in the genome (≈1.1% of genome), and an exome test for a single patient generates ≈6 gigabases (10^9 bp) of DNA sequence data. A key challenge facing routine use of next-gen data in patient diagnosis and management is data interpretation.
What sequence variant findings are relevant to diagnosis (pathogenic mutations)? What sequence variant findings are relevant to clinical care but not necessarily to patient diagnosis (clinically actionable incidental data)? What sequence information should be stored, and where can it be stored? This review provides a tutorial on current approaches to answering these questions. A recent landmark study that applied next-gen sequencing to a large cohort of patients with idiopathic dilated cardiomyopathy found mutations of the titin gene, the most complex gene in the genome (363 exons), in ≈27% of patients. We use titin in cardiomyopathy as an exemplar for explaining next-gen sequencing approaches and data interpretation.

Comparing Sequencing Strategies

Decreasing sequencing costs and broad dissemination of next-generation (next-gen) equipment and expertise are increasing availability of massively parallel sequencing of patient DNA samples (short read massively parallel or next-gen sequencing).1,2 Most rapidly expanding is exome sequencing, where all protein-coding sequences (exons) are selected from total genomic DNA and selectively sequenced.3 Alternative approaches to next-gen sequencing include targeted sequencing (TS) and whole genome (complete genome) sequencing. Currently marketed targeted Sanger sequencing panels, which use traditional exon-by-exon sequencing, remain expensive and time consuming, and massively parallel next-gen approaches are beginning to supplant Sanger sequencing in the clinic (Figure 1).4

Figure 1. Schematic comparison of Sanger sequencing and next-generation sequencing technologies.5 For the purpose of molecular diagnostics, genomic DNA (gDNA) is generally the starting material for both techniques. Sanger sequencing (left): A, The first step for sequencing by this method is to design target primers for a specific region in a gene of interest, preferably between 300 and 800 bp.
Polymerase chain reaction (PCR) amplification with target primers and genomic DNA is then performed, and the PCR product is visualized on an agarose gel to confirm predicted size. B, The sequencing reaction is performed using fluorescently labeled ddNTPs (dideoxynucleotide triphosphates) as chain terminators, DNA polymerase, dNTPs (deoxynucleotide triphosphates), PCR product generated in the earlier step, and primers (either the forward primer or the reverse primer in each reaction). C, As DNA polymerase adds nucleotides to the denatured strand, it randomly incorporates labeled ddNTPs, which terminate the reaction, resulting in a large number of fragments of various sizes. These fragments are then subjected to capillary electrophoresis to separate them by size. Shorter fragments travel faster toward the positively charged electrode, whereas longer fragments move more slowly. D, The last fluorescently labeled ddNTP for each fragment is recorded as a colored peak, which is used to generate a sequencing trace, with each base assigned a specific color based on the fluorescent dye. Next-generation sequencing (right): The methods depicted in this figure are based on second-generation sequencing with the Illumina platform. A, The first step of this method is to fragment genomic DNA to a uniform size. Generally, fragmentation is performed using adaptive focused acoustics technology on the Covaris instrument. The required size distribution is then confirmed on an agarose gel; the image on the right shows sheared DNA. B, Sequence enrichment is performed for targeted or exome sequencing; whole genome sequencing does not require enrichment. To make the library sequencing-ready, adapters are ligated to both ends of the DNA. The different colored adapters (pink and blue) reflect a sequencing adapter and a barcoding adapter. After individual samples are barcoded, they can be pooled for sequencing to optimize sequencing costs.
C, The library is then immobilized on an array (flow cell) where bridge amplification occurs to generate clusters (clonal libraries). D, Sequencing by synthesis technology is used to detect each base by fluorescently labeled dNTPs. The image shows individual clusters as displayed by Illumina's Sequencing Analysis Viewer software during sequencing. A small section of the image (yellow box) is magnified, and each of the fluorescent dots represents a library cluster. Base call files (images) are then converted to .fastq files (DNA letters) that can be aligned to the genome. The average number of reads from an individual’s exome sequence (minimum coverage of ×50) is ≈40 million (in-house sequencing data).

Here, we compare and contrast the 3 next-gen approaches to molecular diagnostics: TS panels, whole exome sequencing (WES), and whole genome sequencing (WGS; Table 1). Briefly, TS selects candidate genes that are already known to cause the disorder in question and targets only these for extensive sequence analyses (≈200 kb to 1 million bp; ≈200–2000 exons). Exome sequencing (WES) pulls out most exons (coding sequence) of all genes in the genome for sequencing (≈30 million bp; ≈180 000 exons). WGS indiscriminately sequences all 6 billion bp of DNA in a patient, including the large majority of DNA that does not code for proteins and remains problematic to interpret.

Table 1.
Comparison of Existing Next-Generation Sequencing Strategies for Molecular Diagnosis of a Disease

| | Targeted Sequencing | Whole Exome Sequencing | Whole Genome Sequencing |
| --- | --- | --- | --- |
| Approach | Create a custom panel to target a select number of genes to study a genetic disease | Use exome capture kits to target the protein-coding genes of the genome (1.1% of total genome) | Whole genome re-sequencing by fragmenting genomic DNA and sequencing |
| Approx size | 500 kb–3 Mb | 44–62 Mb (most kits include splice junctions) | 3.1 Gb |
| Capture technique | Molecular inversion probes; PCR based (RainDance, Fluidigm); hybrid capture (Agilent, NimbleGen)6 | Hybridization with biotinylated oligonucleotide baits; different capture techniques based on DNA/RNA baits (NimbleGen, Agilent, Illumina)7 | Services for whole genome re-sequencing provided by CompleteGenomics, BGIAmericas, Illumina, Knome, etc; most include mitochondrial DNA sequencing too |
| Advantages | Greater sensitivity as higher coverage; lower drop-out rate of exons (increases specificity for regions of interest); easier assembly and analysis; quick, cost-effective | High chance of finding a mutation in all the protein-coding regions; cost-effective, data interpretation easier than WGS; can discover a new gene associated with disease | Covers everything: can identify pathogenic variants, including structural variation for all genetic diseases; sequence enrichment not required |
| Disadvantages | Pathogenic mutations could be in genes not covered; can miss large indels/duplications and CNVs | High drop-out rate of exons; analysis and interpretation of data can get complicated; CNVs and large indels/duplications not detected | Assembly and analysis very challenging; need huge compute power for this analysis, data size is 15× of WES data8; not cost-effective, more expensive for higher coverage data |
| Success | Rapid, cost-effective, successfully used for relatively common genetic diseases as a first-pass screening technique | Higher success rate when exomes of parents of proband also sequenced, and when used in conjunction with TS panels for dropped out regions and mtDNA | Still in nascent phase, will revolutionize the field of human genetics in the future |

CNV indicates copy number variation; PCR, polymerase chain reaction; TS, targeted sequencing; WES, whole exome sequencing; and WGS, whole genome sequencing.

Both targeted next-gen sequencing and whole exome next-gen sequencing have recently entered the molecular diagnostics workspace, with multiple private and academic labs offering next-gen sequencing on a clinical, fee-for-service basis.9,10 The costs of generating targeted and WES data are not very different. If targeted panels covering ≈50 genes cost about the same as WES covering 20 000 genes, then it would seem that “more is better,” and the exome will become the standard diagnostic test. However, there are some technical differences in how the data are generated and interpreted that make the targeted versus exome choice less clear cut. The key issue is that TS is able to cover 99% of the candidate genes, whereas exomes cover only ≈90% of the same candidate genes (as well as 90% of all other genes; in-house sequencing data). This ≈10% difference in sensitivity is significant. For example, as described in more detail below, if a patient with dilated cardiomyopathy (DCM) comes into your clinic and you suspect a titin gene mutation, you may wish to confirm this suspicion by next-gen sequencing. Ordering a TS panel that includes the titin gene will cover 99% of the 363 exons and likely rule in or rule out titin as a cause of the patient’s disorder. However, sending the same patient out for a whole exome analysis will miss ≈1 of 10 exons; so for the 363 exon titin gene, ≈36 exons will be missing from the data generated (often called exon drop outs). Thus, a negative result could reflect missing data caused by drop out, and so would not rule out a titin mutation.
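The drop-out arithmetic in the titin example can be written out as a short calculation. This is only a sketch: the function name is hypothetical, and it assumes every exon is covered independently with the same probability, which real capture kits do not guarantee. The exon count (363) and the coverage fractions (99% for TS, ≈90% for WES) are the figures quoted in the text.

```python
def expected_missing_exons(n_exons: int, coverage_fraction: float) -> float:
    """Expected number of exons with no usable data.

    Assumption (not from the article): each exon is covered independently
    with the same probability `coverage_fraction`.
    """
    return n_exons * (1.0 - coverage_fraction)


TTN_EXONS = 363  # exon count for titin quoted in the text

for label, coverage in [("targeted panel", 0.99), ("whole exome", 0.90)]:
    missing = expected_missing_exons(TTN_EXONS, coverage)
    print(f"{label}: about {missing:.0f} of {TTN_EXONS} titin exons expected to drop out")
```

Under these assumptions, a negative exome result leaves roughly one tenth of the gene unexamined, which is why it cannot by itself exclude a titin mutation.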
The higher sensitivity of TS panels has been successfully shown in the diagnosis of cardiomyopathies by Meder et al.11

In addition to the issue of differential drop out, TS and exome sequencing can differ in terms of accuracy of detection of mutations and copy number variations (CNVs). Accuracy for both small mutations (single base changes, small deletions/duplications) and CNVs depends on read depth. Depth refers to the number of independent sequence reads generated for each specific region of the genome tested. A typical exome has a median read-depth of 50; each exon shows ≈50 independent reads. In contrast, targeted re-sequencing panels may achieve a median read-depth of ≥500 reads per exon (10× greater than exomes). In general, the greater the depth, the more accurate the detection of mutations, and the easier it is to detect CNVs. It is possible to sequence an exome to the same depth as TS, but this increases the machine time allocated per sample, and thus cost.

Whole complete genome sequencing has not yet entered the molecular diagnostics workspace. There has been a single company marketing WGS, Complete Genomics, but uptake has been limited. The Complete Genomics method that has been used to generate most whole genomes produces large numbers (120 billion) of short (33 bp) reads, shorter than other next-gen sequencing platforms.12 Interpretation of pathogenic variants in the large majority of noncoding sequence (99% of genome) remains difficult, and this results in relatively little interest in generating these data in the clinical setting.
However, as sequencing technologies continue to rapidly advance, and costs decline, this may change in the not so distant future.

Variant Calling: Benign, Pathogenic, Variant of Unknown Significance, or Incidental

Once next-gen sequencing data are obtained with adequate depth, bioinformatic interpretation of the data begins by defining potential mutations, a process dubbed variant calling (Figure 2). Variants are typically grouped into 4 categories: benign, pathogenic, variant of unknown significance (VOUS/VUS), or incidental findings (IFs). Pathogenic variants are mutations: variants in a gene that are likely or known to cause the patient’s symptoms (Table 2). Benign polymorphisms are variants that are common in healthy populations and are not considered candidates for disease-causing variants. Variants of unknown significance are sequence changes that are not seen at high frequency in general populations, but where there is little or no supportive evidence of the variant being pathogenic. An IF is defined as a sequence variant that has potential health or reproductive significance to the patient being tested but is secondary to the patient’s primary clinical complaint (eg, a variant known to be associated with drug sensitivity but unrelated to the complaint of cardiac symptoms; Table 2).13

Figure 2. Analysis pipeline for whole exome sequencing (WES) data used for molecular diagnostics of dilated cardiomyopathy (DCM). This pipeline is for data from the Illumina platform using the commercially available software NextGENe (SoftGenetics, PA). Raw data from Illumina are in .bcl (basecall) format, which is converted to .fastq format using Illumina’s software CASAVA. If samples were pooled per lane, they can be demultiplexed based on the barcode adapters attached to specific samples. Reads are filtered for a minimum quality score of 30, which reflects a single error in 1000 bases.
The .fastq files can then be imported to NextGENe, where the first step is conversion to .fasta format. The .fasta files for each paired read are then merged together and aligned to the reference human genome using custom alignment and variant calling algorithms. The variants are annotated using the database for non-synonymous functional predictions (dbNSFP).18 The dbNSFP combines prediction scores from different prediction algorithms (SIFT [Sorting Intolerant From Tolerant], Polyphen-2, Mutation Taster, likelihood ratio test [LRT]) and includes conservation scores for all non-synonymous single-nucleotide polymorphisms (SNPs) in the human genome. This enables comparison of different functional prediction algorithms at once and gives a probability score ranging from 0 to 1 (0 being benign and 1 being disease causing). For detection of pathogenic variants, further variant filtration is required. This is achieved by looking at only the coding sequences (CDS) and splice junctions (±5 bases from exons), removing synonymous SNPs, and removing polymorphisms already reported in dbSNP and 1000 Genomes. We then create a filter to first look at the 51 genes already reported to cause DCM, as there is a greater probability of finding a mutation in 1 of these genes given the patient’s phenotype. If variants are detected in the known DCM-causing genes, the variants are confirmed by Sanger sequencing and cross-referenced with the patient’s physician. Because exome sequencing has a high rate of false positives, confirming that the variant is not a polymorphism on the exome variant server44 is the next step. If no variants in known genes are detected in the patient, we then proceed to analyzing variants in the whole exome.

Table 2.
Classification of Variants Obtained From NGS Data Using TTN as an Example

| Variant | Definition | Example | Significance | Report |
| --- | --- | --- | --- | --- |
| Pathogenic | Disease-causing mutation | Frame-shifting variant in TTN A-band | Diagnosis | Yes |
| Benign polymorphism | Natural variation in DNA with no known adverse effect | Variant in TTN present in dbSNP with high frequency | Could be a genetic modifier in future studies | No |
| VOUS/VUS | Variant of uncertain or unknown significance | Missense variant in a less evolutionarily conserved region of TTN | May change status to pathogenic with additional data | Yes |
| Incidental findings | Variants that have been associated with some risk of a disease state but are not relevant to the patient’s diagnostic questions | Carrier state for mutations in CFTR gene | Not diagnostic. Assessment of clinically actionable defined by ACMG guidelines | Yes, if clinically actionable |
| Genetic modifier | Variant that modifies a disease state because of a second gene | Hypothetical variant in a gene that increases or decreases TTN expression | Genetic modifiers of monogenic disorders remain few | No |

ACMG indicates American College of Medical Genetics and Genomics; CFTR, cystic fibrosis transmembrane conductance regulator; dbSNP, single-nucleotide polymorphism database; NGS, next-generation sequencing; TTN, titin; and VOUS, variant of unknown significance.

A growing number of software tools and database resources greatly aid in variant calling and variant filtering, and these are increasingly adept at predicting which variant has the highest probability of causing disease. Predictive software, such as Mutation Taster,14 Polyphen-2,15 likelihood ratio test16 (of codon constraint), and SIFT (Sorting Intolerant From Tolerant),17 takes into account evolutionary conservation and amino acid substitution to predict the probability of the mutation being disease causing or benign. Different functional prediction tools use different methods to assign likelihood of functional significance, with each tool having its own strengths and weaknesses.
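The filtering steps described for the Figure 2 pipeline (keep coding sequence and splice junctions within ±5 bases, drop synonymous changes, drop variants already in dbSNP or 1000 Genomes, then look first at known DCM genes) can be sketched in a few lines. This is an illustration only and not the NextGENe implementation: the `Variant` fields, the region labels, and the function names are hypothetical, and the gene set is a three-gene placeholder built from genes named in this article, not the 51-gene DCM list.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    region: str           # hypothetical labels: "CDS", "splice", or "other"
    splice_offset: int    # bases from the nearest exon boundary; 0 inside CDS
    consequence: str      # e.g. "synonymous", "missense", "frameshift"
    in_dbsnp: bool
    in_1000_genomes: bool


# Placeholder panel (assumption): NOT the full list of 51 DCM genes.
DCM_GENES = {"TTN", "DMD", "MYH7"}


def passes_filters(v: Variant) -> bool:
    """Apply the region, consequence, and known-polymorphism filters."""
    in_target = v.region == "CDS" or (
        v.region == "splice" and abs(v.splice_offset) <= 5
    )
    if not in_target:
        return False
    if v.consequence == "synonymous":
        return False
    if v.in_dbsnp or v.in_1000_genomes:
        return False
    return True


def prioritize(variants):
    """Return filtered variants in known DCM genes, else all filtered variants."""
    kept = [v for v in variants if passes_filters(v)]
    known = [v for v in kept if v.gene in DCM_GENES]
    return known if known else kept


variants = [
    Variant("TTN", "CDS", 0, "frameshift", False, False),
    Variant("TTN", "CDS", 0, "synonymous", False, False),
    Variant("MYH7", "CDS", 0, "missense", True, False),
    Variant("TTN", "splice", 12, "splice_region", False, False),
    Variant("GENE_X", "CDS", 0, "missense", False, False),  # made-up gene name
]

for v in prioritize(variants):
    print(v.gene, v.consequence)
```

Falling back to the whole filtered list when no known-gene variant survives mirrors the text's last pipeline step, in which analysis proceeds to the whole exome. Any variant that survives would still need Sanger confirmation as the article describes.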
The database of non-synonymous functional prediction18 includes functional prediction scores from 4 predictive software tools (MutationTaster, Polyphen-2, SIFT [Sorting Intolerant From Tolerant], and likelihood ratio test) and a conservation score (PhyloP [phylogenetic P-values]). Therefore, it gives likelihood scores for functional prediction of non-synonymous single-nucleotide polymorphisms from multiple databases with a single query.

Different mutations in a single gene can cause varying phenotypes (phenotypic and allelic heterogeneity), and the bioinformatics pipeline to classify variants as likely pathogenic, possibly pathogenic or variant of unknown significance, benign polymorphism, or clinically actionable (or nonactionable) incidental is not black and white. Typically, the larger a gene, the more variants it contains, so categorizing variants and defining what is pathogenic is more difficult for larger genes like titin. We define and discuss the classification of variants into these categories, using titin as an example in Table 2. The clinical phenotype of a patient can also be considered a highly relevant filter for classifying variants. A top candidate gene can emerge from the list of genes containing potentially pathogenic (ie, disease-causative) variants when the patient’s clinical symptoms are concordant with previous reports.

Confirming variants as pathogenic is easier if the variants are previously reported and known to be causative of the disease being studied. With novel variants in genes related to the disease where the variant is predicted to be pathogenic (evolutionarily conserved region or frame-shifting variant), further analysis may be required.
Testing can include segregation analysis in families, functional testing for reduced protein levels in tissue/biopsy, checking databases for known polymorphisms, or in vitro protein remodeling studies.19

CNVs From Next-Gen Data

Detection of deletions or duplications of exons (CNVs) is necessary for high sensitivity of detection for all types of mutations. With the aid of newer bioinformatics tools, it is possible to detect CNVs in exome sequencing datasets using CNV calling algorithms; however, these tools are still being evaluated before moving to the clinical realm.20–22 Next-gen sequencing does not detect CNVs directly via sequence data, but the number of sequence reads per exon can be used to infer the presence of CNVs. For each exon, 1 patient’s reads per kilobase of exon model per million mapped reads can be compared, as a log2 ratio, with the average of the other patients, with significant deviations from baseline indicating the presence of a deletion or duplication.23 The accuracy and sensitivity of all CNV detection methods depend on the depth of the sequence data, where targeted next-gen data typically give greater depth, and hence greater sensitivity, for CNVs compared with exome or whole genome data.

Interpretation of CNVs and small mutation pathogenic data is often simplified by the sequencing of parents and siblings, as well as the proband. Having the sequence data of the parents permits the testing of different inheritance models.
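The read-depth comparison used for CNV detection, described above, can be made concrete with a small sketch. It computes reads per kilobase of exon model per million mapped reads (RPKM) and the per-exon log2 ratio of one patient against the mean of reference samples. The function names, the pseudocount, and the `loss`/`gain` thresholds are assumptions chosen for illustration; they are not taken from the article or from any specific CNV caller.

```python
import math


def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def log2_ratios(patient_rpkm, reference_rpkms, pseudocount=0.01):
    """Per-exon log2 ratio of the patient against the mean of reference samples.

    `reference_rpkms` holds one list per reference sample, in the same exon
    order as `patient_rpkm`. The pseudocount (an assumption) avoids log of zero.
    """
    ratios = []
    for i, value in enumerate(patient_rpkm):
        baseline = sum(sample[i] for sample in reference_rpkms) / len(reference_rpkms)
        ratios.append(math.log2((value + pseudocount) / (baseline + pseudocount)))
    return ratios


def call_cnv(log2_ratio: float, loss: float = -0.6, gain: float = 0.4) -> str:
    """Label an exon using hypothetical thresholds (not validated cut-offs)."""
    if log2_ratio <= loss:
        return "loss"
    if log2_ratio >= gain:
        return "gain"
    return "normal"


# Toy data: exon 2 at half the baseline depth, exon 3 at 1.5x the baseline.
patient = [100.0, 50.0, 150.0]
references = [[100.0, 100.0, 100.0], [100.0, 100.0, 100.0]]

for exon, ratio in enumerate(log2_ratios(patient, references), start=1):
    print(f"exon {exon}: log2 ratio {ratio:+.2f} -> {call_cnv(ratio)}")
```

A heterozygous deletion is expected near a log2 ratio of -1 (half the reads) and a single-copy duplication near +0.58, which is why low read depth, with its noisier ratios, reduces sensitivity.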
For example, de novo mutations are increasingly seen as a cause of genetic disease.24,25 However, detection of de novo mutational events is all but impossible without the sequence of both parents in hand.

The current ambiguities concerning sensitivity and specificity of detecting and reporting pathogenic mutations and CNVs, as well as incidental data, require consent forms that speak to these ambiguities.26,27 Before offering WGS tests, medical diagnostic laboratories have to be equipped with detailed consent forms and resources for data storage, analysis, and interpretation.28

IFs: What Is Considered Clinically Actionable?

Incidental findings (IFs) are variants that have defined functions associated with them, such as the sequence change in the Factor V clotting protein that is associated with changes in thrombosis in general populations (Factor V Leiden; ≈5% of individuals of European descent).29–31 Another example of an IF is identification of the carrier state for the common deltaF508 cystic fibrosis mutation32 in a patient where cystic fibrosis is not in the differential diagnosis. A key issue with incidental data is whether they are actionable and how the definition of actionable changes as a function of the patient’s age. For example, to a child, neither Factor V Leiden nor deltaF508 CF carrier status is clinically actionable (nothing would be done differently in the clinical management of the child given knowledge or lack of knowledge of these genotypes). Yet, both become clinically actionable in the context of a young adult. A pregnant woman who is a carrier for Factor V Leiden has a 5- to 10-fold increase in the risk of venous thrombosis events, whereas homozygotes have nearly a 100-fold increased risk.33 Similarly, the carrier status for deltaF508 becomes clinically actionable when considering pregnancy and having children, and genetic counseling of the carrier is warranted.
The definition of clinically actionable versus nonactionable incidental data from next-gen sequencing studies is under active discussion by national regulatory and academic groups and is hotly debated.34–36 Recently, the American College of Medical Genetics and Genomics issued a set of guidelines for reporting of IFs after exome and WGS testing in a clinical setting.37 The authors provide a list of 57 genes where known pathogenic or expected pathogenic mutations should be reported back to the physician, although these genes are not related to the initial clinical complaint or differential diagnosis of the patient (ie, incidental data). Some examples include BRCA1 (breast cancer 1, early onset risk for breast cancer), MYH7 (cardiomyopathy risk), and RYR1 (malignant hyperthermia susceptibility). There are concerns that this will increase costs of these tests in the future because of increased burden on data analyses and interpretation. For example, the RYR1 gene has 106 exons,38 and variants are frequently found in this gene by exome sequencing. Assigning pathogenic significance to a variant in RYR1 in terms of risk for malignant hyperthermia is extraordinarily challenging. One must also question the cost/benefit of reporting back RYR1 expected pathogenic incidental data, as the risk for malignant hyperthermia generally arises only after halothane anesthetics, and only a small subset of the population is ever exposed to these.39–41

A recent publication outlines a proposed bioinformatics tool that, with minimal researcher effort, looks for IFs (defined as variants known to cause Mendelian diseases) in WGS data as a possible solution to reduce the burden on laboratory personnel.42 The interpretation, reporting, and retention of IFs in medical records are currently quite variable, and most of this work remains in the context of research studies rather than routine molecular diagnostics.
It will take a while for laboratories and institutional review boards to incorporate the new American College of Medical Genetics and Genomics guidelines.

Data Storage

WES on an individual generates ≈6 to 12 gigabases (10^9 bp) of data with >30 000 variants. WES data analysis strategies focus on filtering out variants at different levels to only look at novel mutations in coding regions for prioritizing possible disease-causing variants.43 With increasing amounts of data from next-generation sequencing efforts shared across research sites, the database grows and data interpretation of new samples becomes more robust. One can easily imagine that integration of next-gen data from hundreds of thousands of individuals worldwide creates a powerful world genetic knowledge base, where defining the nature of any particular variant becomes increasingly accurate.

A database specific for WES data, the Exome Variant Server by NHLBI GO Exome Sequencing Project, has become an important resource for filtering variants as it contains data from WES (non-pathogenic variants) of >6000 individuals from black and European American populations.44 Although database resources are quickly improving, there remain technical differences between different next-gen sequencing equipment and bioinformatics methods that make the bioinformatics data resources a work in progress. For example, there was a recent report of a study where the same DNA samples were sent to a number of the top sequencing laboratories in the world, with comparison of the variant classification between sites. There was surprisingly little agreement in the variant analysis of the same individual (www.rd-neuromics.eu; January 24, 2013 workshop).

Exemplar in Cardiogenomics: Genetics of DCM

Dilated cardiomyopathy (also CMD) is a condition in which the heart’s ability to pump blood is lessened because of its weakened and enlarged state, eventually leading to heart failure.
DCM is characterized by enlargement of the left ventricle and systolic dysfunction. It is the third most common cause of heart failure with an estimated frequency of 1:2500 in the general population.45 Around 20% to 48% of idiopathic DCM cases were found to have an underlying genetic cause (familial DCM/familial dilated cardiomyopathy [FDC]).46–49 FDC was estimated to be found in 20% to 35% of first-degree relatives of patients diagnosed with idiopathic DCM50 and may be inherited in autosomal dominant, autosomal recessive, X-linked, or mitochondrial inheritance patterns.51 To date, mutations in 51 genes have been reported to cause DCM.52 Diverse inheritance patterns and a large number of causative genes make molecular diagnostics of DCM a challenge.

Until recently, patients and physicians only had the option of iteratively sequencing genes 1 at a time with high costs and long turn-around times. This involved send-outs to multiple laboratories and companies, often internationally, with each negative result leaving a long list of genes yet to be tested. This was especially challenging for cardiomyopathies because of the inclusion of some of the largest and most complex genes in the genome (DMD, dystrophin; TTN, titin).

Over the last few years, several DCM gene testing panels have been introduced commercially in the United States: Harvard Medical School and Partners Healthcare DCM panel, 28 genes; GeneDx DCM panel, 38 genes; AmbryGenetics DCM panel, 37 genes; and Transgenomic DCM panel, 13 genes (GeneTests; DCM). These panels offer TS for coding regions (exons) of genes related to DCM. The results are available in a timely manner and are relatively cost-effective (≈$4000) as multiple genes are tested at once.

An example of where next-gen sequencing is making a major impact on the diagnosis of DCM is with regard to mutations in the titin gene and protein.
The titin protein, as its name implies, is the largest known protein in humans, with ≈34 000 amino acids, and a molecular weight of ≈3810 kDa (100× larger than average). For comparison, the average protein in the human body is ≈300 amino acids and 30 kDa in molecular weight.53 The gene encoding titin is also the most complex in the genome, with 363 exons used in many different alternatively spliced transcripts (Figure 3).54 The titin protein is a key component of muscle tissue (both skeletal muscle and cardiac muscle), where it serves as a structural and functional linker between adjacent sarcomeres (Figure 3) and gives structural integrity to the characteristic striated appearance of the actin/myosin network.55–57
