Cancer genomics: new software tools making sequencing more accessible.

En-guo Chen, Pengyuan Liu, Yan Lu

doi:10.2217/pme.13.108

Abstract

Personalized Medicine, Vol. 11, No. 2. Editorial. Free Access.

En-guo Chen: Sir Run Run Shaw Hospital & The Institute for Translational Medicine, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310058, China

Pengyuan Liu: Sir Run Run Shaw Hospital & The Institute for Translational Medicine, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310058, China; Department of Physiology & the Cancer Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA

Yan Lu: Department of Physiology & the Cancer Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA; Women’s Hospital & The Institute for Translational Medicine, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310029, China

Published online: 23 Apr 2014. https://doi.org/10.2217/pme.13.108

Keywords: bioinformatics, cancer genomics, next-generation sequencing, personalized medicine, software tools

Next-generation sequencing transforms today’s cancer research

Next-generation sequencing (NGS) technologies can parallelize sequencing processes and produce millions of short-read sequences concurrently. The advent of NGS technologies has transformed today’s cancer genome research by providing an unbiased and comprehensive method of detecting somatic genome alterations [1]. The application of NGS technologies, through whole-genome, -exome and -transcriptome approaches, has led to the discovery of mutated genes that drive oncogenic phenotypes in tumors.

In a recent Pan-Cancer effort by The Cancer Genome Atlas (TCGA) [2], 127 mutated driver genes were identified from 3281 tumors across 12 tumor types.
These cancer drivers converged on several well-known pathways, such as the cell cycle and MAPK pathways. Interestingly, these drivers are also involved in cellular processes that have previously been less well characterized in cancers, such as histone modification, RNA splicing, cellular metabolism and proteolysis. Mutations in transcription factors are more tissue specific, whereas mutations in histone modifiers are shared among multiple cancer types.

The average number of driver mutations varies across tumor types, and most individual tumors have two to six mutations [2]. Generally, larger numbers of driver mutations are involved in tumors with high levels of background mutations [3]. For example, most squamous cell lung carcinomas are attributed to life-long tobacco exposure. Carcinogens from tobacco exposure can cause a broad spectrum of DNA lesions on the genome, and this increases the chance that mutations conferring a small growth advantage to cells are selected in the lung microenvironment. Many such mutations with small effects can accumulate in a specific group of cells over time and collectively lead to carcinogenesis. This is in contrast to hematologic cancers, where fewer driver mutations (with potentially large effects) are observed in tumors with low background mutations.

Driver genes are promising drug targets, and their discovery has the potential to revolutionize personalized gene-targeted treatments that guide patient therapy according to the genomic profile of the tumor. The battle of Dr Lukas Wartman at Washington University in St Louis (WUSTL; MO, USA) against leukemia is a prime example of the promise of personalized medicine [4]. When Wartman relapsed a second time, his colleagues at WUSTL sequenced the entire genome and the RNA of his cancerous cells. His RNA sequencing showed that a gene called FLT3 was unexpectedly overexpressed by his cancer cells.
It so happens that the drug Sutent®, previously approved for treating advanced kidney cancer, effectively inhibits FLT3. Sutent was prescribed to Wartman, and his leukemia was in remission 2 weeks later. In the next decade, we will witness many such cases come to the forefront of cancer treatment and soon realize the promise of personalized medicine spurred by advances in cancer genomics and associated drug development.

Software tools for analyzing cancer genomics

Computational tools for cancer genomic analysis have been actively developed to fully and accurately catalog genomic variation from the huge quantity of NGS-generated experimental data. Over 100 software tools currently exist [5], and they largely fall into three categories: alignment and assembly; variant detection; and downstream analysis.

NGS instruments generally produce short reads, meaning short sequences of approximately 200 bases. Once the raw sequence data are obtained, the first step of NGS data processing is the alignment of reads, followed by genome assembly. The aligning (or mapping) of these short reads against a reference genome is called reference mapping. Reference mapping used to be a computational bottleneck of NGS data analysis. Fortunately, many open-source short-read alignment tools have been published in the past few years and deal effectively with this problem, in spite of the exponential growth of NGS data [6]. In particular, several short-read aligners based on the Burrows–Wheeler transform (BWT), including the Burrows–Wheeler Alignment tool (BWA) [7], Bowtie [8] and SOAP [9], are extremely fast and heavily used. The BWT-based aligners can map a human genome in a matter of hours instead of the days previously required by tools such as MAQ [10].
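To make the idea behind BWT-based alignment more concrete, the following is a minimal Python sketch of exact-match read lookup using the Burrows–Wheeler transform and backward search. It is a toy illustration only, not the implementation used by BWA, Bowtie or SOAP: the function names (`bwt`, `build_index`, `count_matches`) are hypothetical, the transform is built by naively sorting rotations (impractical for a real genome), and mismatches, gaps and base qualities are ignored.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy; not memory efficient)."""
    text = text + "$"  # sentinel; sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_index(bw):
    """Build the two lookup tables needed for backward search."""
    counts = Counter(bw)
    # first[c]: number of characters in bw that sort strictly before c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bw[:i]
    occ = {c: [0] * (len(bw) + 1) for c in counts}
    for i, ch in enumerate(bw):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ


def count_matches(reference, read):
    """Count exact occurrences of `read` in `reference` by backward search."""
    bw = bwt(reference)
    first, occ = build_index(bw)
    lo, hi = 0, len(bw)  # half-open interval over the sorted rotations
    for c in reversed(read):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    reference = "GATTACAGATTACA"  # made-up reference sequence
    for read in ("ATTA", "CAG", "TTT"):
        print(read, count_matches(reference, read))
```

The search cost grows with the read length rather than the reference length once the index is built, which is one reason index-based aligners scale well to large genomes. In practice the index would be built once and reused for every read rather than rebuilt per call as in this sketch.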
These computationally efficient aligners also make analysis of NGS data more practical in clinical care, where turnaround time is critical.

Once alignment is finished, subsequent analyses are performed to detect a variety of genomic alterations in the DNA of cancer cells by comparing matched tumor–normal sequence alignments. Somatic mutation calling is more complex than germline mutation calling because of nontumor DNA contamination, cell heterogeneity and subclones in cancer samples. A number of variant detection tools have been developed; however, to date there has been no systematic benchmarking of their performance.

Somatic SNPs are the most abundant and most reliably detected variants. Currently, several software tools such as MuTect [11] and SomaticSniper [12] are available for reliable identification of somatic SNP variants. The other types of somatic variants are less reliably detected, including small insertions and deletions (InDels), large structural variations (SVs) and somatic copy number alterations (SCNAs). SAMtools [13], Pindel [14] and GATK UnifiedGenotyper [15] are commonly used for calling small InDels. It is worth noting that the local realignment of reads implemented in GATK is necessary to correct misalignments in the presence of InDels, and it reduces the error rate of subsequent InDel calling. SVs, including large InDels, inversions, tandem duplications and translocations, constitute another frequent type of alteration in tumors. Tools tailored for identifying SVs are being developed, such as CREST, which uses NGS reads with partial alignments to a reference genome to directly map SVs at the nucleotide level [16]. CREST was reported to achieve an experimental validation rate of up to 80%. Methods for detecting SCNAs, including large amplifications and deletions, using NGS data are also available [17].

After variant detection, downstream bioinformatics analyses are needed to functionally annotate somatic mutations and predict their functional relevance to cancers.
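As a concrete illustration of the matched tumor–normal comparison that underlies the somatic SNP calling discussed above, the sketch below classifies a single site from allele counts. It is a deliberately simplified toy and does not reproduce the statistical models of MuTect or SomaticSniper: the function names, the `max_normal_vaf` and `alpha` thresholds, and the use of a one-sided Fisher's exact test are all assumptions chosen for illustration.

```python
from math import comb


def fisher_one_sided(tumor_alt, tumor_ref, normal_alt, normal_ref):
    """One-sided Fisher's exact test.

    Returns the probability of seeing at least `tumor_alt` variant reads in
    the tumor if variant reads were distributed at random between samples.
    """
    n_tumor = tumor_alt + tumor_ref
    n_normal = normal_alt + normal_ref
    total_alt = tumor_alt + normal_alt
    denominator = comb(n_tumor + n_normal, total_alt)
    numerator = sum(
        comb(n_tumor, k) * comb(n_normal, total_alt - k)
        for k in range(tumor_alt, min(total_alt, n_tumor) + 1)
    )
    return numerator / denominator


def call_site(tumor_alt, tumor_ref, normal_alt, normal_ref,
              max_normal_vaf=0.02, alpha=1e-3):
    """Classify one site; thresholds are illustrative assumptions only."""
    tumor_depth = tumor_alt + tumor_ref
    normal_depth = normal_alt + normal_ref
    if tumor_depth == 0 or normal_depth == 0:
        return "no_call"
    # Variant reads in the matched normal suggest a germline variant or artifact.
    if normal_alt / normal_depth > max_normal_vaf:
        return "germline_or_artifact"
    p_value = fisher_one_sided(tumor_alt, tumor_ref, normal_alt, normal_ref)
    return "somatic" if p_value < alpha else "reference"


if __name__ == "__main__":
    # (tumor_alt, tumor_ref, normal_alt, normal_ref) read counts, made up
    for counts in [(12, 28, 0, 40), (20, 20, 18, 22), (1, 39, 0, 40)]:
        print(counts, call_site(*counts))
```

A real caller would also have to account for the complications named above, such as normal DNA contaminating the tumor sample and subclonal variants present at low allele fractions, which is part of why somatic calling is harder than germline calling.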
One of the big challenges is distinguishing driver mutations, which are required for the cancer phenotype, from passenger mutations, which accumulate through DNA replication but are irrelevant to tumor development. Generally, nonsense mutations, mutations in essential splice sites and frameshift InDels result in truncated, incomplete and nonfunctional protein products, and thus have the greatest impact on protein function. The functional impact of nonsynonymous SNPs (NSs), which constitute a large majority of somatic variants detected in tumors, is more complex and requires more detailed analysis. Various algorithms, based on phylogenetics and structural biology, have thus been developed for prediction of NS function [18]. A noteworthy database for this purpose is dbNSFP [19], which compiles functional prediction scores from multiple algorithms for every potential NS in the human genome (>80 million).

Instead of assessing the functional significance of individual mutations, several methods, such as DrGaP (Driver Genes and Pathways), determine likely biological significance by examining whether a gene is significantly mutated among tumor samples [3]. The rationale behind these approaches is that driver genes have a higher nonsilent mutation rate than the background (or passenger) mutation rate. Similar strategies can also be applied to assess the significance of mutations in a set of genes or pathways. The number of gene sets that define these pathways and processes is much smaller than the number of genes, and such sets can provide clarity to lists of genes identified through mutational analyses. However, identification of driver genes is still a major challenge, as there is substantial variation in mutation frequency and mutation spectrum across the genome and among cancer types [20].

New challenges & opportunities

NGS-driven cancer genomics promises to alter medical practice and offer improved healthcare in the foreseeable future.
New therapeutic targets, as well as diagnostic, prognostic and predictive genomic signatures, are emerging from recent NGS studies. However, the challenges of translating genome discoveries into clinical practice are numerous and formidable. Many of the challenges in translational cancer genomics can be summed up by the ‘big data’ and ‘genomic complexity’ of tumors.

To date, high-resolution and high-throughput technologies have produced mountains of genomic data. The ability to integrate these ‘omics’ data sets (i.e., genomic, epigenomic, transcriptomic and proteomic platforms) to provide system-level measurement of the genetic complexity of a tumor represents a revolutionary development in cancer genomics. Improvements in these abilities will lead to more rapid translation of genomic discoveries and improved healthcare. The TCGA consortium recently launched a Pan-Cancer Initiative that aims to integrate data sets across tumor types as well as genomic platforms. These integrated analyses will eventually be aimed at guiding clinicians to extend therapies effective in one cancer type to others with a similar molecular profile.

Each cancer genome is genetically unique and complex, characterized by many alterations ranging from single nucleotide substitutions to complex rearrangements. It thus remains difficult to correctly align short-read sequences from tumor cells with considerable chromosomal instability using reference mapping. Hence, de novo assembly remains the most powerful approach for tackling this type of genomic complexity. New assemblers are being developed that aim to overcome several bottlenecks, such as assembly quality, computer memory requirements and execution time.

As NGS becomes more affordable, individual laboratories have produced terabytes of sequence data, comparable to the output of small sequencing centers from just a few years ago.
In contrast to major genome centers such as the Broad Institute, which has high-performance computing resources and automated analysis platforms that integrate a variety of free, open-source and custom-designed software tools, individual laboratories find the analysis of huge quantities of NGS data challenging. Commercial third-party vendors such as CLC bio and DNAnexus, which adapt publicly available software, may provide alternative solutions to this challenge. In addition, cloud computing technologies have made it possible to analyze large genomic data sets in scalable and cost-effective ways. One major advantage of using cloud solutions is that individual laboratories can avoid upfront infrastructure costs without compromising the completion of their applications. It is worth noting that Illumina recently launched a specialized genomic analysis cloud platform, ‘BaseSpace’, that integrates directly with all MiSeq and HiSeq sequencing instruments. It streamlines NGS data analysis, facilitates sharing, scales for storage needs, provides security, enhances bioinformatics access with data analysis apps and automatically transfers data to the cloud in real time during a sequencing run. These new developments will make sequencing more accessible to researchers and clients, and are likely to provide further insight into the fundamental causes of cancers.

Acknowledgements

The authors would like to thank HG Vikis for reading and commenting on the manuscript.

Financial & competing interests disclosure

This work has been supported in part by start-up from Advancing a Healthier Wisconsin Fund (FP00001701 and FP00001703), the Louisiana Hope Research Grant provided by Free to Breathe, Research Affairs Committee, Women Health Research Program, National Natural Science Foundation of China (No. 81372514) and the Fundamental Research Funds for the Central Universities of China.
The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.

References

1 Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through second-generation sequencing. Nat. Rev. Genet. 11(10), 685–696 (2010).
2 Kandoth C, McLellan MD, Vandin F et al. Mutational landscape and significance across 12 major cancer types. Nature 502(7471), 333–339 (2013).
3 Hua X, Xu H, Yang Y, Zhu J, Liu P, Lu Y. DrGaP: a powerful tool for identifying driver genes and pathways in cancer sequencing studies. Am. J. Hum. Genet. 93(3), 439–451 (2013).
4 Kolata G. In treatment for leukemia, glimpses of the future. The New York Times, 7 July (2012).
5 Software/list. http://seqanswers.com/wiki/Software/list
6 Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 11(5), 473–483 (2010).
7 Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009).
8 Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), R25 (2009).
9 Li R, Li Y, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics 24(5), 713–714 (2008).
10 Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11), 1851–1858 (2008).
11 Cibulskis K, Lawrence MS, Carter SL et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31(3), 213–219 (2013).
12 Larson DE, Harris CC, Chen K et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3), 311–317 (2012).
13 Li H, Handsaker B, Wysoker A et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–2079 (2009).
14 Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25(21), 2865–2871 (2009).
15 DePristo MA, Banks E, Poplin R et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43(5), 491–498 (2011).
16 Wang J, Mullighan CG, Easton J et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat. Methods 8(8), 652–654 (2011).
17 Campbell PJ, Yachida S, Mudie LJ et al. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature 467(7319), 1109–1113 (2010).
18 Gonzalez-Perez A, Mustonen V, Reva B et al. Computational approaches to identify functional genetic variants in cancer genomes. Nat. Methods 10(8), 723–729 (2013).
19 Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum. Mutat. 32(8), 894–899 (2011).
20 Lawrence MS, Stojanov P, Polak P et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499(7457), 214–218 (2013).

Published in print March 2014. © Future Medicine Ltd

Full Text