Abstract
Large-scale initiatives aiming to recover the complete sequence of thousands of human genomes are currently being undertaken worldwide, contributing to the generation of a comprehensive catalog of human genetic variation. The ultimate and most ambitious goal of human population-scale genomics is the characterization of the so-called human “variome,” through the identification of causal mutations or haplotypes. Several research institutions worldwide currently use genotyping assays based on Next-Generation Sequencing (NGS) for diagnostics and clinical screenings, and the widespread application of such technologies promises major revolutions in medical science. Bioinformatic analysis of human resequencing data is one of the main factors limiting the effectiveness and general applicability of NGS for clinical studies. The need to combine multiple tools into dedicated protocols in order to accommodate different types of data (gene panels, exomes, or whole genomes), together with the high variability of the data, makes it difficult to establish a single strategy of general use. While several studies have already compared the sensitivity and accuracy of bioinformatic pipelines for the identification of single nucleotide variants from resequencing data, little is known about the impact of quality assessment and read pre-processing strategies. In this work we discuss the major strengths and limitations of the various genome resequencing protocols currently used in molecular diagnostics and for the discovery of novel disease-causing mutations. Taking advantage of publicly available data, we devise and suggest a series of best practices for read pre-processing that consistently improve the outcome of genotyping with minimal impact on computational costs.
Highlights
The steady reduction in sequencing costs associated with the advent of the new generation of ultrahigh-throughput sequencing platforms, collectively known as Next-Generation Sequencing (NGS) technologies, is one of the major drivers of the so-called “genomic revolution.” Following the development of these ultra-efficient sequencing technologies [see (Goodwin et al., 2016) for a comprehensive review], the number of publicly available human genome and exome sequences is in the hundreds of thousands and increases steadily on a daily basis (Stephens et al., 2015).
Various “pilot” projects (Gurdasani et al., 2015; Nagasaki et al., 2015; Sidore et al., 2015; The 1000 Genomes Project Consortium et al., 2015; Lek et al., 2016), sequencing thousands of human genomes and exomes, have been successfully undertaken, demonstrating the power of big-data genomics for the identification of deleterious mutations and providing a substantial contribution to the understanding of the evolutionary processes that shape the genomes of modern human populations.
Human genome resequencing data can be highly heterogeneous due to inherent biases introduced by different library preparation protocols, sequencing platforms, and experimental strategies.
Summary
The steady reduction in sequencing costs associated with the advent of the new generation of ultrahigh-throughput sequencing platforms, collectively known as Next-Generation Sequencing (NGS) technologies, is one of the major drivers of the so-called “genomic revolution.” Following the development of these ultra-efficient sequencing technologies [see (Goodwin et al., 2016) for a comprehensive review], the number of publicly available human genome and exome sequences is in the hundreds of thousands and increases steadily on a daily basis (Stephens et al., 2015). This figure could be an over-estimate due to ascertainment bias, since the majority of large-scale human genome resequencing projects aimed at the detection of disease-causing mutations have been carried out by means of whole-exome sequencing (WES), and a significant proportion of the studies focused on rare monogenic Mendelian diseases. In this respect, data produced by whole-genome sequencing (WGS) offer a more granular representation of genomic variability, facilitating a more accurate reconstruction of haplotypes, which can be instrumental for the detection of genomic loci associated with complex phenotypic traits, including diseases like atherosclerosis, diabetes, and hypertension. Three major approaches are commonly used for the pre-processing of reads obtained from large-scale resequencing studies: quality trimming, that is, the polishing of reads based on descriptive statistics calculated from their quality scores; PCR de-duplication, that is, the elimination of identical reads or read pairs that might derive from PCR amplification of the same DNA fragment; and merging of overlapping pairs, which consolidates pairs of reads originating from DNA fragments shorter than the combined length of the mates into a single, longer, non-redundant sequence.
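To make the three pre-processing steps concrete, the sketch below illustrates each one in a few lines of Python. The function names and parameters (quality_trim, deduplicate, merge_pair, min_q, window, min_overlap) are hypothetical simplifications written for this summary, not the tools evaluated in this work; real pipelines rely on dedicated, error-tolerant software for trimming, duplicate marking, and read merging, and the exact-match overlap detection shown here is only a toy approximation.

```python
# Minimal, illustrative sketch of the three read pre-processing steps
# described above: quality trimming, PCR de-duplication, and merging of
# overlapping read pairs. Quality scores are assumed to be decoded Phred
# integers; all thresholds are arbitrary example values.

def quality_trim(seq, quals, min_q=20, window=4):
    """Clip the 3' end of a read where the mean quality in a sliding
    window first drops below min_q."""
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window < min_q:
            return seq[:i], quals[:i]
    return seq, quals


def deduplicate(read_pairs):
    """Keep one representative per identical (R1, R2) sequence pair,
    treating the extra copies as putative PCR duplicates."""
    seen, unique = set(), []
    for pair in read_pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(r1, r2, min_overlap=10):
    """Merge a pair whose mates overlap (fragment shorter than the combined
    mate length); returns None when no exact overlap is found."""
    r2_fwd = reverse_complement(r2)          # put R2 on the same strand as R1
    for olap in range(min(len(r1), len(r2_fwd)), min_overlap - 1, -1):
        if r1[-olap:] == r2_fwd[:olap]:
            return r1 + r2_fwd[olap:]        # single, non-redundant sequence
    return None


if __name__ == "__main__":
    # Quality trimming: the low-quality tail (mean Q < 20) is removed.
    print(quality_trim("ACGTACGTACGT", [38, 37, 36, 35, 30, 28, 25, 22, 15, 10, 8, 5]))
    # Overlap merging: two 20 bp mates from a 30 bp fragment share 10 bp.
    print(merge_pair("GATTACAGATCCAGTTCAGG", "TAATCTGTAACCTGAACTGG"))
```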