Abstract

Large-scale initiatives aiming to recover the complete sequence of thousands of human genomes are currently being undertaken worldwide, contributing to the generation of a comprehensive catalog of human genetic variation. The ultimate and most ambitious goal of population-scale human genomics is the characterization of the so-called human “variome” through the identification of causal mutations or haplotypes. Several research institutions worldwide currently use genotyping assays based on Next-Generation Sequencing (NGS) for diagnostics and clinical screenings, and the widespread application of such technologies promises a major revolution in medical science. Bioinformatic analysis of human resequencing data is one of the main factors limiting the effectiveness and general applicability of NGS for clinical studies. The requirement for multiple tools, combined into dedicated protocols to accommodate different types of data (gene panels, exomes, or whole genomes), together with the high variability of the data, makes it difficult to establish a definitive strategy of general use. While several studies have already compared the sensitivity and accuracy of bioinformatic pipelines for the identification of single-nucleotide variants from resequencing data, little is known about the impact of quality assessment and read pre-processing strategies. In this work we discuss the major strengths and limitations of the various genome resequencing protocols currently used in molecular diagnostics and for the discovery of novel disease-causing mutations. By taking advantage of publicly available data, we devise and suggest a series of best practices for the pre-processing of the data that consistently improve the outcome of genotyping with minimal impact on computational costs.

Highlights

  • The steady reduction in sequencing costs associated with the advent of the new generation of ultrahigh-throughput sequencing platforms, collectively known as Next-Generation Sequencing (NGS) technologies, is one of the major drivers of the so-called “genomic revolution.” Following the development of these novel ultra-efficient sequencing technologies [see (Goodwin et al., 2016) for a comprehensive review], the number of publicly available human genome and exome sequences is in the hundreds of thousands and steadily increasing on a daily basis (Stephens et al., 2015)

  • Various “pilot” projects (Gurdasani et al., 2015; Nagasaki et al., 2015; Sidore et al., 2015; The 1000 Genomes Project Consortium et al., 2015; Lek et al., 2016) sequencing thousands of human genomes and exomes have been successfully undertaken, demonstrating the power of big-data genomics for the identification of deleterious mutations and providing a substantial contribution to the understanding of the evolutionary processes that shape the genomes of modern human populations

  • Human genome resequencing data can be highly heterogeneous due to inherent biases introduced by different library preparation protocols, sequencing platforms and experimental strategies


Summary

INTRODUCTION

The steady reduction in sequencing costs associated with the advent of the new generation of ultrahigh-throughput sequencing platforms, collectively known as Next-Generation Sequencing (NGS) technologies, is one of the major drivers of the so-called “genomic revolution.” Following the development of these novel ultra-efficient sequencing technologies [see (Goodwin et al., 2016) for a comprehensive review], the number of publicly available human genome and exome sequences is in the hundreds of thousands and steadily increasing on a daily basis (Stephens et al., 2015). This figure could be an over-estimate due to ascertainment bias, since the majority of large-scale human genome resequencing projects aimed at the detection of disease-causing mutations have been carried out by means of whole-exome sequencing (WES), and a significant proportion of the studies was focused on rare monogenic Mendelian diseases. In this respect, data produced by whole-genome sequencing (WGS) offer a more granular representation of genomic variability, facilitating a more accurate reconstruction of haplotypes, which can be instrumental for the detection of genomic loci associated with complex phenotypic traits, including diseases such as atherosclerosis, diabetes, and hypertension.

Three major approaches are commonly used for the pre-processing of reads obtained from large-scale resequencing studies: quality trimming, that is, the polishing of the reads based on descriptive statistics calculated on their quality scores; PCR de-duplication, consisting of the elimination of identical reads or read pairs that might derive from PCR amplification of the same DNA fragment; and merging of overlapping pairs, which consolidates pairs of reads originating from DNA fragments shorter than the combined length of the mates into a single longer, non-redundant sequence.
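To make these three pre-processing steps concrete, the sketch below illustrates each of them in plain Python on toy data. This is not the pipeline used in the study: the function names (sliding_window_trim, deduplicate_pairs, merge_overlapping_pair), the window size, quality threshold, minimum overlap, and mismatch tolerance are illustrative assumptions, and production workflows would normally delegate these steps to dedicated, optimized tools.

```python
# Illustrative sketch of read pre-processing: quality trimming, PCR
# de-duplication, and merging of overlapping pairs. All parameters and
# function names are placeholder assumptions, not the settings of the study.


def sliding_window_trim(seq, quals, window=4, min_q=20):
    """Trim the 3' end of a read at the first window whose mean
    Phred quality drops below min_q."""
    for start in range(0, max(1, len(seq) - window + 1)):
        win = quals[start:start + window]
        if not win:
            break
        if sum(win) / len(win) < min_q:
            return seq[:start], quals[:start]
    return seq, quals


def deduplicate_pairs(pairs):
    """Remove putative PCR duplicates: keep the first occurrence of each
    identical (read1 sequence, read2 sequence) combination."""
    seen = set()
    unique = []
    for r1, r2 in pairs:
        if (r1, r2) not in seen:
            seen.add((r1, r2))
            unique.append((r1, r2))
    return unique


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_overlapping_pair(r1, r2, min_overlap=15, max_mismatch_frac=0.1):
    """Merge a read pair from a short fragment: search for the longest
    low-mismatch overlap between read1 and the reverse complement of
    read2 and return the consolidated sequence, or None if none is found."""
    r2rc = revcomp(r2)
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        a, b = r1[-olen:], r2rc[:olen]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch_frac * olen:
            return r1 + r2rc[olen:]
    return None


if __name__ == "__main__":
    # Quality trimming: the read is cut where the 4-base window mean drops below 20.
    seq, quals = "ACGTACGTAAAA", [35, 35, 34, 33, 30, 28, 25, 22, 10, 8, 6, 5]
    print(sliding_window_trim(seq, quals))        # ('ACGTAC', [35, 35, 34, 33, 30, 28])

    # De-duplication: two identical pairs collapse into one.
    pairs = [("ACGT", "TTTT"), ("ACGT", "TTTT"), ("ACGA", "TTTT")]
    print(len(deduplicate_pairs(pairs)))          # 2 unique pairs

    # Merging: two 20 bp reads from a 28 bp fragment are consolidated.
    print(merge_overlapping_pair("ACGTACGTACGTACGTGGGG",
                                 revcomp("ACGTACGTGGGGCCCCAAAA"),
                                 min_overlap=8))  # ACGTACGTACGTACGTGGGGCCCCAAAA
```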


