Abstract
Next-generation sequencing (NGS) technologies have dramatically expanded the breadth of genomics. Genome-scale data, once restricted to a small number of biomedical model organisms, can now be generated for virtually any species at remarkable speed and low cost. Yet non-model organisms often lack a suitable reference to map sequence reads against, making alignment-based quality control (QC) of NGS data more challenging than in cases where a well-assembled genome is already available. Here we show that by generating a rapid, non-optimized draft assembly of raw reads, it is possible to obtain reliable and informative QC metrics, thus removing the need for a high-quality reference. We use benchmark datasets generated from control samples across a range of genome sizes to illustrate that QC inferences made using draft assemblies are broadly equivalent to those made using a well-established reference, and describe QC tools routinely used in our production facility to assess the quality of NGS data from non-model organisms.
Highlights
Until 5 years ago, genomic research was largely confined to a relatively small number of taxonomic groups in which sequencing efforts were focused on a handful of model organisms.
Although some of the assemblies were fragmented, we found that quality control (QC) results such as insert size and detection of contaminants, derived from alignment of data to QC assemblies using CLC, were equivalent to those obtained after alignment to the reference genome.
The metrics derived from the unoptimized CLC draft assembly and mapping approach closely match those from reference genome mapping and deliver equivalent QC data.
Summary
Until 5 years ago, genomic research was largely confined to a relatively small number of taxonomic groups in which sequencing efforts were focused on a handful of model organisms. NGS can be affected by a range of artifacts that arise during library preparation and sequencing and that can reduce the quality of the raw data for downstream analyses. These issues include platform-specific error profiles, systematic variation in quality scores across the sequence read, biases in sequence generation driven by base composition, departure from optimal library fragment sizes, variation in the proportions of duplicate sequences introduced by PCR amplification bias, and contamination from known and unknown species other than the sequencing target (Schmieder and Edwards, 2011a; Zhou et al., 2013). One of the most popular tools for generating these quality metrics is FastQC.
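To make a few of the read-level metrics listed above concrete, the following is a minimal Python sketch that computes overall GC fraction, the fraction of exact duplicate reads, and mean quality score per read position from FASTQ text. It is an illustration only, not the implementation used by FastQC or by the authors. The function names (`parse_fastq`, `qc_summary`), the example reads, and the Phred+33 quality encoding are all assumptions introduced here.

```python
from collections import Counter
from io import StringIO


def parse_fastq(handle):
    """Yield (sequence, quality_string) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of stream
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def qc_summary(records, phred_offset=33):
    """Return GC fraction, exact-duplicate fraction, and mean quality per read position.

    phred_offset=33 is an assumption (Phred+33 encoding); adjust for other encodings.
    """
    gc = total_bases = n_reads = 0
    seen = Counter()
    qual_sums, qual_counts = [], []
    for seq, qual in records:
        n_reads += 1
        seen[seq] += 1
        gc += sum(base in "GCgc" for base in seq)
        total_bases += len(seq)
        for i, ch in enumerate(qual):
            if i == len(qual_sums):  # first read reaching this position
                qual_sums.append(0)
                qual_counts.append(0)
            qual_sums[i] += ord(ch) - phred_offset
            qual_counts[i] += 1
    duplicates = n_reads - len(seen)  # reads beyond the first copy of each sequence
    return {
        "gc_fraction": gc / total_bases if total_bases else 0.0,
        "duplicate_fraction": duplicates / n_reads if n_reads else 0.0,
        "mean_quality_by_position": [s / c for s, c in zip(qual_sums, qual_counts)],
    }


# Hypothetical three-read example; r1 and r2 share the same sequence.
EXAMPLE_FASTQ = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII5I\n@r3\nGGCC\n+\nIIII\n"
summary = qc_summary(parse_fastq(StringIO(EXAMPLE_FASTQ)))
```

Metrics such as insert size and contamination cannot be computed from the reads alone in this way; they depend on aligning reads to a reference or, as proposed here, to a draft assembly.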