Abstract
Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20 to 75 × genomic depth and with N50 subread lengths of 11–21 kb. Assemblies with ≤30 × depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20 × depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature.
Highlights
Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes
To identify an optimal assembly approach for this study, the complete data from NC358 were assembled with each of Falcon[21], Canu[22], WTDBG2[23], and hybrid approaches in which Falcon was used for error correction and Canu, Flye[24], and Peregrine[25] were used for assembly (Supplementary Table 2)
We evaluated the completeness of gene-rich regions in each of the assemblies using BUSCO[28]
Summary
Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. An assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20 to 75 × genomic depth and with N50 subread lengths of 11–21 kb. Recent long-read assemblies in maize show considerable improvement in both completeness and contiguity relative to previous efforts[13,14,15,16], suggesting these data are useful for plant species like maize with genomes that are large (2.3 Gb), complex (paleopolyploid and composed primarily of TEs), and highly repetitive (extensive tandem sequence arrays in heterochromatic knobs and centromeres). We conduct a comprehensive assembly experiment using subsets of a high-depth, long-read (PacBio) dataset for the maize inbred line NC358 to evaluate critical inflection points of quality during the assembly of a complex, repeat-rich genome.
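The two quantities that define each subset, genomic depth and N50 subread length, can be computed directly from a list of read lengths. The following is a minimal sketch, not the authors' pipeline: the function names are hypothetical, the random downsampling strategy is an assumption (the summary does not say how the subsets were drawn), and the read lengths in the usage note are toy values.

```python
import random


def n50(lengths):
    """Return the N50: the largest length L such that reads of
    length >= L together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def depth(lengths, genome_size):
    """Genomic depth: total sequenced bases divided by genome size."""
    return sum(lengths) / genome_size


def downsample_to_depth(lengths, genome_size, target_depth, seed=0):
    """Randomly keep reads until the target depth is reached.

    Assumption: uniform random subsampling. The study may have
    selected its subsets differently (for example, by read length).
    """
    rng = random.Random(seed)
    shuffled = list(lengths)
    rng.shuffle(shuffled)
    budget = target_depth * genome_size
    kept, total = [], 0
    for length in shuffled:
        if total >= budget:
            break
        kept.append(length)
        total += length
    return kept
```

For a 2.3 Gb genome such as maize, `depth(read_lengths, 2_300_000_000)` gives the depth of a full dataset, and `downsample_to_depth(read_lengths, 2_300_000_000, 20)` draws a subset at roughly 20 × depth whose `n50` can then be compared with that of the full set.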