Abstract

Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to “phase 3 finished” status using time-consuming and expensive Sanger-based manual finishing processes. To simplify the assembly and finishing of draft genomes, we present an automated approach to finishing that uses long reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly (publicly available at https://sourceforge.net/projects/pb-jelly/), automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides “lift-over” coordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24× mapped coverage of PacBio long reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long reads, we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long reads, we addressed 97% of gaps, closed 66% of addressed gaps, and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
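
To make the notion of a "gap" concrete: draft assemblies conventionally represent unresolved sequence inside a scaffold as a run of N characters. The sketch below locates such runs in a scaffold sequence. It is an illustration under that assumption only, not PBJelly's implementation; the names `Gap` and `find_gaps` and the example scaffold are hypothetical.

```python
import re
from typing import Iterator, NamedTuple


class Gap(NamedTuple):
    """A run of N characters inside a scaffold (hypothetical helper type)."""
    scaffold: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def find_gaps(name: str, sequence: str, min_len: int = 1) -> Iterator[Gap]:
    """Yield every run of N/n in `sequence` that is at least `min_len` long."""
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_len:
            yield Gap(name, match.start(), match.end())


# Toy scaffold with two gaps; real input would be read from a FASTA file.
gaps = list(find_gaps("scaffold_1", "ACGTNNNNNACGTNNACGT"))
for gap in gaps:
    print(gap.scaffold, gap.start, gap.end, gap.end - gap.start)
```

A gap list of this kind is the natural denominator for the "addressed", "closed", and "improved" percentages reported above.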

Highlights

  • Genome finishing has become a lost art due to the expense of oligonucleotide directed Sanger sequencing relative to the low cost-per-base of second generation sequencing technologies

  • The first generation of large eukaryotic model organism genome sequencing projects, such as Drosophila melanogaster [1], Caenorhabditis elegans [2], Arabidopsis thaliana [3], human [4], and mouse [5], all relied on a mapped bacterial artificial chromosome (BAC) approach

  • Sanger whole genome assemblies often used as few reads as possible, saving millions of dollars, but producing lower quality genomes using as little as 6× genome coverage, falling significantly short of the 10–15× required for high quality draft assemblies

Introduction

Genome finishing has become a lost art due to the expense of oligonucleotide-directed Sanger sequencing relative to the low cost per base of second-generation sequencing technologies. The first generation of large eukaryotic model organism genome sequencing projects, such as Drosophila melanogaster [1], Caenorhabditis elegans [2], Arabidopsis thaliana [3], human [4], and mouse [5], all relied on a mapped bacterial artificial chromosome (BAC) approach. In the BAC approach, individual mapped BACs were shotgun sequenced, assembled, and manually finished before being pieced together to create the final, finished reference genome. Because of the prohibitive cost and labor required for BAC library creation, arraying, mapping, and preparation of subclone libraries from tens of thousands of BACs, these techniques fell out of favor. They were replaced by significantly less expensive and less time-consuming whole genome assembly methods, which used relatively long (500–800 bp) shotgun Sanger reads with Overlap-Layout-Consensus assemblers [6,7,8,9,10,11]. Sanger whole genome assemblies often used as few reads as possible, saving millions of dollars but producing lower quality genomes from as little as 6× genome coverage, falling significantly short of the 10–15× required for high quality draft assemblies.
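
The coverage figures above follow from simple arithmetic: fold coverage is the total number of sequenced bases divided by the genome size. The sketch below computes it, along with the fraction of bases expected to receive no read under the idealized Lander-Waterman (Poisson) model. This is a simplification offered for illustration: the model assumes uniformly random reads, whereas real gaps also arise from the repeats, polymorphism, and sequencing biases described in the abstract. The function names and the example numbers are hypothetical.

```python
import math


def fold_coverage(n_reads: int, mean_read_len: float, genome_size: int) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Fraction of bases with zero reads under a Poisson model: exp(-coverage)."""
    return math.exp(-coverage)


# Hypothetical project: 1,000,000 Sanger reads of 600 bp on a 100 Mb genome.
cov = fold_coverage(n_reads=1_000_000, mean_read_len=600, genome_size=100_000_000)
print(f"coverage: {cov:.1f}x")
print(f"expected uncovered fraction: {expected_uncovered_fraction(cov):.4%}")
```

Even under this optimistic model, a small uncovered fraction multiplied across a large genome leaves many unsequenced positions, which is one reason low-coverage drafts contain gaps.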
