Abstract

BackgroundProblems associated with using draft genome assemblies are well documented and have become more pronounced with the use of short read data for de novo genome assembly. We set out to improve the draft genome assembly of the African cichlid fish, Metriaclima zebra, using a set of Pacific Biosciences SMRT sequencing reads corresponding to 16.5× coverage of the genome. Here we characterize the improvements that these long reads allowed us to make to the state-of-the-art draft genome previously assembled from short read data.ResultsOur new assembly closed 68 % of the existing gaps and added 90.6Mbp of new non-gap sequence to the existing draft assembly of M. zebra. Comparison of the new assembly to the sequence of several bacterial artificial chromosome clones confirmed the accuracy of the new assembly. The closure of sequence gaps revealed thousands of new exons, allowing significant improvement in gene models. We corrected one known misassembly, and identified and fixed other likely misassemblies. 63.5 Mbp (70 %) of the new sequence was classified as repetitive and the new sequence allowed for the assembly of many more transposable elements.ConclusionsOur improvements to the M. zebra draft genome suggest that a reasonable investment in long reads could greatly improve many comparable vertebrate draft genome assemblies.Electronic supplementary materialThe online version of this article (doi:10.1186/s12864-015-1930-5) contains supplementary material, which is available to authorized users.

Highlights

  • Problems associated with using draft genome assemblies are well documented and have become more pronounced with the use of short read data for de novo genome assembly

  • One technique is the use of paired-end and mate-pair jumping libraries. The power of this technique was demonstrated when a usable human draft genome assembly was produced using a combination of differently sized short read jumping libraries (180 bp to 40 kb) with the ALLPATHS-LG assembler [10]

  • Recognizing Errors in Assemblies using Paired Reads (REAPR) predicted a misassembly at this gap for the other 5 jumping libraries either because they did not have spanning mate-pairs, had mate-pairs improperly oriented, and/or had mate-pairs aligning at a distance much different than the expected insert size

Read more

Summary

Introduction

Problems associated with using draft genome assemblies are well documented and have become more pronounced with the use of short read data for de novo genome assembly. Advances in high-throughput genome sequencing have allowed relatively inexpensive genome projects to be conducted for almost any organism. Many different molecular biology and computational techniques have been developed that attempt to circumvent the problems associated with short read length, while keeping the cost of genome sequencing projects low. One technique is the use of paired-end and mate-pair jumping libraries The power of this technique was demonstrated when a usable human draft genome assembly was produced using a combination of differently sized short read jumping libraries (180 bp to 40 kb) with the ALLPATHS-LG assembler [10]

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.