Abstract

The abundant repetitive sequences in complex eukaryotic genomes cause fragmented assemblies, which lose value as reference genomes, often due to incomplete gene sequences and unanchored or mispositioned contigs on chromosomes. Here we report a genome assembly method, HERA, which resolves repeats efficiently by constructing a connection graph from an overlap graph. We test HERA on the genomes of rice, maize, human, and Tartary buckwheat with single-molecule sequencing and mapping data. HERA correctly assembles most of the previously unassembled regions, resulting in dramatically improved, highly contiguous genome assemblies with newly assembled gene sequences. For example, the maize contig N50 size reaches 61.2 Mb and the Tartary buckwheat genome comprises only 20 contigs. HERA can also be used to fill gaps and fix errors in reference genomes. The application of HERA will greatly improve the quality of new or existing assemblies of complex genomes.
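
The contig N50 quoted above is a standard contiguity metric: the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length. As a minimal sketch of that definition, assuming nothing about how HERA itself reports it, the function name `n50` and the toy contig lengths below are made up for illustration:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length of the contig at which the running
    total first reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), not data from the paper.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70

# Joining the two longest contigs into one raises the N50.
merged_contigs = [150, 50, 40, 30, 20, 10]
print(n50(merged_contigs))  # 150
```

The second call shows why joining contigs across a resolved repeat raises the N50 even though the total assembled length stays the same.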

Highlights

  • The abundant repetitive sequences in complex eukaryotic genomes cause fragmented assemblies, which lose value as reference genomes, often due to incomplete gene sequences and unanchored or mispositioned contigs on chromosomes

  • We report a highly efficient assembly method, HERA, for resolving repetitive sequences, a central objective of all genome assemblers

  • We demonstrate that HERA can dramatically improve the contiguity and completeness of genome assemblies by assembling previously unassembled repeats, including many tandemly repetitive sequences

Summary

Introduction

The abundant repetitive sequences in complex eukaryotic genomes cause fragmented assemblies, which lose value as reference genomes, often due to incomplete gene sequences and unanchored or mispositioned contigs on chromosomes. Improvements in sequencing read lengths to tens of kb by single-molecule sequencing (SMS) technologies from Pacific Biosciences[2] (PacBio) and, most recently, from Oxford Nanopore[3] have enabled the assembly of many complex eukaryotic genomes. These assemblies are nevertheless still fragmented: the resulting draft genomes are incomplete and usually consist of thousands of contigs, with many regions left unresolved because of segmental duplications or other complex repeats[4,5,6]. Current assembly approaches assemble unique sequences reliably, but repeats longer than the read length lead to branching paths and break the assembly into fragmented contigs. Single-molecule mapping, in turn, cannot increase contig lengths, and it often leaves unfilled gaps of up to hundreds of kb and many unmapped contigs because of a lack of labeling enzyme recognition sites.
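
To illustrate why repeats longer than the read length produce branching paths, the following toy sketch builds a simplified overlap graph from error-free reads. It is not HERA's algorithm: it assumes fixed-length reads starting at every position and keeps only overlaps of read length minus one, which makes it equivalent to a de Bruijn-style graph. The genome string, the 8-base repeat, and all function names are hypothetical.

```python
from collections import defaultdict


def reads_from(genome, read_len):
    """Every substring of length read_len: error-free reads at each position."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


def overlap_graph(reads):
    """Map each read to its successors.

    An edge u -> v exists when the last (len - 1) characters of u are the
    first (len - 1) characters of v. Only these maximal overlaps are kept.
    """
    by_prefix = defaultdict(set)
    for read in reads:
        by_prefix[read[:-1]].add(read)
    return {read: by_prefix.get(read[1:], set()) for read in reads}


def branching_reads(graph):
    """Reads with more than one possible successor (ambiguous extension)."""
    return sorted(read for read, succ in graph.items() if len(succ) > 1)


# Toy genome: the same 8-base repeat occurs twice between unique flanks.
repeat = "TTGACCTG"
toy_genome = "ACGA" + repeat + "CATC" + repeat + "GGAT"

# Reads shorter than the repeat: the path leaving the repeat is ambiguous.
print(branching_reads(overlap_graph(reads_from(toy_genome, 5))))   # ['ACCTG']

# Reads long enough to span the repeat plus flanking bases: no branching.
print(branching_reads(overlap_graph(reads_from(toy_genome, 10))))  # []
```

With 5-base reads, the read `ACCTG` at the end of the repeat can be extended toward either copy's flank, so an assembler working only from this graph would have to stop the contig there. With 10-base reads, every overlap includes unique sequence and the graph is a single path.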
