Abstract

Background: De novo assembly of large genomes, such as those of conifers (~ 12–30 Gbp), which also consist of ~ 80% repetitive DNA, is a very complex and computationally intensive endeavor. One of the main problems in assembling such genomes lies in the computational limitations of nucleotide sequence assembly programs (DNA assemblers). As a rule, modern assemblers are designed to assemble genomes no longer than the human genome (3.24 Gbp). Most assemblers cannot handle the amount of input sequence data required to provide the coverage needed for a high-quality assembly.

Results: This paper presents an original stepwise method of de novo assembly by parts (sets), which makes it possible to bypass the limitations of modern assemblers associated with the huge amount of data being processed. Numerical assembly experiments conducted using the model plant Arabidopsis thaliana, Prunus persica (peach), and the four most popular assemblers (ABySS, SOAPdenovo, SPAdes, and CLC Assembly Cell) showed the validity and effectiveness of the proposed stepwise assembly method.

Conclusion: Using the new stepwise de novo assembly method presented in the paper, the genome of Siberian larch, Larix sibirica Ledeb. (12.34 Gbp), was completely assembled de novo by the CLC Assembly Cell assembler. It is the first genome assembly for a larch species, in addition to only five other conifer genomes sequenced and assembled: Picea abies, Picea glauca, Pinus taeda, Pinus lambertiana, and Pseudotsuga menziesii var. menziesii.
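To give a sense of the data volumes involved, the short sketch below estimates how many sequenced bases are needed to reach a given mean coverage for the two genome sizes quoted in the abstract. The 50x coverage value and the function name `required_bases_gbp` are illustrative assumptions, not figures or code from the paper.

```python
def required_bases_gbp(genome_size_gbp: float, coverage: float) -> float:
    """Total sequenced bases (in Gbp) needed for a given mean coverage depth.

    Uses the standard relation: total bases = coverage * genome size.
    """
    if genome_size_gbp <= 0 or coverage <= 0:
        raise ValueError("genome size and coverage must be positive")
    return genome_size_gbp * coverage


# Genome sizes taken from the abstract (Gbp).
GENOMES_GBP = {"human": 3.24, "Larix sibirica": 12.34}

# Hypothetical target coverage, chosen only for illustration.
ASSUMED_COVERAGE = 50.0

if __name__ == "__main__":
    for name, size in GENOMES_GBP.items():
        total = required_bases_gbp(size, ASSUMED_COVERAGE)
        print(f"{name}: {total:.1f} Gbp of reads at {ASSUMED_COVERAGE:.0f}x")
```

Under this assumed coverage, the larch genome needs roughly four times as much input data as a human-sized genome, which is the kind of volume the abstract says most assemblers cannot handle.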

Highlights

  • De novo assembly of large genomes, such as those of conifers (~ 12–30 Gbp), which consist of ~ 80% repetitive DNA, is a very complex and computationally intensive endeavor

  • A fifth set of reads was added to the analysis. This set included all reads, but the PE and MPE reads were decoupled and used as single reads. This set was generated because we found experimentally that the CLC Assembly Cell assembler was able to process the entire volume of the L. sibirica sequence data, but only if the insert length was not specified

  • Using the new stepwise de novo assembly method presented in the paper, the genome of Siberian larch, Larix sibirica Ledeb. (12.34 Gbp), was for the first time completely assembled de novo by the CLC Assembly Cell assembler
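The decoupling of PE and MPE reads mentioned in the highlights can be illustrated with a minimal sketch: each read pair is flattened into two independent single reads, so that no insert-length information is passed on to the assembler. The `Read` type, the `/1` and `/2` name suffixes, and the function names are assumptions made for this example; the paper's actual tooling is not described here.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Read(NamedTuple):
    """A single sequencing read (hypothetical minimal FASTQ-like record)."""
    name: str
    seq: str
    qual: str


def decouple_pairs(pairs: Iterable[Tuple[Read, Read]]) -> Iterator[Read]:
    """Yield both mates of every pair as independent single reads.

    The mates get distinct names so they no longer look like a pair.
    """
    for mate1, mate2 in pairs:
        yield Read(f"{mate1.name}/1", mate1.seq, mate1.qual)
        yield Read(f"{mate2.name}/2", mate2.seq, mate2.qual)


def to_fastq(read: Read) -> str:
    """Format one read as a four-line FASTQ record."""
    return f"@{read.name}\n{read.seq}\n+\n{read.qual}\n"


if __name__ == "__main__":
    example_pairs = [
        (Read("frag1", "ACGT", "IIII"), Read("frag1", "TTGA", "IIII")),
    ]
    for single in decouple_pairs(example_pairs):
        print(to_fastq(single), end="")
```

The resulting records would then be supplied to the assembler as one set of unpaired reads, which trades away the pairing information in exchange for a data set the assembler can process in full.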

Introduction

The de novo assembly of large genomes, such as those of conifers, which are 12 to 30 Gbp long and consist of about 80% highly repetitive elements (repeats), is a rather complex task [1,2,3,4,5,6,7,8,9,10,11,12]. Modern assemblers are designed to assemble genomes shorter than or equal to the length of the …

Kuzmin et al. BMC Bioinformatics 2019, 20(Suppl 1):

