SLR: a scaffolding algorithm based on long reads and contig classification

Ranran Chen,Chaokun Yan,Huimin Luo,Xiaohong Zhang,Junwei Luo,Mengna Lyu

doi:10.1186/s12859-019-3114-9


Open Access

https://doi.org/10.1186/s12859-019-3114-9


Journal: BMC Bioinformatics	Publication Date: Oct 30, 2019
Citations: 17	License type: open-access

Affiliation: Henan Polytechnic University, Henan University

Abstract

Background: Scaffolding is an important step in genome assembly that orders and orients the contigs produced by assemblers. However, repetitive regions in contigs usually prevent scaffolding from producing accurate results. How to solve the problem of repetitive regions has received a great deal of attention. In the past few years, long reads sequenced by third-generation sequencing technologies (Pacific Biosciences and Oxford Nanopore) have been demonstrated to be useful for sequencing repetitive regions in genomes. Although some stand-alone scaffolding algorithms based on long reads have been presented, scaffolding still requires a new strategy to take full advantage of the characteristics of long reads.

Results: Here, we present a new scaffolding algorithm based on long reads and contig classification (SLR). Using the alignment information between long reads and contigs, SLR classifies the contigs into unique contigs and ambiguous contigs to address the problem of repetitive regions. Next, SLR uses only the unique contigs to produce draft scaffolds. Then, SLR inserts the ambiguous contigs into the draft scaffolds and produces the final scaffolds. We compare SLR to three popular scaffolding tools by using long-read datasets sequenced with Pacific Biosciences and Oxford Nanopore technologies. The experimental results show that SLR can produce better results in terms of accuracy and completeness. The open-source code of SLR is available at https://github.com/luojunwei/SLR.

Conclusion: In this paper, we describe SLR, which is designed to scaffold contigs using long reads. We conclude that SLR can improve the completeness of genome assembly.
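The pipeline outlined in the abstract (classify contigs, scaffold the unique ones, then insert the ambiguous ones) can be illustrated with a toy sketch. This is not SLR's implementation. The input format (each long read reduced to the ordered list of contig names it aligns to), the classification rule (a contig is treated as ambiguous when the reads place more than one distinct neighbour directly before or after it), and all function names are assumptions made for illustration. Orientation, gap sizes, alignment quality, and circular chromosomes are ignored.

```python
from collections import defaultdict


def classify_contigs(reads):
    """Split contigs into (unique, ambiguous) sets.

    Assumed toy rule: a contig is ambiguous if the reads show more than
    one distinct immediate successor or predecessor for it.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    contigs = set()
    for read in reads:
        contigs.update(read)
        for a, b in zip(read, read[1:]):
            succ[a].add(b)
            pred[b].add(a)
    ambiguous = {c for c in contigs if len(succ[c]) > 1 or len(pred[c]) > 1}
    return contigs - ambiguous, ambiguous


def draft_scaffolds(reads, unique):
    """Chain unique contigs that follow each other once ambiguous ones are dropped."""
    nxt, prv = {}, {}
    for read in reads:
        kept = [c for c in read if c in unique]
        for a, b in zip(kept, kept[1:]):
            # Keep only the first link seen; conflicting links are ignored here.
            if a not in nxt and b not in prv:
                nxt[a], prv[b] = b, a
    scaffolds = []
    for start in sorted(unique):
        if start in prv:
            continue  # not the head of a chain (purely circular chains are skipped)
        chain = [start]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


def insert_ambiguous(scaffold, reads, ambiguous):
    """Re-insert ambiguous contigs that a read places between two scaffold neighbours."""
    final = [scaffold[0]]
    for a, b in zip(scaffold, scaffold[1:]):
        for read in reads:
            if a in read and b in read:
                i, j = read.index(a), read.index(b)
                between = read[i + 1:j]
                if i < j and all(c in ambiguous for c in between):
                    final.extend(between)
                    break
        final.append(b)
    return final


# Toy genome A-R-B-R-C, where R is a repeat covered by two long reads.
reads = [["A", "R", "B"], ["B", "R", "C"]]
unique, ambiguous = classify_contigs(reads)
draft = draft_scaffolds(reads, unique)
final = [insert_ambiguous(s, reads, ambiguous) for s in draft]
print(draft)  # [['A', 'B', 'C']]
print(final)  # [['A', 'R', 'B', 'R', 'C']]
```

In the toy example the repeat R has two different successors, so it is set aside while A, B and C are chained. It is then placed back into each gap where a single read spans both flanking unique contigs.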

Highlights

Scaffolding is an important step in genome assembly that orders and orients the contigs produced by assemblers
To evaluate the performance of SLR, we compared SLR with three popular scaffolding tools based on long reads, namely, SSPACE-LongRead (SSPACE-LR), LINKS and npScarf
For E. coli and S. cerevisiae, the evaluation uses two different long-read datasets, sequenced with Pacific Biosciences and Oxford Nanopore technologies, and two different contig sets assembled by different assemblers

Summary

Introduction

Scaffolding is an important step in genome assembly that orders and orients the contigs produced by assemblers. In the past few years, long reads sequenced by third-generation sequencing technologies (Pacific Biosciences and Oxford Nanopore) have been demonstrated to be useful for sequencing repetitive regions in genomes. In the field of de novo genome assembly, a large number of assembly tools based on third-generation sequencing technologies have been presented to resolve the most prominent problem: repetitive regions. The insert size of paired reads can reach a few thousand bases, so paired reads can partially resolve the problem of repetitive regions. Scaffolding tools such as OPERA [5], SSPACE [6], BESST [7], ScaffMatch [8], SCARPA [9],

Luo et al. BMC Bioinformatics (2019) 20:539

Methods

Results

Discussion

Conclusion