Abstract

Third-generation sequencing technologies can produce large numbers of long reads, which have been widely used in many fields. When using long reads for genome assembly, overlap detection between any pair of long reads is an important step. However, the sequencing error rate of third-generation sequencing technologies is very high, and obtaining accurate overlap detection results is still a challenging task. In this study, we present a long-read overlap detection (LROD) algorithm that can improve the accuracy of overlap detection results. To detect overlaps between two long reads, LROD first retains only the solid common k-mers between them. These k-mers can simplify the process of overlap detection. Second, LROD finds a chain (i.e., candidate overlap) that includes the consistent common k-mers. In this step, LROD proposes a two-stage strategy to evaluate whether two common k-mers are consistent. Finally, LROD uses a novel strategy to determine whether the candidate overlaps are true and to revise them. To verify the performance of LROD, three simulated and three real long-read datasets are used in the experiments. Compared with two other popular methods (MHAP and Minimap2), LROD can achieve good performance in terms of the F1-score, precision and recall. LROD is available from https://github.com/luojunwei/LROD.

Highlights

  • Sequencing technologies fragment the genome into a large number of reads, and the process of recombining these reads into a complete DNA sequence is called genome assembly (Nagarajan and Pop, 2013; Ding and Guo, 2018)

  • Using only solid k-mers allows long-read overlap detection (LROD) to avoid some problems caused by sequencing errors and repetitive regions

  • For the real human dataset, when k = 13, Minimap2 and LROD did not end with 10 threads after 10 days, and the memory requirement of MinHash alignment process (MHAP) was larger than the memory capacity of our computer (128 GB)

Read more

Summary

Introduction

Sequencing technologies fragment the genome into a large number of reads, and the process of recombining these reads into a complete DNA sequence is called genome assembly (Nagarajan and Pop, 2013; Ding and Guo, 2018). Compared with NGS, third-generation sequencing (TGS) technologies (Schadt et al, 2010), such as single-molecule real-time technology (SMRT) (Levene et al, 2003) and Oxford Nanopore technology (ONT) (Stoddart et al, 2009), can produce longer reads with an average length of 10 kb, with many exceeding 100 kb. This longread length is sufficient to span most repetitive areas.

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call