LROD: An Overlap Detection Algorithm for Long Reads Based on k-mer Distribution.

Junwei Luo, Chaokun Yan, Zhanqiang Huo, Huimin Luo, Xiaohong Zhang, Yan Wang, Ranran Chen

doi:10.3389/fgene.2020.00632


Open Access

https://doi.org/10.3389/fgene.2020.00632


Abstract

Third-generation sequencing technologies can produce large numbers of long reads, which have been widely used in many fields. When using long reads for genome assembly, overlap detection between any pair of long reads is an important step. However, the sequencing error rate of third-generation sequencing technologies is very high, and obtaining accurate overlap detection results is still a challenging task. In this study, we present a long-read overlap detection (LROD) algorithm that can improve the accuracy of overlap detection results. To detect overlaps between two long reads, LROD first retains only the solid common k-mers between them. These k-mers can simplify the process of overlap detection. Second, LROD finds a chain (i.e., candidate overlap) that includes the consistent common k-mers. In this step, LROD proposes a two-stage strategy to evaluate whether two common k-mers are consistent. Finally, LROD uses a novel strategy to determine whether the candidate overlaps are true and to revise them. To verify the performance of LROD, three simulated and three real long-read datasets are used in the experiments. Compared with two other popular methods (MHAP and Minimap2), LROD can achieve good performance in terms of the F1-score, precision and recall. LROD is available from https://github.com/luojunwei/LROD.
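The chaining step described in the abstract can be illustrated with a minimal sketch. This is not LROD's implementation: the function names, the `max_gap` and `max_drift` thresholds, and the single-pass greedy chaining are all assumptions made for illustration, and the abstract's two-stage consistency test is reduced here to one simple spacing check. K-mers that repeat within a read are dropped as a simplification, and reverse-complement matches are ignored.

```python
from collections import defaultdict


def kmer_positions(read, k):
    """Map each k-mer in `read` to the list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(read) - k + 1):
        positions[read[i:i + k]].append(i)
    return positions


def common_kmer_hits(read_a, read_b, k):
    """Return sorted (pos_a, pos_b) pairs for k-mers shared by both reads."""
    pos_a = kmer_positions(read_a, k)
    pos_b = kmer_positions(read_b, k)
    hits = []
    for kmer, a_list in pos_a.items():
        b_list = pos_b.get(kmer)
        # Simplification (assumption): keep k-mers seen exactly once per read.
        if b_list and len(a_list) == 1 and len(b_list) == 1:
            hits.append((a_list[0], b_list[0]))
    return sorted(hits)


def is_consistent(prev, cur, max_gap, max_drift):
    """Hypothetical check: hits are collinear and similarly spaced in both reads."""
    da = cur[0] - prev[0]
    db = cur[1] - prev[1]
    if da <= 0 or db <= 0:
        return False  # order differs between the two reads
    if max(da, db) > max_gap:
        return False  # hits are too far apart to belong to one chain
    # Indels make the spacings differ; tolerate a bounded relative drift.
    return abs(da - db) <= max_drift * max(da, db)


def greedy_chain(hits, max_gap=500, max_drift=0.3):
    """Build one chain by extending it with each hit consistent with the last."""
    chain = []
    for hit in hits:
        if not chain or is_consistent(chain[-1], hit, max_gap, max_drift):
            chain.append(hit)
    return chain


if __name__ == "__main__":
    core = "ACGTCAGTTGCA"
    read_a = "TTTT" + core   # overlap lies at the end of read_a
    read_b = core + "GGGG"   # and at the start of read_b
    chain = greedy_chain(common_kmer_hits(read_a, read_b, k=4))
    print(chain[0], chain[-1])
```

The first and last hits of the chain bound a candidate overlap region in each read. A real detector would still need to verify and revise that candidate, as the abstract notes.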

Highlights

Sequencing technologies fragment the genome into a large number of reads, and the process of recombining these reads into a complete DNA sequence is called genome assembly (Nagarajan and Pop, 2013; Ding and Guo, 2018).
Using only solid k-mers allows long-read overlap detection (LROD) to avoid some problems caused by sequencing errors and repetitive regions.
For the real human dataset, when k = 13, Minimap2 and LROD had not finished after 10 days of running with 10 threads, and the memory requirement of the MinHash alignment process (MHAP) exceeded the memory capacity of our computer (128 GB).
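To make the idea of solid k-mers concrete, here is a minimal sketch of frequency-based filtering. It assumes that "solid" means a k-mer whose count across the read set falls within a chosen range: very rare k-mers are likely sequencing errors, and very frequent ones likely come from repeats. The page does not give LROD's exact definition or thresholds, so `min_count`, `max_count`, and the function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_count=2, max_count=50):
    """Keep k-mers whose count lies in [min_count, max_count] (assumed bounds)."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, c in counts.items() if min_count <= c <= max_count}


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG", "TTTTTT"]
    # "TTT" occurs four times and is filtered out as repeat-like.
    print(sorted(solid_kmers(reads, k=3, min_count=2, max_count=3)))
```

This sketch counts k-mers on the forward strand only. A practical tool would typically also account for reverse complements and use a memory-efficient counter.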

Summary

Introduction

Sequencing technologies fragment the genome into a large number of reads, and the process of recombining these reads into a complete DNA sequence is called genome assembly (Nagarajan and Pop, 2013; Ding and Guo, 2018). Compared with next-generation sequencing (NGS), third-generation sequencing (TGS) technologies (Schadt et al., 2010), such as single-molecule real-time (SMRT) technology (Levene et al., 2003) and Oxford Nanopore technology (ONT) (Stoddart et al., 2009), can produce longer reads with an average length of 10 kb, with many exceeding 100 kb. This read length is sufficient to span most repetitive regions.



Journal: Frontiers in Genetics
Publication Date: Jul 29, 2020
Citations: 3
License type: CC BY 4.0
