Improving the sensitivity of long read overlap detection using grouped short k-mer matches

Nan Du, Yanni Sun, Jiao Chen

doi:10.1186/s12864-019-5475-x


Open Access



Abstract

Background

Single-molecule, real-time sequencing (SMRT) developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize intra-species variation. It also holds the promise of deciphering the community structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads that overlap. Because PacBio data have a higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies.

Results

In this work, we designed and implemented an overlap detection program named GroupK for third-generation sequencing reads, based on grouped k-mer hits. While several existing programs use k-mer hits to detect overlaps between reads, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hits were originally designed for homology search. We are the first to apply group hits to long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage.

Conclusions

GroupK is best used for detecting small overlaps in third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
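
The idea of grouping short k-mer hits under distance constraints can be illustrated with a minimal Python sketch. This is not the GroupK implementation: the function names (`kmer_hits`, `group_hits`, `is_overlap_candidate`) are hypothetical, and the thresholds `max_gap`, `max_diag_shift`, and `min_hits` are placeholder assumptions standing in for the statistically derived constraints that the abstract mentions but does not specify. The sketch collects exact k-mer matches between two reads and greedily chains hits that are close together and lie on nearly the same diagonal.

```python
from collections import defaultdict


def kmer_hits(read_a, read_b, k):
    """Return sorted (pos_a, pos_b) pairs where both reads share an exact k-mer."""
    index = defaultdict(list)
    for i in range(len(read_a) - k + 1):
        index[read_a[i:i + k]].append(i)
    hits = []
    for j in range(len(read_b) - k + 1):
        for i in index.get(read_b[j:j + k], ()):
            hits.append((i, j))
    hits.sort()
    return hits


def group_hits(hits, max_gap, max_diag_shift):
    """Greedily chain hits into groups (illustrative constraints only).

    A hit extends a group when, relative to the group's last hit, it is
    further along in both reads, at most max_gap bases later in read_a,
    and its diagonal (pos_a - pos_b) moved by at most max_diag_shift.
    """
    groups = []
    for pos_a, pos_b in hits:
        for group in groups:
            last_a, last_b = group[-1]
            same_direction = pos_a > last_a and pos_b > last_b
            close_enough = pos_a - last_a <= max_gap
            diag_shift = abs((pos_a - pos_b) - (last_a - last_b))
            if same_direction and close_enough and diag_shift <= max_diag_shift:
                group.append((pos_a, pos_b))
                break
        else:
            # No existing group accepts this hit, so it starts a new one.
            groups.append([(pos_a, pos_b)])
    return groups


def is_overlap_candidate(groups, min_hits):
    """Report a candidate overlap if any group has at least min_hits hits."""
    return any(len(group) >= min_hits for group in groups)


if __name__ == "__main__":
    # Two toy reads sharing a 12-base region with one substitution.
    read_a = "TTTTTACGTGCATGGCA"
    read_b = "ACGTGCTTGGCACCCCC"
    hits = kmer_hits(read_a, read_b, k=4)
    groups = group_hits(hits, max_gap=10, max_diag_shift=2)
    print(groups, is_overlap_candidate(groups, min_hits=4))
```

In this toy setting, a single long k-mer spanning the substitution would not match, whereas several short k-mers on either side of it still form one group. That is the intuition behind using grouped short hits for noisy reads; the real constraints and scoring in GroupK may differ substantially.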

Highlights

Single-molecule, real-time sequencing (SMRT) developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina.
For third-generation sequencing data, the high error rate and low coverage make the overlap graph a sensible choice for genome assembly [7].
Due to high error rates, existing short read overlap detection software using the Burrows-Wheeler transform (BWT) or hash tables [10, 11] cannot be directly applied to long reads.

Summary

Introduction

Single-molecule, real-time sequencing (SMRT) developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina. Because PacBio data have a higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. There is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies. The increased read length enables third-generation sequencing to close gaps in genome assembly [1, 2], reveal structural variations [3], and quantify gene isoforms with higher accuracy [4] in transcriptomic sequencing. For third-generation sequencing data, the high error rate and low coverage make the overlap graph a sensible choice for genome assembly [7]. Due to high error rates, existing short read overlap detection software using the Burrows-Wheeler transform (BWT) or hash tables [10, 11] cannot be directly applied to long reads.



Journal: BMC Genomics	Publication Date: Apr 1, 2019
Citations: 6	License type: open-access


