KmerStream: streaming algorithms for k-mer abundance estimation

Páll Melsted, Bjarni V Halldórsson

doi:10.1093/bioinformatics/btu713

Abstract

Several applications in bioinformatics, such as genome assemblers and error correction methods, rely on counting and keeping track of k-mers (substrings of length k). Histograms of k-mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and genome size sampled in the sequencing experiment. We present KmerStream, a streaming algorithm for estimating the number of distinct k-mers present in high-throughput sequencing data. The algorithm runs in time linear in the size of the input, and its space requirement is logarithmic in the size of the input. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream. As an application, we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole-genome sequenced individuals and compare the error rate to the quality values reported by the sequencing equipment. We find that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for reported quality values.
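To make the idea of estimating distinct k-mers in a single pass more concrete, the following is a minimal Python sketch of a k-minimum-values (KMV) estimator. This is a generic textbook streaming technique used here for illustration only; it is not the KmerStream algorithm, and the abstract does not describe KmerStream's internals. The function names and the sketch size `m` are assumptions introduced for this example.

```python
import hashlib
import heapq


def kmers(seq, k):
    """Yield every substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Map a k-mer to a pseudo-random 64-bit integer."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_distinct(reads, k, m=256):
    """Estimate the number of distinct k-mers with a KMV sketch.

    Only the m smallest distinct hash values are kept (m >= 2 assumed),
    so memory depends on m rather than on the size of the input.
    """
    heap = []     # max-heap (values negated) holding the m smallest hashes
    kept = set()  # hashes currently in the heap, used to skip repeats
    for read in reads:
        for kmer in kmers(read, k):
            h = hash64(kmer)
            if h in kept:
                continue
            if len(heap) < m:
                heapq.heappush(heap, -h)
                kept.add(h)
            elif h < -heap[0]:
                # Replace the current largest kept hash with the smaller one.
                evicted = -heapq.heappushpop(heap, -h)
                kept.discard(evicted)
                kept.add(h)
    if len(heap) < m:
        # Fewer than m distinct hashes were seen, so the count is exact.
        return float(len(heap))
    mth_smallest = -heap[0]
    # Standard KMV estimate: (m - 1) divided by the normalised m-th minimum.
    return (m - 1) * 2.0**64 / mth_smallest
```

For example, `estimate_distinct(["ACGTACGT"], k=4)` sees only four distinct 4-mers, fewer than `m`, and so returns the exact count. On larger inputs the relative error of this kind of sketch shrinks roughly as one over the square root of `m`.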


Journal: Bioinformatics	Publication Date: Oct 28, 2014
Citations: 62	License type: cc-by

Similar Papers

MaGuS: a tool for quality assessment and scaffolding of genome assemblies with Whole Genome Profiling™ Data
Mohammed-Amin Madoui ... Carole Dossat
BMC Bioinformatics | VOL. 17
03 Mar 2016

KmerEstimate
Sairam Behera ... Jitender S Deogun
15 Aug 2018

Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data.
Aarti Desai ... Abhay Jere
PLoS ONE | VOL. 8
12 Apr 2013

MultiGeMS: detection of SNVs from multiple samples using model selection on high-throughput sequencing data.
Gabriel H Murillo ... Xinping Cui
Bioinformatics | VOL. 32
18 Jan 2016
