Answering biological questions by querying k‐mer databases

Paul Greenfield, Uwe Roehm

doi:10.1002/cpe.2938

Abstract

SUMMARY: This paper describes a k‐mer approach to analysing DNA data and quickly answering certain types of ad hoc biological questions. These k‐mers (short DNA strings) are stored in a conventional relational database and indexed to support efficient exact match operations. We show that k‐mers around 20–25 bases long have interesting and useful uniqueness properties that can be used to compute a ‘relatedness’ metric and also allow k‐mers to be used as ‘unique enough’ tags to identify organisms and genes. This relatedness metric is used in SQL queries that can directly answer questions such as how two related species differ, and what genes are unique to an organism. The k‐mer tags have proven useful in applications, largely metagenomic ones that can quickly process large volumes of sequencing data to say something about what organisms and genes might be present in an environmental sample. All of this work is based on simple and fast exact matches of k‐mer strings using a database, rather than conventional alignment based on inexact matches of much longer strings. These k‐mer tools provide ways of rapidly exploring large genome spaces and handling large volumes of sequence data, and complement rather than replace existing alignment and assembly tools. Copyright © 2012 John Wiley & Sons, Ltd.
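
The abstract does not give the schema or the exact relatedness metric, so the following is only a minimal sketch of the general idea: k‐mers stored in an indexed relational table and compared by exact match in SQL. The table name `kmers`, its columns, the helper functions, and the "shared fraction" measure are all assumptions made for illustration, and the example uses SQLite with toy sequences and k = 4 rather than the 20–25 bases discussed above.

```python
import sqlite3


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_db(genomes, k):
    """Load the distinct k-mers of each organism into an in-memory table.

    The schema is hypothetical. The primary key on (kmer, organism)
    gives an index that supports exact-match lookups by k-mer.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE kmers ("
        " kmer TEXT NOT NULL,"
        " organism TEXT NOT NULL,"
        " PRIMARY KEY (kmer, organism))"
    )
    for name, seq in genomes.items():
        conn.executemany(
            "INSERT OR IGNORE INTO kmers (kmer, organism) VALUES (?, ?)",
            ((km, name) for km in kmers(seq, k)),
        )
    conn.commit()
    return conn


def shared_fraction(conn, a, b):
    """Fraction of a's distinct k-mers that also occur in b.

    An illustrative stand-in, not the paper's relatedness metric.
    """
    total = conn.execute(
        "SELECT COUNT(*) FROM kmers WHERE organism = ?", (a,)
    ).fetchone()[0]
    shared = conn.execute(
        "SELECT COUNT(*) FROM kmers AS x"
        " JOIN kmers AS y ON x.kmer = y.kmer"
        " WHERE x.organism = ? AND y.organism = ?",
        (a, b),
    ).fetchone()[0]
    return shared / total if total else 0.0


def unique_kmers(conn, a):
    """Return a's k-mers that occur in no other organism, sorted."""
    rows = conn.execute(
        "SELECT kmer FROM kmers WHERE organism = ?"
        " AND kmer NOT IN (SELECT kmer FROM kmers WHERE organism <> ?)"
        " ORDER BY kmer",
        (a, a),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Toy sequences; real genomes and k of 20-25 would be used in practice.
    toy = {"A": "ACGTACGTTT", "B": "ACGTACGAAA"}
    db = build_db(toy, k=4)
    print(shared_fraction(db, "A", "B"))
    print(unique_kmers(db, "A"))
```

Because every comparison is an equality join on an indexed column, the same queries could in principle be pointed at much larger tables; a k‐mer found in only one organism plays the role of a 'unique enough' tag in this sketch.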

Journal: Concurrency and Computation: Practice and Experience
Publication Date: Oct 11, 2012
Citations: 22
License type: unspecified-oa

Similar Papers

SAMSA2: a standalone metatranscriptome analysis pipeline
Samuel T Westreich ... Michelle L Treiber
BMC Bioinformatics | VOL. 19
21 May 2018

Analysis of plant microbe interactions in the era of next generation sequencing technologies.
Claudia Knief
Frontiers in Plant Science | VOL. 5
21 May 2014

WheatExp: an RNA-seq expression database for polyploid wheat
Stephen Pearce ... Jorge Dubcovsky
BMC Plant Biology | VOL. 15
01 Dec 2015

Accurate read-based metagenome characterization using a hierarchical suite of unique signatures.
Tracey Allen K Freitas ... Po-E Li
Nucleic Acids Research | VOL. 43
12 Mar 2015
