Abstract

This paper describes a k‐mer approach to analysing DNA data and quickly answering certain types of ad hoc biological questions. The k‐mers (short DNA strings) are stored in a conventional relational database and indexed to support efficient exact‐match operations. We show that k‐mers around 20–25 bases long have interesting and useful uniqueness properties that can be used to compute a ‘relatedness’ metric, and that also allow k‐mers to serve as ‘unique enough’ tags for identifying organisms and genes. The relatedness metric is used in SQL queries that directly answer questions such as how two related species differ and which genes are unique to an organism. The k‐mer tags have proven useful in applications, largely metagenomic ones, that quickly process large volumes of sequencing data to indicate which organisms and genes might be present in an environmental sample. All of this work is based on simple and fast exact matches of k‐mer strings using a database, rather than on conventional alignment based on inexact matches of much longer strings. These k‐mer tools provide ways of rapidly exploring large genome spaces and handling large volumes of sequence data, and they complement rather than replace existing alignment and assembly tools. Copyright © 2012 John Wiley & Sons, Ltd.
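
The idea of indexed exact-match k‐mer lookups can be illustrated with a minimal sketch. It is not the paper's implementation: the table layout, the function names (`kmers`, `build_db`, `shared_fraction`, `unique_kmers`), and the use of SQLite are all assumptions made for illustration, and `shared_fraction` is a simple stand-in rather than the paper's actual relatedness metric, which the abstract does not define. The toy value k = 4 keeps the example readable; the abstract discusses k around 20–25. Reverse complements are ignored.

```python
import sqlite3

K = 4  # toy value for illustration; the abstract discusses k of about 20-25


def kmers(seq, k=K):
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_db(genomes, k=K):
    """Load each organism's distinct k-mers into an in-memory table.

    The schema is hypothetical: one row per (organism, kmer) pair,
    plus an index on kmer so exact-match joins stay cheap.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE kmer ("
        " organism TEXT NOT NULL,"
        " kmer TEXT NOT NULL,"
        " PRIMARY KEY (organism, kmer))"
    )
    conn.execute("CREATE INDEX idx_kmer ON kmer (kmer)")
    for name, seq in genomes.items():
        conn.executemany(
            "INSERT INTO kmer (organism, kmer) VALUES (?, ?)",
            [(name, km) for km in kmers(seq, k)],
        )
    conn.commit()
    return conn


def shared_fraction(conn, a, b):
    """Fraction of a's distinct k-mers that also occur in b.

    An assumed stand-in for a relatedness measure, not the paper's metric.
    """
    total = conn.execute(
        "SELECT COUNT(*) FROM kmer WHERE organism = ?", (a,)
    ).fetchone()[0]
    shared = conn.execute(
        "SELECT COUNT(*) FROM kmer x JOIN kmer y ON x.kmer = y.kmer"
        " WHERE x.organism = ? AND y.organism = ?",
        (a, b),
    ).fetchone()[0]
    return shared / total if total else 0.0


def unique_kmers(conn, a):
    """K-mers found in organism a and in no other organism, sorted."""
    rows = conn.execute(
        "SELECT kmer FROM kmer WHERE organism = ?"
        " AND kmer NOT IN (SELECT kmer FROM kmer WHERE organism <> ?)"
        " ORDER BY kmer",
        (a, a),
    ).fetchall()
    return [r[0] for r in rows]
```

With two made-up sequences, `build_db({"A": "ACGTACGT", "B": "ACGTTTTT"})` yields four distinct 4-mers per organism, of which only `ACGT` is shared, so `shared_fraction(conn, "A", "B")` is 0.25 and `unique_kmers(conn, "A")` lists the three 4-mers specific to A. Both queries rely only on exact string equality, which is what lets an ordinary database index do the work.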
