Abstract
Background: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available.
Results: Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language data found Simrank to be 10X to 928X faster, depending on the dataset.
Conclusions: Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similarity.
Highlights
Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp
Indexing the list of institute names directly was impossible for Sequence Search and Alignment by Hashing Algorithm (SSAHA2), BLAST and megaBLAST, so an artificial conversion from language to DNA [25] was performed
Since BLAST constrains its results to only subregions of high similarity, it was run with parameter ‘-q -1’ to allow longer match regions and equitable comparison to Simrank
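The language-to-DNA conversion mentioned in the highlights can be illustrated with a small sketch. The scheme actually used in [25] is not described in this excerpt, so the mapping below (each UTF-8 byte becomes four bases, two bits per base) is only an assumption chosen for illustration, and the names `text_to_dna` and `dna_to_text` are hypothetical.

```python
BASES = "ACGT"  # assumed 2-bit alphabet: A=00, C=01, G=10, T=11


def text_to_dna(text):
    """Encode each UTF-8 byte as four bases, most significant bits first."""
    out = []
    for byte in text.encode("utf-8"):
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_text(dna):
    """Invert text_to_dna: read four bases at a time back into one byte."""
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")


encoded = text_to_dna("Institute of Biology")
print(encoded)
print(dna_to_text(encoded))
```

An encoding of this kind lets DNA-only tools index arbitrary text, at the cost of quadrupling string length. A tool that matches k-mers over arbitrary characters would not need the conversion.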
Summary
Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. A rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Molecular ecology methods often require the collection of thousands of polymer sequences (DNA, RNA or proteins) extracted from biological specimens (isolates or communities) followed by a similarity search of those sequences against one or more reference databases. A general-purpose open-source software tool to aid biologists in performing all the aforementioned tasks is not readily available. Cd-hit does not allow the k-mer search to be decoupled from the clustering, so it is not used as a general-purpose similarity reporting tool.
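The k-mer similarity search described above can be sketched in a few lines. This is a minimal illustration, not Simrank's implementation: the function names, the inverted-index layout and the scoring rule (the fraction of distinct query k-mers found in each database string) are all assumptions made for the example.

```python
from collections import defaultdict


def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_index(database, k):
    """Map each k-mer to the set of database names whose string contains it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def rank(query, index, k, top=3):
    """Rank database entries by the fraction of query k-mers they share (assumed score)."""
    q = kmers(query, k)
    if not q:
        return []  # query shorter than k: nothing to match
    shared = defaultdict(int)
    for kmer in q:
        for name in index.get(kmer, ()):
            shared[name] += 1
    scored = [(name, count / len(q)) for name, count in shared.items()]
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]


database = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTTTACGTTT",
    "seqC": "GGGGGGGGGG",
}
index = build_index(database, k=4)
for name, score in rank("ACGTACGA", index, k=4):
    print(name, round(score, 2))
```

Because the index is built once and each query only touches the k-mers it contains, the lookup cost grows with query length and hit count, not with database size. Nothing in the sketch depends on a DNA alphabet, so the same functions work on protein or natural-language strings.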