KAnalyze: a fast versatile pipelined K-mer toolkit

Peter Audano, Fredrik Vannberg

doi:10.1093/bioinformatics/btu152

Open Access

Abstract

Motivation: Converting nucleotide sequences into short overlapping fragments of uniform length, k-mers, is a common step in many bioinformatics applications. While existing software packages count k-mers, few are optimized for speed, offer an application programming interface (API) or a graphical interface, or contain features that make them extensible and maintainable. We designed KAnalyze to compete with the fastest k-mer counters, to produce reliable output and to support future development efforts through well-architected, documented and testable code. Currently, KAnalyze can output k-mer counts in a sorted tab-delimited file or stream k-mers as they are read. KAnalyze can process large datasets with 2 GB of memory. This project is implemented in Java 7, and the command line interface (CLI) is designed to integrate into pipelines written in any language.

Results: As a k-mer counter, KAnalyze outperforms Jellyfish, DSK and a pipeline built on Perl and Linux utilities. Through extensive unit and system testing, we have verified that KAnalyze produces the correct k-mer counts over multiple datasets and k-mer sizes.

Availability and implementation: KAnalyze is available on SourceForge: https://sourceforge.net/projects/kanalyze/

Contact: fredrik.vannberg@biology.gatech.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

K-merizing sequence data is a necessary step for many bioinformatics applications
The application programming interface (API) is fully annotated with Javadoc comments for every class, method and field
The web pages generated from the Javadoc comments are available to API developers, and the KAnalyze manual describes how to extend the API

Summary

INTRODUCTION

K-merizing sequence data is a necessary step for many bioinformatics applications. K-mer-based approaches are used to assemble reads, detect repeats, estimate read depth, identify protein binding sites (Newburger and Bulyk, 2009), find mutations in sequencing data (Nordstrom et al., 2013) and perform a variety of other tasks. If developers choose to rewrite k-mer code, there is an additional risk of introducing bugs that can affect results. This problem is compounded when algorithms become more complex, such as counting k-mers in large datasets with limited memory. We created KAnalyze as a fast, reusable k-mer toolkit capable of running on multiple platforms. The count module has a graphical mode for desktop use. Because it is designed for longevity, the project is organized, documented and tested. We ran tests on several datasets and compared the results with other k-mer software, including a Perl pipeline we built for verifying results. KAnalyze makes both speed and accuracy available to k-mer applications.
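To make the term concrete, the following is a minimal sketch of k-merizing a single sequence and counting the resulting k-mers in memory. It is not KAnalyze code and does not use the KAnalyze API. The class name `KmerCountSketch` and the method `countKmers` are hypothetical, and skipping windows that contain characters other than A, C, G or T is an assumption of this sketch, not a documented KAnalyze behavior. The sorted, tab-delimited output only mirrors the output format described in the abstract.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of KAnalyze.
public class KmerCountSketch {

    /**
     * Counts every overlapping k-mer in one sequence.
     * Assumption of this sketch: windows containing anything
     * other than A, C, G or T (for example N) are skipped.
     */
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // A TreeMap keeps the k-mers sorted lexicographically.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        String seq = sequence.toUpperCase();
        // Slide a window of length k across the sequence, one base at a time.
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            if (!kmer.matches("[ACGT]+")) {
                continue;
            }
            Integer old = counts.get(kmer);
            counts.put(kmer, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one "k-mer<TAB>count" line per distinct k-mer, in sorted order.
        for (Map.Entry<String, Integer> e : countKmers("ACGTACGT", 3).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the sequence `ACGTACGT` and k = 3, the windows are ACG, CGT, GTA, TAC, ACG and CGT, so the sketch reports ACG and CGT twice and GTA and TAC once. Holding every count in a single in-memory map is the simplest possible approach and does not scale to large datasets, which is the limited-memory problem the paragraph above describes.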

Pipelined components and modules

API and CLI

Count module algorithm

SOFTWARE TEST RESULTS

CONCLUSION