Abstract

The khmer software project provides both research and production functionality for large-scale nucleic-acid sequence analysis. The software implements several novel data structures and algorithms that perform data pre-filtering for common bioinformatics tasks, including sequence mapping and de novo assembly. Development is driven by a small lab with one full-time developer (MRC), as well as several graduate students and a professor (CTB) who contribute regularly to research features. Here we describe our efforts to bring better design, testing, and more open development to the khmer software project as of version 1.1. The khmer software is developed openly at http://github.com/dib-lab/khmer/.

Highlights

  • The khmer software was born from a need to more scalably analyze short fixed-length (20–30 character) words, or “k-mers”, in large DNA sequencing data sets

  • As data sets have grown in size, approaches to analyzing k-mers have fallen behind the memory and compute scaling curves. khmer provides several functions: approximate k-mer counting using a CountMin Sketch [10], an implementation of a compressible k-mer connectivity graph [8], and a streaming lossy compression algorithm for large data sets [2]

  • We have developed khmer as an open source project from the beginning: the software is under the BSD license, and we use GitHub for most development activities, including co-ordinating contributions, performing code review, and tagging releases

Introduction

The khmer software was born from a need to more scalably analyze short fixed-length (20–30 character) words, or “k-mers”, in large DNA sequencing data sets. khmer provides several functions: approximate k-mer counting using a CountMin Sketch [10], an implementation of a compressible k-mer connectivity graph [8], and a streaming lossy compression algorithm for large data sets [2]. These were first implemented as part of bioinformatics research publications but, owing to their broad utility, have since been used in several hundred data analysis publications. The main challenge for us in developing khmer has been to build a stable and reliable software project while simultaneously supporting an energetic research program in bioinformatics. This has traditionally been hard for small scientific labs due to many factors, including lack of expertise and lack of sustained funding.
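
To make the idea of approximate k-mer counting concrete, the following is a minimal, illustrative Python sketch of a CountMin Sketch applied to k-mers. It is not khmer's implementation or API: the names `CountMinSketch`, `kmers`, and `count_kmers`, the table sizes, and the choice of BLAKE2b as the hash function are all assumptions made for this example, and reverse complements are ignored for simplicity.

```python
import hashlib


class CountMinSketch:
    """Toy CountMin Sketch: `depth` counter tables of `width` slots each.

    Hypothetical illustration only; not the data structure used inside khmer.
    """

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One salted hash per table, so each table maps the item independently.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum across tables
        # is the tightest available estimate and never undercounts.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(sequences, k=20, width=10007, depth=4):
    """Feed all k-mers from the given sequences into a fresh sketch."""
    sketch = CountMinSketch(width, depth)
    for seq in sequences:
        for kmer in kmers(seq, k):
            sketch.add(kmer)
    return sketch
```

Memory use in this sketch is fixed by `width` and `depth` rather than by the number of distinct k-mers, which is the property that makes the approach attractive for large data sets. The trade-off is that reported counts may be overestimates when unrelated k-mers collide in every table.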
