Abstract

The sizable amount of data generated by high-throughput cell biology is increasing the demand on traditional computational tools in bioinformatics to handle large input datasets. Large sequence data sets create intractable search spaces that are beyond the scope of many conventional algorithms. One way to address this problem is to reduce large sequence data sets to the constituent parts that characterize the features of interest (e.g. transcription factor binding sites, miRNA sites, etc.) of the problem. These features of interest take the form of k-mers in a large subset of problems in computational biology. K-mers also play an implicit role in many other bioinformatics functions, from microarray probes to genomic compositional analysis. Given the increasing potential for k-mers in a wide spectrum of applications in bioinformatics, we present in this paper a set of fast and efficient generic algorithms for enumerating the occurrence frequencies of all substrings of a given length (k-mers) in whole genome sequences. We describe three algorithms of increasing complexity designed to deal with different k-mer lengths, from short (a couple of bases) to very long (tens of thousands of bases). They are memory-based algorithms that use advanced heuristics to efficiently process the large amounts of data that arise when analyzing very long genome sequences. The algorithms were tested for performance on the human and mouse genome sequences and on 681 bacterial and 50 archaeal genome sequences. Results are described for both time and space utilization. We also describe several experiments that demonstrate the utility of these algorithms. The algorithms can be downloaded from http://www2.kumc.edu/siddrc/bioinformatics/publication.html.
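To make the task concrete, the following is a minimal sketch of what "enumerating the occurrence frequencies of all substrings of a given length" means. It is a naive dictionary-based baseline, not one of the three algorithms presented in the paper; the function name `count_kmers` and the choice to skip windows containing non-ACGT characters are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence.

    Naive baseline for illustration only; windows containing characters
    other than A, C, G, T (for example N) are skipped, which is an
    assumption of this sketch rather than something the paper specifies.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGTAC", 3)
print(len(counts))           # number of distinct 3-mers
print(counts.most_common())  # each 3-mer with its occurrence frequency
```

A sequence of length n yields n - k + 1 windows, so this baseline runs in time proportional to n times k and stores one dictionary entry per distinct k-mer. That memory cost is what becomes limiting on whole genomes and for long k.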

Highlights

  • K-mers play an implicit but very important role in many applications in computational biology as they form the characterizing unit of many interesting DNA sequences

  • K-mers play an implicit role in many other bioinformatics functions, from microarray probes to genomic compositional analysis. Given the increasing potential for k-mers in a wide spectrum of applications in bioinformatics, we present in this paper a set of fast and efficient generic algorithms for enumerating the occurrence frequencies of all substrings of a given length in whole genome sequences

  • A transcription factor can be represented as a subset of 8-mers, 10-mers, or some other k-mer; microarray probes can be defined as a collection of 25-mers or some other suitable k-mer of choice; and so on
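
As a rough illustration of the last highlight, a feature represented as a set of k-mers can be located in a sequence by testing each window for set membership. The function name `find_kmer_hits` and the two 8-mers below are hypothetical placeholders chosen for this sketch; they are not taken from the paper.

```python
def find_kmer_hits(sequence, kmer_set):
    """Return (position, k-mer) pairs where a k-mer from kmer_set occurs.

    Positions are 0-based. All k-mers in kmer_set must have the same length.
    """
    lengths = {len(kmer) for kmer in kmer_set}
    if len(lengths) != 1:
        raise ValueError("kmer_set must be non-empty and use a single length")
    k = lengths.pop()
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in kmer_set:
            hits.append((i, window))
    return hits


# Hypothetical 8-mers standing in for a binding-site model (illustrative only).
site_kmers = {"TGACGTCA", "TGACGTAA"}
print(find_kmer_hits("ccTGACGTCAggTGACGTAAtt", site_kmers))
```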



Introduction

K-mers play an implicit but very important role in many applications in computational biology, as they form the characterizing unit of many interesting DNA sequences. The algorithms that we present in this paper are generic and can assist with, or be incorporated into, any of the many applications that use k-mers in their analysis. With the advent of large-scale genome sequencing projects (over 4600 completed or ongoing genome sequencing projects worldwide, [1]), there has been an exponential growth in the amount of whole genome sequence data added to the literature (over 900 fully sequenced genomes to date, [1]). With this growth comes an increasing demand for efficient computational tools for analyzing this data. Counting the number of unique k-mers in a sequence is one such tool widely used in genomic analysis. K-mer analysis has wide-ranging applications varying from whole genome

