Optimum-time, Optimum-space, Algorithms for k-mer Analysis of Whole Genome Sequences

Sumedha Gunewardena

doi:10.17303/jbcg.2014.1.101

Abstract

The large volume of data generated by high-throughput cell biology is increasing the demand on traditional computational tools in bioinformatics to handle large input datasets. Large sequence datasets create intractable search spaces that are beyond the scope of many conventional algorithms. One way to address this problem is to transform large sequence datasets into the constituent parts that characterize the features of interest (e.g. transcription factor binding sites, miRNA sites, etc.) of the problem. In a large subset of problems in computational biology, these features of interest take the form of k-mers. K-mers also play an implicit role in many other bioinformatics functions, from microarray probes to genomic compositional analysis. Given the increasing potential for k-mers in a wide spectrum of applications in bioinformatics, we present in this paper a set of fast and efficient generic algorithms for enumerating the occurrence frequencies of all substrings of a given length (k-mers) in whole genome sequences. We describe three algorithms of increasing complexity designed to deal with different k-mer lengths, from short (a couple of bases) to very long (tens of thousands of bases). They are memory-based algorithms that use advanced heuristics to efficiently process the large amounts of data that arise when analyzing very long genome sequences. The algorithms were tested for performance on the human and mouse genome sequences and on 681 bacterial and 50 archaeal genome sequences. Results are described for both time and space utilization. We also describe several experiments that demonstrate the utility of these algorithms. The algorithms can be downloaded from http://www2.kumc.edu/siddrc/bioinformatics/publication.html.
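To make the task concrete, the following is a minimal sketch of what "enumerating the occurrence frequencies of all substrings of a given length" means. It is a plain hash-table baseline and not one of the three algorithms the paper describes; the function name `count_kmers` and the choice to skip windows containing bases outside A, C, G, T are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring of `sequence` drawn from A, C, G, T.

    Hypothetical helper for illustration only; not the paper's method.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    alphabet = set("ACGT")
    counts = Counter()
    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGTNACG", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A dictionary of strings like this is simple but stores every distinct k-mer as a separate object, which is exactly the kind of time and space overhead that becomes a concern at whole-genome scale.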

Highlights

K-mers play an implicit but very important role in many applications in computational biology, as they form the characterizing unit of many interesting DNA sequences.
K-mers play an implicit role in many other bioinformatics functions, from microarray probes to genomic compositional analysis. Given the increasing potential for k-mers in a wide spectrum of applications in bioinformatics, we present in this paper a set of fast and efficient generic algorithms for enumerating the occurrence frequencies of all substrings of a given length in whole genome sequences.
A transcription factor can be represented as a subset of 8-mers, 10-mers, or some other k-mer; microarray probes can be defined as a collection of 25-mers or some other suitable k-mer of choice; and so on.

Summary

Introduction

K-mers play an implicit but very important role in many applications in computational biology, as they form the characterizing unit of many interesting DNA sequences. The algorithms that we present in this paper are generic and can assist with, or be incorporated into, any of the many applications that use k-mers in their analysis. With the advent of large-scale genome sequencing projects (over 4600 completed or ongoing genome sequencing projects worldwide [1]), there has been an exponential growth in the amount of whole genome sequence data added to the literature (over 900 fully sequenced genomes to date [1]). With this growth comes an increasing demand for efficient computational tools for analyzing this data. Counting the number of unique k-mers in a sequence is one such tool widely used in genomic analysis. K-mer analysis has wide-ranging applications varying from whole genome
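As an illustration of counting unique k-mers for short k, the sketch below packs each k-mer into an integer using 2 bits per base and uses that integer to index a table of 4^k counters. This is a widely used general technique, shown here under stated assumptions; it is not claimed to be any of the paper's algorithms. The names `kmer_count_table` and `decode`, and the decision to restart the window at any base outside A, C, G, T, are hypothetical choices for this example.

```python
from array import array

# 2-bit code for each unambiguous base.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_count_table(sequence: str, k: int) -> array:
    """Return 4**k counters indexed by the 2-bit-packed k-mer (illustrative)."""
    if k <= 0:
        raise ValueError("k must be positive")
    mask = (1 << (2 * k)) - 1          # keeps only the last k bases
    table = array("L", [0]) * (4 ** k)  # one counter per possible k-mer
    packed = 0                          # rolling 2-bit encoding of the window
    valid = 0                           # unambiguous bases seen in a row
    for base in sequence.upper():
        code = CODE.get(base)
        if code is None:
            # Assumption: an ambiguous base (e.g. N) restarts the window.
            packed, valid = 0, 0
            continue
        packed = ((packed << 2) | code) & mask
        valid += 1
        if valid >= k:
            table[packed] += 1
    return table


def decode(index: int, k: int) -> str:
    """Turn a packed table index back into its k-mer string."""
    return "".join("ACGT"[(index >> (2 * (k - 1 - j))) & 3] for j in range(k))


k = 3
table = kmer_count_table("ACGTACGTNACG", k)
unique = sum(1 for c in table if c > 0)
print("unique k-mers:", unique)
for idx, c in enumerate(table):
    if c:
        print(decode(idx, k), c)
```

Each base is processed once, so the scan is linear in sequence length, but the table grows as 4^k. That makes this layout practical only for short k; longer k-mers need a different data structure.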

Journal: jbcg	Publication Date: Jan 1, 2019
License type: cc-by
