Abstract

Background: DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms, such as motif discovery algorithms, depend on the quality of background information about the sequences, it is necessary to develop an ab initio algorithm that extracts the “words” from the DNA sequences alone.

Methods: We considered non-uniform distribution and integrity to be two important features of a word, and on this basis we developed an ab initio algorithm to extract “DNA words” that have potential functional meaning. A Kolmogorov-Smirnov test was used to test the consistency of DNA sequences with a uniform distribution, and integrity was judged by sequence and position alignment. Two random base sequences were adopted as negative controls, and an English book was used as a positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods.

Results: The results provide strong evidence that the algorithm is a promising tool for building a DNA dictionary ab initio.

Conclusions: Our method provides a fast way to screen for important DNA elements at large scale and offers potential insights into the understanding of a genome.

Electronic supplementary material: The online version of this article (doi:10.1186/s12976-016-0028-3) contains supplementary material, which is available to authorized users.

Highlights

  • DNA sequence can be viewed as an unknown language with words as its functional units

  • Many sequence alignment algorithms, such as motif discovery algorithms [1, 2], have been developed to identify such words. These algorithms are limited in two ways: 1) their performance depends on the quality of available background information about the sequences, that is, the extent of knowledge about biological function [1]; and 2) they cannot analyze genomic regions with unknown functions

  • We developed an algorithm to extract meaningful DNA words based on these two features: non-uniformity and integrity



Introduction

DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms, such as motif discovery algorithms, depend on the quality of background information about the sequences, it is necessary to develop an ab initio algorithm that extracts the “words” from the DNA sequences alone. Many sequence alignment algorithms, such as motif discovery algorithms [1, 2], have been developed to identify such words. These algorithms are limited in two ways: 1) their performance depends on the quality of available background information about the sequences, that is, the extent of knowledge about biological function [1]; and 2) they cannot analyze genomic regions with unknown functions. Some ab initio methods have been proposed in the literature, based on, for example, k-mers [3], relative entropy [4], and information content [5,6,7,8].
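The non-uniformity criterion mentioned in the abstract can be pictured with a small example. The following is a minimal sketch, not the authors' implementation: it assumes that a candidate word is a fixed-length k-mer, that its start positions are compared against a uniform distribution over the sequence, and that an asymptotic Kolmogorov-Smirnov p-value is adequate. The function names `kmer_positions` and `ks_uniform` are hypothetical, and the integrity criterion is not covered here.

```python
import math
import random


def kmer_positions(sequence, k):
    """Map each k-mer to the list of 0-based start positions where it occurs."""
    positions = {}
    for i in range(len(sequence) - k + 1):
        positions.setdefault(sequence[i:i + k], []).append(i)
    return positions


def ks_uniform(positions, n_starts):
    """One-sample KS test of start positions against a uniform distribution.

    positions: start positions of one k-mer.
    n_starts:  number of possible start positions in the sequence.
    Returns (D, p); p is an asymptotic approximation (an assumption of
    this sketch, not a detail taken from the paper).
    """
    n = len(positions)
    # Rescale positions to (0, 1) so the reference CDF is F(x) = x.
    scaled = sorted((p + 0.5) / n_starts for p in positions)
    # Largest gap between the empirical CDF and the uniform CDF.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(scaled))
    # Small-sample correction to sqrt(n) * D, then the Kolmogorov series.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return d, 1.0  # p is indistinguishable from 1 in this range
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))


if __name__ == "__main__":
    rng = random.Random(0)
    # Random background with a word planted only in the first tenth.
    bases = [rng.choice("ACGT") for _ in range(20000)]
    word = "GATTACAGATTA"
    for start in rng.sample(range(0, 2000 - len(word)), 30):
        bases[start:start + len(word)] = word
    seq = "".join(bases)
    hits = kmer_positions(seq, len(word))[word]
    d, p = ks_uniform(hits, len(seq) - len(word) + 1)
    print(f"{word}: {len(hits)} hits, D={d:.3f}, p={p:.3g}")
```

In this sketch a small p-value marks a k-mer whose occurrences cluster in part of the sequence, which is the sense of "non-uniform" assumed here. Scanning a genome this way would involve many tests, so a multiple-testing correction would likely be needed in practice.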

