Abstract

The number of biological sequences in genomic databases such as GenBank has increased exponentially during the past decade. Sequence retrieval systems are required to quickly and efficiently find sequences that are related to a query sequence. Several comparison algorithms, which generally rely on the existence of local string similarities between the query and the database sequences, have been widely used and accepted as the basis for bio-sequence retrieval from DNA sequence databases. In this paper we describe a new method for sequence comparison based on k-mer word frequency profiles. In this algorithm, the distributions of the k-mer words found in the two sequences, defined as the sequences' profiles, are treated as signatures of the sequences. This representation enables us to compare sequence similarity using divergence measures based on Shannon's entropy. A profile-based search of the primate section of GenBank (GB-PRI, comprising approximately 114,000 DNA sequences) was performed using this approach. The results obtained establish the significance and validity of a profile-based genomic sequence retrieval algorithm.
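
The idea of comparing k-mer frequency profiles can be illustrated with a minimal sketch. The abstract does not say which entropy-based divergence is used, so the Jensen-Shannon divergence here is an assumption chosen for illustration. The function names `kmer_profile` and `jensen_shannon_divergence`, the choice of overlapping words, and the example sequences are likewise hypothetical and not taken from the paper.

```python
from collections import Counter
from math import log2


def kmer_profile(seq, k):
    """Return the relative frequency of each overlapping k-mer word in seq."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two profiles.

    Assumption: the paper only says "Shannon's entropy based divergence
    measures"; this particular measure is an illustrative choice.
    The result lies in [0, 1], with 0 meaning identical profiles.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two profiles
        if pw > 0:
            divergence += 0.5 * pw * log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * log2(qw / mw)
    return divergence


# Rank a toy "database" by divergence from a query (smaller is more similar).
query = "ACGTACGTAC"
database = {"seq1": "ACGTACGTAC", "seq2": "ACGTTTGTAC", "seq3": "GGGGGGGGGG"}
query_profile = kmer_profile(query, 3)
ranking = sorted(
    database,
    key=lambda name: jensen_shannon_divergence(
        query_profile, kmer_profile(database[name], 3)
    ),
)
```

Because each profile depends only on word counts, database profiles could be computed once and reused across queries. Whether the paper does this is not stated in the abstract.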
