Abstract

BackgroundSearching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches – most of which are irrelevant to the researcher.ResultsWe present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes – without the main memory constraints present in other systems.ConclusionWe present a time-efficient algorithm to locate small segments of DNA and concurrently to search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.

Highlights

  • Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes

  • Some search for short tandem nucleotide repeat (STNR) sequences that are commonly found throughout the genomes of higher organisms [1,2,3,4,5,6]

  • Others search for variable length tandem repeats (VLTR) and multi-period tandem repeats (MPTR) [8]

Read more

Summary

Introduction

Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. Many algorithms have been developed recently that search DNA sequences looking for various types of subsequences [1,2,3,4,5,6,7,8,9,10]. Many of the algorithms can identify repeats without a priori knowledge of the repeat pattern They do this by identifying short segments of DNA, termed words, and manipulating the properties of these words as a process moves down the length of the sequence. A search through 2.7 GBases of DNA took only 2.20 seconds per query on average while searching for homology with 177 query sequences (104,755 total bases) In this experiment, the system required a minimum homology of 2k - 1 bases (where the word size k = 10) to guarantee a match. Since the hash map is in main memory, scaling becomes a problem for multiple species searches

Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.