Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals

Jeff Reneker, Chi-Ren Shyu

doi:10.1186/1471-2105-6-111

Open Access

Abstract

Background: Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches – most of which are irrelevant to the researcher.

Results: We present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes – without the main memory constraints present in other systems.

Conclusion: We present a time-efficient algorithm to locate small segments of DNA and concurrently to search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.
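The abstract does not describe the hash function itself, so the following is only a minimal sketch of one common way to hash short DNA words: pack each base into two bits and slide the window along the sequence. The names `BASE_CODE`, `kmer_hash`, `find_word` and `search_genomes` are hypothetical, and the 2-bit packing is an assumption for illustration, not the authors' method.

```python
# Assumed 2-bit code per base; illustrative only, not the paper's hash function.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(word):
    """Pack an uppercase DNA word into an integer, two bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | BASE_CODE[base]
    return value


def find_word(sequence, word):
    """Return 0-based start positions of `word` in `sequence`.

    The window hash is updated in constant time per base. For a fixed
    word length the packing is one-to-one, so equal hashes mean equal words.
    """
    k = len(word)
    if k == 0 or len(sequence) < k:
        return []
    target = kmer_hash(word)
    mask = (1 << (2 * k)) - 1  # keep only the last k bases
    hits = []
    value = 0
    valid = 0  # consecutive unambiguous bases seen so far
    for i, base in enumerate(sequence):
        code = BASE_CODE.get(base)
        if code is None:  # ambiguous base such as N: restart the window
            value = 0
            valid = 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k and value == target:
            hits.append(i - k + 1)
    return hits


def search_genomes(genomes, word):
    """Search several named sequences; `genomes` maps a name to a sequence."""
    return {name: find_word(seq, word) for name, seq in genomes.items()}
```

For example, `find_word("TTCACGTGAACACGTTCACGTG", "CACGTG")` returns `[2, 16]`, and the same call with `"CACGTT"` returns `[10]`. A scan like this needs only the current window in memory, which is one simple way to avoid holding a whole-genome hash map in main memory.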

Highlights

Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes
Some search for short tandem nucleotide repeat (STNR) sequences that are commonly found throughout the genomes of higher organisms [1,2,3,4,5,6]
Others search for variable length tandem repeats (VLTR) and multi-period tandem repeats (MPTR) [8]

Summary

Introduction

Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. Many algorithms have been developed recently that search DNA sequences looking for various types of subsequences [1,2,3,4,5,6,7,8,9,10]. Many of the algorithms can identify repeats without a priori knowledge of the repeat pattern. They do this by identifying short segments of DNA, termed words, and manipulating the properties of these words as a process moves down the length of the sequence. A search through 2.7 GBases of DNA took only 2.20 seconds per query on average while searching for homology with 177 query sequences (104,755 total bases). In this experiment, the system required a minimum homology of 2k - 1 bases (where the word size k = 10) to guarantee a match. Since the hash map is in main memory, scaling becomes a problem for multiple species searches.
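The 2k - 1 bound can be illustrated with a short sketch. It assumes, as word-based search tools commonly do, that the database is indexed only at non-overlapping word positions (0, k, 2k, ...) while every overlapping word of the query is looked up. The summary does not say how the cited system builds its index, so this indexing scheme and the names `build_index` and `seed_hits` are assumptions. Under that scheme, any exact shared stretch of 2k - 1 bases must contain one complete indexed word, whereas a stretch of 2k - 2 bases can fall between word boundaries and be missed.

```python
from collections import defaultdict


def build_index(database, k):
    """Index words of length k at positions 0, k, 2k, ... (non-overlapping)."""
    index = defaultdict(list)
    for start in range(0, len(database) - k + 1, k):
        index[database[start:start + k]].append(start)
    return index


def seed_hits(query, index, k):
    """Look up every overlapping k-word of the query.

    Returns (query_position, database_position) pairs for each indexed word found.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            hits.append((q, d))
    return hits


# Toy data with k = 4; the indexed words are ACGT, TGCA, AGGC, TTAA, CCGG.
k = 4
database = "ACGTTGCAAGGCTTAACCGG"
index = build_index(database, k)

# Every window of 2k - 1 = 7 bases copied from the database yields a seed hit.
all_found = all(
    seed_hits(database[s:s + 2 * k - 1], index, k)
    for s in range(len(database) - (2 * k - 1) + 1)
)

# A window of 2k - 2 = 6 bases can straddle two indexed words and be missed.
missed = seed_hits(database[1:7], index, k)
```

Here `all_found` is `True` and `missed` is an empty list. With k = 10, as in the experiment quoted above, the same argument gives a guaranteed match length of 19 bases.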

Journal: BMC Bioinformatics
Publication Date: Jan 1, 2005
Citations: 28
License type: CC BY 2.0
