Abstract
A new method for the search of local repeats in long DNA sequences, such as complete genomes, is presented. It detects a large variety of repeats varying in length from one to several hundred bases, which may contain many mutations. By mutations we mean substitutions, insertions or deletions of one or more bases. The method is based on counting occurrences of short words (3-12 bases) in sequence fragments called windows. A score is computed for each window, based on calculating exact word occurrence probabilities for all the words of a given length in the window. The probabilities are defined using a Bernoulli model (independent letters) for the sequence, using the actual letter frequencies from each window. A plot of the probabilities along the sequence for high-scoring windows facilitates the identification of the repeated patterns. We applied the method to the 1.87 Mb sequence of chromosome 4 of Arabidopsis thaliana and to the complete genome of Bacillus subtilis (4.2 Mb). The repeats that we found were classified according to their size, number of occurrences, distance between occurrences, and location with respect to genes. The method proves particularly useful in detecting long, inexact repeats that are local, but not necessarily tandem. The method is implemented as a C program called EXCEP, which is available on request from the authors.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.