Abstract

BackgroundSequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary especially if errors must be considered.ResultsWe therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a considered maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at .ConclusionThe program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogenous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps.ClustDB is freely available for academic use.

Highlights

  • Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known

  • The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogenous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations

  • Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps

Read more

Summary

Introduction

Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Using suffix trees and suffix arrays, more efficient and exact methods of simultaneous sequence comparison exist These methods quickly identify perfect matches of substrings. By almost perfect sequence matching, we can locate BACs and shorter sequences on chromosomes, relate single ESTs to full-length cDNA and identify redundant and contaminated sequences Such methods quickly reveal large numbers of almost identical human ESTs stored in Genbank which cause largely increased multiple output of spliced alignment programs, genome browsers use to map ESTs on chromosomes. Four overlapping repeats found in chromosome IV are caused by a nine-fold tandem repeat of 3,259 base pairs that needs careful investigation Such facts are generally discovered by chance since existing methods for sequence matching cannot simultaneously compare sequence data as large as necessary. This program was further improved by a novel algorithm for match extension with errors which is the subject of this paper

Objectives
Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.