Simultaneous identification of long similar substrings in large sets of sequences

Jürgen Kleffe, Friedrich Möller, Burghardt Wittig

doi:10.1186/1471-2105-8-s5-s7

Open Access

https://doi.org/10.1186/1471-2105-8-s5-s7

Journal: BMC Bioinformatics	Publication Date: May 1, 2007
Citations: 18	License type: CC BY 2.0

Affiliation: Charité - Universitätsmedizin Berlin

Abstract

Background: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered.

Results: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a considered maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at .

Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogenous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
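The window criterion described in the Results can be illustrated with a small sketch. This is not the ClustDB implementation: it is a simplified, mismatch-only extension (no gaps) written for illustration, and the function name `extend_right` and its parameters `window` and `max_errors` are assumptions introduced here. Starting from the first position after an exact seed, it extends the match to the right as long as no window of `window` consecutive aligned positions contains more than `max_errors` mismatches.

```python
from collections import deque


def extend_right(a, b, i, j, window=10, max_errors=1):
    """Extend an ungapped match of a[i:] against b[j:] to the right.

    Extension stops just before the first mismatch that would put more than
    `max_errors` mismatches into some window of `window` consecutive aligned
    positions. Returns the number of aligned positions accepted.
    """
    recent = deque()  # offsets of accepted mismatches still inside the window
    limit = min(len(a) - i, len(b) - j)
    n = 0
    while n < limit:
        if a[i + n] != b[j + n]:
            # Forget mismatches that lie outside the window ending at offset n.
            while recent and recent[0] <= n - window:
                recent.popleft()
            if len(recent) == max_errors:
                break  # one more error would violate the window bound
            recent.append(n)
        n += 1
    # Trim trailing mismatches so the extension ends on a matching position.
    while n > 0 and a[i + n - 1] != b[j + n - 1]:
        n -= 1
    return n


if __name__ == "__main__":
    a = "AAAAAAAAAA"
    b = "AAATATAAAA"  # mismatches at offsets 3 and 5
    print(extend_right(a, b, 0, 0, window=4, max_errors=1))
    print(extend_right(a, b, 0, 0, window=2, max_errors=1))
```

With a window of 4, the two mismatches fall into one window and the extension stops before the second one; with a window of 2, they never share a window and the whole sequence is accepted. This is the sense in which the error bound is local rather than a global penalty score.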

Highlights

Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known
The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogenous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations
Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps

Summary

Introduction

Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. More efficient and exact methods of simultaneous sequence comparison exist that use suffix trees and suffix arrays. These methods quickly identify perfect matches of substrings. By almost perfect sequence matching, we can locate BACs and shorter sequences on chromosomes, relate single ESTs to full-length cDNA, and identify redundant and contaminated sequences. Such methods quickly reveal large numbers of almost identical human ESTs stored in GenBank, which greatly inflate the redundant output of the spliced alignment programs that genome browsers use to map ESTs on chromosomes. Four overlapping repeats found in chromosome IV are caused by a nine-fold tandem repeat of 3,259 base pairs that needs careful investigation. Such facts are generally discovered by chance, since existing methods for sequence matching cannot simultaneously compare sequence data as large as necessary. The program ClustDB was further improved by a novel algorithm for match extension with errors, which is the subject of this paper.
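How a suffix array exposes perfect substring matches can be shown with a toy example. The sketch below is an assumption-laden illustration, not the algorithm used by ClustDB or VMATCH: it builds a naive suffix array by sorting all suffixes (adequate only for tiny inputs), joins the sequences with a `$` separator that is assumed not to occur in the data, and reports only matches between suffixes that are adjacent in sorted order and cannot be extended to the left. The function name `shared_substrings` is hypothetical.

```python
from bisect import bisect_right


def shared_substrings(seqs, min_len):
    """Report exact matches of length >= min_len between suffixes that are
    neighbours in a naive suffix array of the '$'-joined sequences."""
    text = "$".join(seqs) + "$"
    # Start offset of every sequence inside the joined text.
    starts, pos = [], 0
    for s in seqs:
        starts.append(pos)
        pos += len(s) + 1

    def locate(p):
        k = bisect_right(starts, p) - 1
        return (k, p - starts[k])  # (sequence index, offset in that sequence)

    # Naive construction: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    hits = []
    for p, q in zip(sa, sa[1:]):
        # Common prefix length of two neighbouring suffixes, never crossing '$'.
        n = 0
        while (p + n < len(text) and q + n < len(text)
               and text[p + n] == text[q + n] and text[p + n] != "$"):
            n += 1
        if n < min_len:
            continue
        # Keep only matches that cannot be extended one character to the left.
        left_p = text[p - 1] if p > 0 else "$"
        left_q = text[q - 1] if q > 0 else "$"
        if left_p == "$" or left_q == "$" or left_p != left_q:
            first, second = sorted([locate(p), locate(q)])
            hits.append((first, second, n))
    return hits


if __name__ == "__main__":
    print(shared_substrings(["TTACGTACGG", "CCACGTACAA"], min_len=5))
```

In this example the only reported match is the shared substring ACGTAC at offset 2 of both sequences. A production tool would use a linear-time construction and enumerate all matching pairs rather than only neighbours in the suffix array.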
