An improved string composition method for sequence comparison

Guoqing Lu, Shunpu Zhang, Xiang Fang

doi:10.1186/1471-2105-9-s6-s15


Open Access

https://doi.org/10.1186/1471-2105-9-s6-s15


Journal: BMC bioinformatics	Publication Date: May 28, 2008
Citations: 59	License type: CC BY 2.0

Affiliation: University of Nebraska–Lincoln

Abstract

Background: Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison, one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, has both fundamental and computational limitations. Consequently, alignment-free methods have been explored as important alternatives in estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer from certain statistical problems and thereby underestimate the amount of evolutionary information in genetic sequences.

Results: We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust than existing counterparts and comparable in robustness with alignment-based methods.

Conclusion: We observed two problems in the currently used string composition methods and proposed a new robust method for estimating the evolutionary information of genetic sequences. In addition, we discussed why it might not be necessary to use relatively long strings to build a complete composition vector (CCV), given the overlapping nature of vector strings of variable length. We suggested a practical approach for choosing an optimal string length to construct the CCV.
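To make the composition vector idea in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It counts overlapping nucleotide strings of length k, converts the counts to frequencies, and expresses each frequency as a relative deviation from the value expected under a uniform and independent model (1/4^k for DNA). The function names, the `(f - expected) / expected` form of the deviation, and the cosine-based distance are assumptions chosen for illustration; the normalization equation actually proposed in the paper is not reproduced here.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Observed frequencies of all overlapping DNA strings of length k.

    Assumes seq contains only the letters A, C, G and T.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # Enumerate all 4**k strings in a fixed order so vectors are comparable.
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts.get(kmer, 0) / total for kmer in kmers}


def composition_vector(seq, k):
    """Relative deviation of each observed frequency from its expectation.

    Illustrative assumption: nucleotides are equally likely and independent,
    so every string of length k has expected frequency 0.25 ** k.
    """
    expected = 0.25 ** k
    return [(f - expected) / expected for f in kmer_frequencies(seq, k).values()]


def cosine_distance(u, v):
    """1 minus the cosine of the angle between two composition vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm
```

In this sketch, a pairwise distance between two sequences would be obtained as `cosine_distance(composition_vector(s1, k), composition_vector(s2, k))`, and a matrix of such distances could then be passed to a standard tree-building method.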

Highlights

Two categories of computational algorithms have been applied to sequence comparison–one of the most fundamental issues in bioinformatics
When trees generated by different methods are compared, the improved CCV (ICCV) tree and the tree constructed by the H5N1 Working Group have exactly the same topology, which suggests that the ICCV method is more dependable than the existing complete composition vector (CCV) method
We show that the existing composition vector (CV) and CCV methods underestimate the evolutionary information contained in a DNA sequence due to the Markov model assumption and the denominator used for the normalization of observed string frequencies

Summary

Introduction

Two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison, one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, has both fundamental and computational limitations [1,2,3,4]. Alignment-based methods cannot deal with changes such as chromosome reversal or gene translocation, and they encounter difficulties in aligning dissimilar sequences. Considerable efforts have therefore been made to seek alternative, i.e., alignment-free, methods for sequence comparison.

Methods

Results

Discussion

Conclusion