Alignment-Free Sequence Comparison Using N-Dimensional Similarity Space

Ramamurthy Jayalakshmi, Munusamy Vivekanandan, Ganapathy S Natarajan, Ramanathan Natarajan

doi:10.2174/1573409911006040290

Abstract

Several alignment-free sequence comparison methods are available, each measuring similarity with a particular numerical descriptor of biological sequences. Any information lost in transforming a sequence into a numerical descriptor affects the results. A pool of descriptors computed by different algorithms is expected to lose the least information, and this work takes that approach to study the similarity of DNA sequences that are homogeneous or heterogeneous. Several numerical descriptors for characterizing DNA sequences are described: those based on an information-theoretic approach, those based on the connectivity of vertex-weighted line graphs, and those derived from matrices obtained from graphs that depict a DNA sequence as a random walk on a Euclidean plane. The information-theoretic descriptors were obtained using the L-tuple approach for combinations of different numbers of bases. The connectivity-type descriptors were calculated by converting the DNA sequence into vertex-weighted graphs in which the vertices (nucleotides) were assigned weights based on the pKa of the bases. The graphical representations were converted into numerical descriptors by constructing matrices. Computer programs were developed to calculate seventy DNA descriptors, and 560 sequences from different types of organisms were used. After an initial data analysis to eliminate almost perfectly correlated descriptors, orthogonal descriptors were obtained by principal component analysis. The principal components (PCs) were used to construct an N-dimensional similarity space in which the 560 sequences were clustered by the k-means clustering algorithm. Five principal components (orthogonal descriptors) were extracted and found to explain 92% of the data variance, so the sequences were clustered in a five-dimensional similarity space. The similarity-based dissimilarity clustering procedure using numerical descriptors was found to be effective for studying the similarity/dissimilarity of a large number of sequences.
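
The pipeline described in the abstract (descriptors, removal of near-duplicate descriptors, principal component analysis, k-means clustering) can be illustrated with a minimal sketch. This is not the authors' program: the seventy descriptors are not reproduced, and the Shannon entropy of L-tuple frequencies stands in as one plausible information-theoretic descriptor. All function names, the 0.99 correlation cutoff, and the synthetic sequences are assumptions made for illustration; only the 92% variance target and the use of PCA followed by k-means come from the abstract.

```python
import math
from collections import Counter

import numpy as np


def ltuple_entropy(seq, L):
    """Shannon entropy (bits) of the overlapping L-tuple frequencies in seq."""
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def descriptor_matrix(seqs, max_L=3):
    """One row per sequence, one column per tuple length 1..max_L (assumed choice)."""
    return np.array([[ltuple_entropy(s, L) for L in range(1, max_L + 1)]
                     for s in seqs])


def drop_correlated(X, threshold=0.99):
    """Keep a column only if |r| with every already-kept column is below threshold."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep


def pca_scores(X, var_target=0.92):
    """Standardize columns (assumed non-constant), then keep the leading PCs
    whose cumulative explained variance reaches var_target."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    n = min(int(np.searchsorted(np.cumsum(explained), var_target)) + 1, len(S))
    return U[:, :n] * S[:n], explained[:n]


def kmeans(P, k, n_iter=100, seed=0):
    """Plain Lloyd iteration, initialized from k distinct data points."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([P[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers


if __name__ == "__main__":
    # Synthetic stand-in data: uniform-composition vs. A-rich sequences.
    rng = np.random.default_rng(1)
    seqs = ["".join(rng.choice(list("ACGT"), size=300, p=p))
            for p in ([0.25] * 4, [0.7, 0.1, 0.1, 0.1]) for _ in range(20)]
    X, kept = drop_correlated(descriptor_matrix(seqs))
    scores, explained = pca_scores(X)
    labels, _ = kmeans(scores, k=2)
    print("kept descriptor columns:", kept)
    print("explained variance per PC:", explained)
    print("cluster labels:", labels)
```

With only three toy descriptors, the correlation filter may leave a single column, so the similarity space here can be one-dimensional; the study itself reports five principal components from seventy descriptors.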
