Statistical Measures of DNA Sequence Dissimilarity under Markov Chain Models of Base Composition

Tiee‐Jian Wu, Ya‐Ching Hsieh, Lung‐An Li

doi:10.1111/j.0006-341x.2001.00441.x

https://doi.org/10.1111/j.0006-341x.2001.00441.x

Abstract

In molecular biology, the issue of quantifying the similarity between two biological sequences is very important. Past research has shown that word-based search tools are computationally efficient and can find some new functional similarities or dissimilarities invisible to other algorithms like FASTA. Recently, under the independent model of base composition, Wu, Burke, and Davison (1997, Biometrics 53, 1431-1439) characterized a family of word-based dissimilarity measures that defined distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Specifically, they introduced the use of Mahalanobis distance and standardized Euclidean distance into the study of DNA sequence dissimilarity. They showed that both distances had better sensitivity and selectivity than the commonly used Euclidean distance. The purpose of this article is to extend Mahalanobis and standardized Euclidean distances to Markov chain models of base composition. In addition, a new dissimilarity measure based on Kullback-Leibler discrepancy between frequencies of all n-words in the two sequences is introduced. Applications to real data demonstrate that Kullback-Leibler discrepancy gives a better performance than Euclidean distance. Moreover, under a Markov chain model of order kQ for base composition, where kQ is the estimated order based on the query sequence, standardized Euclidean distance performs very well. Under such a model, it performs as well as Mahalanobis distance and better than Kullback-Leibler discrepancy and Euclidean distance. Since standardized Euclidean distance is drastically faster to compute than Mahalanobis distance, in a usual workstation/PC computing environment, the use of standardized Euclidean distance under the Markov chain model of order kQ of base composition is generally recommended. However, if the user is very concerned with computational efficiency, then the use of Kullback-Leibler discrepancy, which can be computed as fast as Euclidean distance, is recommended. This can significantly enhance the current technology in comparing large datasets of DNA sequences.
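
To make the word-based comparison concrete, the following is a minimal Python sketch, not the authors' implementation. It counts overlapping n-words in two sequences and compares the resulting frequency vectors with plain Euclidean distance and a Kullback-Leibler-type discrepancy. The function names, the `pseudocount` parameter, and the exact form of the discrepancy (the standard one-directional Kullback-Leibler divergence between the two frequency vectors) are assumptions for illustration; the abstract does not give the paper's formula. The standardized Euclidean and Mahalanobis distances are omitted because they need the variances and covariances of word counts under the chosen Markov chain model, which the abstract does not specify.

```python
from collections import Counter
from itertools import product
from math import inf, log, sqrt

ALPHABET = "ACGT"


def word_frequencies(seq, n, pseudocount=0.0):
    """Relative frequencies of all 4**n overlapping n-words in seq.

    Words containing letters outside ACGT are ignored. The pseudocount
    (an illustrative assumption) is added to every word count so that
    no frequency is exactly zero.
    """
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    if total == 0:
        raise ValueError("sequence contains no valid n-words")
    return [(counts[w] + pseudocount) / total for w in words]


def euclidean_distance(f, g):
    """Euclidean distance between two n-word frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def kl_discrepancy(f, g):
    """Kullback-Leibler divergence of f from g (assumed form).

    Returns infinity when g assigns zero frequency to a word that
    occurs in f; a positive pseudocount avoids this case.
    """
    total = 0.0
    for a, b in zip(f, g):
        if a == 0:
            continue  # terms with a == 0 contribute nothing
        if b == 0:
            return inf
        total += a * log(a / b)
    return total


# Hypothetical query and target sequences, compared on 2-words.
query = "ACGTACGTTGCA"
target = "ACGTTCGTAGCA"
fq = word_frequencies(query, 2, pseudocount=0.5)
ft = word_frequencies(target, 2, pseudocount=0.5)
print(euclidean_distance(fq, ft), kl_discrepancy(fq, ft))
```

Both measures make a single pass over the 4**n word frequencies, which is consistent with the abstract's remark that Kullback-Leibler discrepancy can be computed as fast as Euclidean distance.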

Journal: Biometrics
Publication Date: Jun 1, 2001
Citations: 126

Similar Papers

Assessment of different genetic distances in constructing cotton core subset by genotypic values
Jian-Cheng Wang ... Xin-Xian Huang
Journal of Zhejiang University SCIENCE B | VOL. 9
01 May 2008

A Measure of DNA Sequence Dissimilarity Based on Mahalanobis Distance between Frequencies of Words
Tiee-Jian Wu ... John P Burke
Biometrics | VOL. 53
01 Dec 1997

A strategy on constructing core collections by least distance stepwise sampling
J C Wang ... J Hu
Theoretical and Applied Genetics | VOL. 115
03 Apr 2007

Comparison of clustering methods for study of genetic dissimilarity in soybean genotypes
African Journal of Agricultural Research | VOL. 10
12 Mar 2015
