Alignment-free Sequence Comparison for Biologically Realistic Sequences of Moderate Length

Conrad J Burden,Junmei Jing,Susan R Wilson

doi:10.2202/1544-6115.1724

Abstract

The D(2) statistic, defined as the number of matches of words of some pre-specified length k between two sequences, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as the variability in D(2) may be dominated by terms that reflect only the noise within each individual sequence. We examine the extent of this problem and how effectively it is overcome by two mean-centred variants of the statistic, D(2)* and D(2)c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of independent and identically distributed letters. We show that D(2) and D(2)c, and to a somewhat lesser extent D(2)*, perform well in tests to classify moderate-length query sequences as putative cis-regulatory modules.
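
As an illustration of the plain D(2) count described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes exact matching of overlapping words on a single strand; the function names `kmer_counts` and `d2` are hypothetical, and the mean-centred variants D(2)* and D(2)c are not covered here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Each occurrence of a word in seq_a is paired with each occurrence
    of the same word in seq_b, so the result is the sum over words w
    of count_a(w) * count_b(w).
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b.
    return sum(c * counts_b[w] for w, c in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "TTACGTAA", 3))
```

Counting words once per sequence and then multiplying the counts avoids comparing every pair of positions directly, which is part of why the statistic is cheap to compute.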

Journal: Statistical Applications in Genetics and Molecular Biology
Publication Date: Jan 9, 2011
Citations: 12

Similar Papers

Characterizing the D2 Statistic: Word Matches in Biological Sequences
Sylvain Forêt ... Conrad J Burden
Statistical Applications in Genetics and Molecular Biology | VOL. 8
08 Jan 2009

Clustering Molecular Sequences with Their Components.
Matsuda ... Suharnan
Genome Informatics | VOL. 8
01 Jan 1997

A statistical method for alignment-free comparison of regulatory sequences
Miriam R Kantorovitz ... Saurabh Sinha
Bioinformatics | VOL. 23
01 Jul 2007

Benchmarking antibody clustering methods using sequence, structural, and machine learning similarity measures for antibody discovery applications.
Dawid Chomicz ... Konrad Krawczyk
Frontiers in Molecular Biosciences | VOL. 11
28 Mar 2024
