Sequence Similarity Measures Research Articles

The tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of trees is key to analyzing this type of data, but it is not trivial because of the inherent complexity of trees and the resulting large search space. The tree kernel, a state-of-the-art similarity measurement for trees, represents trees as vectors in a feature space and measures similarity in that space; when different features are used, different algorithms are required. Tree edit distance is another widely used similarity measurement for trees. It measures similarity through the edit operations needed to transform one tree into another. Without any restrictions on the edit operations, the computational cost is too high for large volumes of data. Approximations have been introduced to improve the efficiency of tree edit distance, but they can compromise its effectiveness. In this paper, a novel approach to measuring tree similarity is presented. Trees are represented as multidimensional sequences, and their similarity is measured on the basis of these sequence representations. Multidimensional sequences have sequential dimensions and spatial dimensions. We measure the sequential similarity by the all common subsequences similarity measurement or the longest common subsequence measurement, and measure the spatial similarity by dynamic time warping. We then combine the two to give a measure of tree similarity. A brute-force algorithm for calculating the similarity would have a high computational cost. In the spirit of dynamic programming, two efficient algorithms with quadratic time complexity are designed for calculating the similarity. The new measurements are evaluated in terms of classification accuracy with two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in k-nearest neighbor similarity search, using three different data sets from natural language processing and information retrieval. Experimental results show that the new measurements outperform the benchmark measures consistently and significantly.
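The two building blocks named in the abstract, longest common subsequence and dynamic time warping, both have textbook dynamic-programming formulations with quadratic running time. The Python sketch below illustrates those textbook versions on plain one-dimensional sequences only. It is not the paper's algorithm: the function names `lcs_length` and `dtw_distance` are hypothetical, the absolute difference used as the local DTW cost is an assumption, and the way the paper encodes trees as multidimensional sequences and combines the two scores is not reproduced here.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence of a and b, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)  # DP row for the previous element of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best common subsequence of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last element of either a or b, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dtw_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Classic dynamic time warping distance, O(len(p) * len(q)).

    Assumption: the local cost is the absolute difference of two values.
    """
    inf = float("inf")
    n, m = len(p), len(q)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(p[i - 1] - q[j - 1])
            # A warping step may advance in p, in q, or in both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4, and `dtw_distance([1, 2, 3], [1, 2, 2, 3])` evaluates to 0.0 because the repeated value can be absorbed by warping.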


Motivation: The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets.

Results: We present the standardized alignment-free sequence similarity measure N2, a flexible framework that is defined for word neighbourhoods. We explore the usefulness of adding reverse complement words as well as words including mismatches into the neighbourhood. On simulated enhancer sequences as well as functional enhancers in mouse development, N2 is shown to outperform previous alignment-free measures. N2 is flexible, faster than competing methods and less susceptible to single sequence noise and the occurrence of repetitive sequences. Experiments on the mouse enhancers reveal that enhancers active in different tissues can be separated by pairwise comparison using N2.

Conclusion: N2 represents an improvement over previous alignment-free similarity measures without compromising speed, which makes it a good candidate for large-scale sequence comparison of regulatory sequences.

Availability: The software is part of the open-source C++ library SeqAn (www.seqan.de) and a compiled version can be downloaded at http://www.seqan.de/projects/alf.html

Contact: goeke@molgen.mpg.de; vingron@molgen.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
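The general idea of an alignment-free comparison based on word counts with reverse complement words in the neighbourhood can be sketched in a few lines of Python. This is a simplified illustration, not the N2 measure: it does not standardize the counts, does not handle mismatch neighbourhoods, and uses plain cosine similarity as an assumed stand-in for the scoring step. All function names are hypothetical and unrelated to the SeqAn API.

```python
from collections import Counter
from math import sqrt

# Translation table for complementing DNA bases (assumes an ACGT-only alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Reverse complement of a DNA word, e.g. AACG -> CGTT."""
    return word.translate(_COMPLEMENT)[::-1]


def neighbourhood_counts(seq: str, k: int, use_revcomp: bool = True) -> Counter:
    """Count every k-mer of seq; optionally credit its reverse complement too.

    With use_revcomp=True the count of a word is its own number of occurrences
    plus that of its reverse complement (palindromic words are counted twice).
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] += 1
        if use_revcomp:
            counts[reverse_complement(word)] += 1
    return counts


def cosine_similarity(c1: Counter, c2: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # at least one sequence is shorter than k
    return dot / (norm1 * norm2)
```

Including reverse complement words makes the comparison independent of strand: under this sketch a sequence and its reverse complement produce identical count vectors, so their similarity is 1 up to floating-point rounding.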


Articles published on Sequence Similarity Measures

On measuring similarity for sequences of itemsets

SSM-DENCLUE: Enhanced Approach for Clustering of Sequential Data: Experiments and Test Cases

A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering

An alternative approach for clustering web user sessions considering sequential information

Comparative and Phylogenomic Evidence That the Alphaproteobacterium HIMB59 Is Not a Member of the Oceanic SAR11 Clade

Are we what we do? Exploring group behaviour through user-defined event-sequence similarity

Formatt: Correcting protein multiple structural alignments by incorporating sequence alignment.

A novel hierarchical clustering algorithm for gene sequences

Querying event sequences by exact match or similarity search: Design and empirical evaluation

A Multidimensional Sequence Approach to Measuring Tree Similarity

Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts

A Novel Similarity Measure for Sequence Data

Efficient bitmap-based indexing of time-based interval sequences

Alignment-free Sequence Comparison for Biologically Realistic Sequences of Moderate Length

State-space dynamics distance for clustering sequential data

A New Similarity Metric for Sequential Data

Objective sequence-based subfamily classifications of mouse homeodomains reflect their in vitro DNA-binding preferences

Using Local Alignments for Relation Recognition

A visual framework for sequence analysis using n-grams and spectral rearrangement

V3 Loop Sequence Space Analysis Suggests Different Evolutionary Patterns of CCR5- and CXCR4-Tropic HIV
