Tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of such structures is key to analyzing this type of data. However, measuring tree similarity is not trivial due to the inherent complexity of trees and the ensuing large search space. Tree kernel, a state of the art similarity measurement of trees, represents trees as vectors in a feature space and measures similarity in this space. When different features are used, different algorithms are required. Tree edit distance is another widely used similarity measurement of trees. It measures similarity through edit operations needed to transform one tree to another. Without any restrictions on edit operations, the computation cost is too high to be applicable to large volume of data. To improve efficiency of tree edit distance, some approximations were introduced into tree edit distance. However, their effectiveness can be compromised. In this paper, a novel approach to measuring tree similarity is presented. Trees are represented as multidimensional sequences and their similarity is measured on the basis of their sequence representations. Multidimensional sequences have their sequential dimensions and spatial dimensions. We measure the sequential similarity by the all common subsequences sequence similarity measurement or the longest common subsequence measurement, and measure the spatial similarity by dynamic time warping. Then we combine them to give a measure of tree similarity. A brute force algorithm to calculate the similarity will have high computational cost. In the spirit of dynamic programming two efficient algorithms are designed for calculating the similarity, which have quadratic time complexity. The new measurements are evaluated in terms of classification accuracy in two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in k-nearest neighbor similarity search, using three different data sets from natural language processing and information retrieval. Experimental results show that the new measurements outperform the benchmark measures consistently and significantly.
Read full abstract