A tree is a data structure used to express various objects such as semistructured data and genes. When objects are represented as trees, computing tree similarity is essential for pattern recognition and retrieval. This paper considers the noisy subsequence tree recognition problem, whose purpose is to recognize the original tree given a noisy subsequence tree derived from it. Previous research on this problem relied on the constrained tree edit distance to measure dissimilarity. However, computing that distance requires the number of relabelings to be fixed in advance. This paper proposes a new dissimilarity measure for this problem. Our dissimilarity measure is obtained by counting the node edit operations in the unit-cost tree edit distance that contribute to the matching of node labels. The number of relabelings need not be specified to compute our measure. Moreover, our measure achieves more accurate recognition and faster execution than the constrained tree edit distance. Our measure is also useful for solving the tree inclusion problem, that is, deciding whether one tree includes another, and it quantifies the extent of approximate inclusion when one tree only partially includes another. © 2011 Wiley Periodicals, Inc. (An early version of this work was presented at the 8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'07), Springer LNCS, Vol. 4881, pp. 643–652, 2007.)
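To make the tree inclusion problem mentioned above concrete, the following is a minimal sketch of an exact inclusion test. It is not the dissimilarity measure proposed in the paper. It assumes ordered, labeled trees and takes "includes" to mean that the pattern can be obtained from the target by deleting nodes, where a deleted node's children take its place. The tuple representation and the names `forest_included` and `tree_included` are hypothetical, chosen only for illustration.

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), where children is a
# tuple of trees; a forest is a tuple of trees ordered left to right.

@lru_cache(maxsize=None)
def forest_included(pattern, target):
    """Return True if `pattern` can be obtained from `target` by deletions only."""
    if not pattern:
        return True   # every remaining target node can be deleted
    if not target:
        return False  # pattern nodes remain but nothing is left to match them
    p_label, p_children = pattern[-1]
    t_label, t_children = target[-1]
    # Option 1: delete the rightmost target root; its children take its place.
    if forest_included(pattern, target[:-1] + t_children):
        return True
    # Option 2: match the two rightmost roots when their labels agree, then
    # solve the children and the remaining left-hand forests independently.
    return (p_label == t_label
            and forest_included(p_children, t_children)
            and forest_included(pattern[:-1], target[:-1]))

def tree_included(pattern, target):
    """Ordered tree inclusion test for two single trees."""
    return forest_included((pattern,), (target,))
```

For example, with the target a(b(c, d), e), the pattern a(c, e) is included (delete b and d), while a(e, c) is not, because sibling order must be preserved under the ordered-tree assumption. The memoized recursion is meant for small trees only; it makes no attempt at the efficiency of specialized tree inclusion algorithms.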