Similarity Computation for XML Documents by XML Element Sequence Patterns

Haiwei Zhang, Zhongqi Liu, Xiaojie Yuan, Na Yang

doi:10.1007/978-3-540-78849-2_24

Abstract

Measuring the similarity between XML documents is the fundamental task in finding clusters in a collection of XML documents. In this paper, an XML document is modeled as an XML Element Sequence Pattern (XESP), which can be extracted using less time and space than other models such as the tree model and the frequent-paths model. Similarity between XML documents is then measured based on their XESPs. Frequent-path pattern models ignore hierarchical information, and tree models ignore semantic information; to address these deficiencies, both the semantics of the elements and the hierarchical structure of the document are taken into account when computing the similarity between XML documents by XESPs. Experimental results show that perfect clustering is obtained when a proper threshold is chosen for the similarity computed by XESPs.
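To make the idea of an element sequence concrete, the following is a minimal sketch and not the XESP definition or similarity measure from the paper, which the abstract does not spell out. It assumes that a document can be flattened into its element tags in document order, each paired with its nesting depth, and that two such sequences can be compared with a longest-common-subsequence ratio. The names `element_sequence`, `lcs_length`, and `sequence_similarity` are hypothetical, and the semantic matching of element names mentioned above is not modeled here.

```python
import xml.etree.ElementTree as ET


def element_sequence(xml_text):
    """Flatten a document into (tag, depth) pairs in document order.

    Assumption: this stands in for an element sequence pattern; the
    paper's actual XESP construction may differ.
    """
    root = ET.fromstring(xml_text)
    sequence = []

    def visit(node, depth):
        sequence.append((node.tag, depth))
        for child in node:
            visit(child, depth + 1)

    visit(root, 1)
    return sequence


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sequence_similarity(xml_a, xml_b):
    """Illustrative similarity in [0, 1]: 2 * LCS / (len(a) + len(b))."""
    a = element_sequence(xml_a)
    b = element_sequence(xml_b)
    return 2 * lcs_length(a, b) / (len(a) + len(b))


doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><author/></book>"
doc_c = "<book><author><title/></author></book>"

print(element_sequence(doc_a))
print(sequence_similarity(doc_a, doc_b))
print(sequence_similarity(doc_b, doc_c))
```

Because each tag carries its depth, `doc_b` and `doc_c` score lower against each other than `doc_a` and `doc_b` do, even though they use the same tag names: `title` sits at a different level. Documents could then be grouped by comparing this score against a threshold, in the spirit of the clustering described above.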
