Abstract
Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which aims to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match, in terms of accuracy, other XML clustering approaches, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
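The four steps named in the abstract can be illustrated with a minimal Python sketch. It is not the PathXP implementation: the pattern definition (root-anchored tag paths), the maximality test (a frequent path that is not a proper prefix of another frequent path), the prefix-based similarity, the single-link merging, and all function names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

def root_paths(xml_text):
    """Step 1 (assumed pattern definition): root-anchored tag paths of one document."""
    paths = set()
    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), ())
    return paths

def maximal_frequent_paths(doc_paths, min_support):
    """Step 2: paths found in at least min_support (a fraction) of the documents
    that are not a proper prefix of another frequent path."""
    counts = Counter(p for paths in doc_paths for p in paths)
    threshold = min_support * len(doc_paths)
    frequent = {p for p, c in counts.items() if c >= threshold}
    return sorted(p for p in frequent
                  if not any(q != p and q[:len(p)] == p for q in frequent))

def prefix_similarity(p, q):
    """Hypothetical similarity: shared prefix length over the longer path length."""
    common = 0
    for a, b in zip(p, q):
        if a != b:
            break
        common += 1
    return common / max(len(p), len(q))

def cluster_patterns(patterns, k):
    """Step 3: merge the two most similar profiles (single link) until k remain."""
    profiles = [[p] for p in patterns]
    def linkage(pair):
        i, j = pair
        return max(prefix_similarity(p, q)
                   for p in profiles[i] for q in profiles[j])
    while len(profiles) > k:
        i, j = max(combinations(range(len(profiles)), 2), key=linkage)
        profiles[i].extend(profiles.pop(j))
    return profiles

def assign_documents(doc_paths, profiles):
    """Step 4: assign each document to the profile with the most matching patterns."""
    labels = []
    for paths in doc_paths:
        scores = [sum(p in paths for p in profile) for profile in profiles]
        labels.append(scores.index(max(scores)))
    return labels

# Toy corpus (invented for this example): two "book" and two "article" documents.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<article><title/><journal/></article>",
    "<article><title/><journal/><pages/></article>",
]
doc_paths = [root_paths(d) for d in docs]
patterns = maximal_frequent_paths(doc_paths, min_support=0.5)
profiles = cluster_patterns(patterns, k=2)
labels = assign_documents(doc_paths, profiles)
print(labels)
```

Note that no document is ever compared with another document here: only patterns are compared with each other, and each document is then matched against the resulting profiles, which also serve as readable cluster representatives.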
Highlights
XML became an official W3C Recommendation [5] on February 10, 1998, and has since become one of the most popular ways of representing digital information
In [6], we presented an approach that utilized maximal frequent subtrees as patterns, grouped those patterns according to a similarity measure, and assigned XML documents to pattern groups
As noted by Halevy et al. [20], recent research shows that document processing requires the use of all available data. For this reason, metadata such as document statistics could become an important part of a real-world pattern definition; however, as the presented results show, using them alone is not sufficient
Summary
XML became an official W3C Recommendation [5] on February 10, 1998, and has since become one of the most popular ways of representing digital information. XML [42], and mathematics: MathML, a mathematical notation language [40]. Such a rapid expansion of this standard has led to the point where huge amounts of XML are generated every day. These data constitute a potentially important source of business and scientific knowledge, which, due to its size, requires automated processing. One of the most important XML mining tasks is XML clustering, which partitions a dataset into groups of presumably similar documents. It is believed that the structural information contained in XML documents cannot be ignored, and that algorithms dedicated to processing text documents are inappropriate for XML document clustering [10].