Abstract

Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which makes it possible to group documents according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match other XML clustering approaches in terms of accuracy, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
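
The four steps named in the abstract can be illustrated with a minimal Python sketch. It is not the PathXP algorithm: it assumes root-to-leaf tag paths as the pattern definition, skips the maximality filter, and groups patterns greedily by the Jaccard overlap of their supporting documents. All function names, the `min_support` and `threshold` parameters, and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def root_to_leaf_paths(xml_text):
    """Step 1 (assumed pattern definition): root-to-leaf tag paths of one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, ())
    return paths


def mine_frequent_paths(doc_paths, min_support):
    """Step 2: keep paths found in at least min_support documents."""
    support = defaultdict(set)
    for doc_id, paths in enumerate(doc_paths):
        for path in paths:
            support[path].add(doc_id)
    return {p: docs for p, docs in support.items() if len(docs) >= min_support}


def jaccard(a, b):
    """Overlap of two non-empty sets of document ids."""
    return len(a & b) / len(a | b)


def cluster_patterns(frequent, threshold):
    """Step 3 (assumed grouping rule): greedy grouping by shared supporting documents."""
    profiles = []
    for path in sorted(frequent):
        docs = frequent[path]
        for profile in profiles:
            if jaccard(profile["docs"], docs) >= threshold:
                profile["patterns"].add(path)
                profile["docs"] |= docs
                break
        else:
            profiles.append({"patterns": {path}, "docs": set(docs)})
    return [profile["patterns"] for profile in profiles]


def assign_documents(doc_paths, profiles):
    """Step 4: assign each document to the profile sharing the most patterns with it."""
    labels = []
    for paths in doc_paths:
        scores = [len(paths & profile) for profile in profiles]
        labels.append(max(range(len(profiles)), key=scores.__getitem__))
    return labels


# Hypothetical toy collection: two book-like and two article-like documents.
documents = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<article><title/><journal/></article>",
    "<article><title/><journal/></article>",
]
doc_paths = [root_to_leaf_paths(d) for d in documents]
frequent = mine_frequent_paths(doc_paths, min_support=2)
profiles = cluster_patterns(frequent, threshold=0.5)
labels = assign_documents(doc_paths, profiles)
print(labels)
```

Note that no document is ever compared with another document directly: documents are only matched against groups of patterns, which is the property the abstract highlights.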

Highlights

  • XML became an official W3C Recommendation [5] on February 10, 1998, and has since become one of the most popular ways of representing digital information

  • In [6], we presented an approach that utilized maximal frequent subtrees as patterns, grouped those patterns according to a similarity measure, and assigned XML documents to pattern groups

  • As noted by Halevy et al. [20], recent research shows that document processing requires the use of all available data. For this reason, metadata such as document statistics may become an important part of a real-world pattern definition; however, as the presented results show, using them alone is not sufficient

Summary

Introduction

XML became an official W3C Recommendation [5] on February 10, 1998, and has since become one of the most popular ways of representing digital information. [...] XML [42], and mathematics: MathML, a mathematical notation language [40]. This rapid expansion of the standard has led to huge amounts of XML being generated every day. These data constitute a potentially important source of business and scientific knowledge which, due to its size, requires automated processing. One of the most important XML mining tasks is XML clustering, which partitions a dataset into groups of presumably similar documents. It is believed that the structural information contained in XML documents cannot be ignored and that algorithms dedicated to processing text documents are inappropriate for XML document clustering [10].

Shortcomings of existing approaches
Our contributions
Related work
DTD approaches
Tag and path similarity approaches
Vector-based approaches
Entropy and FFT approaches
Edit distance approaches
Pattern approaches
Document clustering
Document assignment
Step 1
Step 2
Step 3
Step 4
Formal definition
The PathXP algorithm
The algorithm
Parametrization
Experimental evaluation
Datasets and experimental setup
Alternative pattern definitions
Analysis of the components of the proposed algorithm
Comparative study of clustering algorithms
Findings
Conclusions and future work