Clustering XML documents by patterns

Maciej Piernik, Dariusz Brzezinski, Tadeusz Morzy

doi:10.1007/s10115-015-0820-0

Open Access

https://doi.org/10.1007/s10115-015-0820-0

Abstract

Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match other XML clustering approaches in terms of accuracy, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
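The four steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' PathXP implementation: the function names, the `min_support` and `min_overlap` parameters, the use of root-anchored tag paths as the pattern definition, and the greedy grouping of patterns by the overlap of the documents that contain them are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_leaf_paths(xml_text):
    """Step 1 (assumed pattern definition): root-to-leaf tag paths."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def prefixes(paths):
    """All root-anchored prefixes of the given paths."""
    return {p[:i] for p in paths for i in range(1, len(p) + 1)}


def mine_maximal_frequent_paths(docs_prefixes, min_support):
    """Step 2: keep frequent paths that no other frequent path extends."""
    counts = Counter(p for doc in docs_prefixes for p in doc)
    threshold = min_support * len(docs_prefixes)
    frequent = {p for p, c in counts.items() if c >= threshold}
    return sorted(
        p for p in frequent
        if not any(len(q) > len(p) and q[:len(p)] == p for q in frequent)
    )


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_patterns(patterns, docs_prefixes, min_overlap):
    """Step 3 (hypothetical heuristic): greedily group patterns that
    occur in largely the same documents as a profile's first pattern."""
    profiles = []
    for pattern in patterns:
        docs = {i for i, d in enumerate(docs_prefixes) if pattern in d}
        for profile in profiles:
            if jaccard(docs, profile["seed_docs"]) >= min_overlap:
                profile["patterns"].append(pattern)
                break
        else:
            profiles.append({"seed_docs": docs, "patterns": [pattern]})
    return [p["patterns"] for p in profiles]


def assign_documents(docs_prefixes, profiles):
    """Step 4: assign each document to the profile whose patterns it
    contains most often; None if it contains no pattern at all."""
    labels = []
    for doc in docs_prefixes:
        scores = [sum(p in doc for p in profile) for profile in profiles]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        labels.append(best if best is not None and scores[best] > 0 else None)
    return labels


def cluster_by_patterns(xml_docs, min_support=0.5, min_overlap=0.5):
    docs_prefixes = [prefixes(root_to_leaf_paths(x)) for x in xml_docs]
    patterns = mine_maximal_frequent_paths(docs_prefixes, min_support)
    profiles = cluster_patterns(patterns, docs_prefixes, min_overlap)
    return profiles, assign_documents(docs_prefixes, profiles)
```

In this sketch documents are never compared with each other directly; each document is only matched against the pattern groups, which also serve as readable cluster representatives.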

Highlights

XML became an official W3C Recommendation [5] on February 10, 1998, and has since become one of the most popular ways of representing digital information.
In [6], we presented an approach that utilized maximal frequent subtrees as patterns, grouped those patterns according to a similarity measure, and assigned XML documents to pattern groups.
As noted by Halevy et al. [20], recent research shows that document processing requires the use of all available data. For this reason, metadata such as document statistics may become an important part of a real-world pattern definition, but, as the presented results show, it is not sufficient to use them alone.

Summary

Introduction

XML became an official W3C Recommendation [5] on February 10, 1998, and has since become one of the most popular ways of representing digital information. [...] XML [42], and mathematics: MathML, a mathematical notation language [40]. Such a rapid expansion of this standard has led to the point where huge amounts of XML data are being generated every day. These data constitute a potentially important source of business and scientific knowledge, which, due to its size, requires automated processing. One of the most important XML mining tasks is XML clustering, which partitions a dataset into groups of presumably similar documents. It is believed that the structural information contained in XML documents cannot be ignored and that algorithms dedicated to processing text documents are inappropriate for XML document clustering [10].

Shortcomings of existing approaches

Our contributions

Related work

DTD approaches

Tag and path similarity approaches

Vector-based approaches

Entropy and FFT approaches

Edit distance approaches

Pattern approaches

Document clustering

Document assignment

Step 1

Step 2

Step 3

Step 4

Formal definition

The PathXP algorithm

The algorithm

Parametrization

Experimental evaluation

Datasets and experimental setup

Alternative pattern definitions

Analysis of the components of the proposed algorithm

Comparative study of clustering algorithms

Findings

Conclusions and future work

Journal: Knowledge and Information Systems
Publication Date: Jan 23, 2015
Citations: 40
License type: CC BY 4.0
