Abstract

Exchanging data in XML format has become increasingly popular and widely applied because XML documents are simple to maintain and transfer. Accelerating search within such documents therefore contributes to search engine efficiency. In this paper, we propose a technique for detecting similarity in the structure of XML documents, and we then cluster the documents with the Delaunay Triangulation method. The technique is based on the idea of representing the structure of an XML document as a time series in which each occurrence of a tag corresponds to a given impulse. We can then use the Discrete Fourier Transform as a simple method to analyze these signals in the frequency domain and build similarity matrices through a distance measure, in order to group the documents into clusters. We use Delaunay Triangulation as the clustering method for the d-dimensional points that represent the XML documents. The results show significant efficiency and accuracy compared with common methods.

Highlights

  • The main idea of this method is based on the structure of XML documents; that is, the tags and the positions of elements in the XML tree hierarchy are what the method takes into account

  • The main contribution of our approach consists of these steps: 1) mapping each document to a time series; 2) applying the Discrete Fourier Transform (DFT) to move each time series from the time domain to the frequency domain; 3) mapping the signal of each document to a point in d-dimensional space; 4) triangulating the points that correspond to the documents; 5) clustering the documents based on their triangulation

  • We use two external metrics, F-Measure and Purity, to evaluate our method
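
Steps 1 and 2 of the list above, together with a distance between two documents, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the integer tag encoding, the zero-padding to a common length, and the Euclidean distance over DFT magnitudes are all assumptions chosen for illustration, and the function names `tag_series`, `dft_magnitudes`, and `structural_distance` are hypothetical. Only the Python standard library is used.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def tag_series(xml_text, tag_ids):
    """Encode document structure as a series: one impulse per opening tag.

    tag_ids maps tag name -> positive integer and is shared across documents,
    so the same tag always produces the same impulse height (assumed encoding).
    """
    root = ET.fromstring(xml_text)
    series = []
    for elem in root.iter():  # pre-order traversal follows opening-tag order
        series.append(tag_ids.setdefault(elem.tag, len(tag_ids) + 1))
    return series


def dft_magnitudes(series, n):
    """Magnitudes of the n-point DFT of the series, zero-padded to length n."""
    padded = list(series) + [0] * (n - len(series))
    mags = []
    for k in range(n):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(padded)
        )
        mags.append(abs(total))
    return mags


def structural_distance(xml_a, xml_b):
    """Euclidean distance between the DFT magnitude spectra of two documents.

    Zero-padding both series to the longer length is an assumption made here
    so the spectra are comparable; other alignment schemes are possible.
    """
    tag_ids = {}
    a = tag_series(xml_a, tag_ids)
    b = tag_series(xml_b, tag_ids)
    n = max(len(a), len(b))
    fa, fb = dft_magnitudes(a, n), dft_magnitudes(b, n)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(fa, fb)))


doc1 = "<a><b>x</b><c>y</c></a>"
doc2 = "<a><b>1</b><c>2</c></a>"      # same structure, different text
doc3 = "<a><c/><c/><c/><b/></a>"      # different structure
same = structural_distance(doc1, doc2)
diff = structural_distance(doc1, doc3)
```

Because only tags contribute impulses, `doc1` and `doc2` yield identical series and a distance of zero, while `doc3` yields a positive distance. Pairwise distances of this kind could populate a similarity matrix; the triangulation and clustering steps are not sketched here.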

Summary

Introduction

The main idea of this method is based on the structure of XML documents; that is, the tags and the positions of elements in the XML tree hierarchy are what the method takes into account. We use two external metrics, F-Measure and Purity, to evaluate our method. More information about this method is given in [1]. The documents used to evaluate the method come from a standard corpus, of which a part is used. This corpus has its own clustering metric, which we use as a point of comparison against our external metrics. The rest of the paper is organized as follows: in Section 2, we present some information about common methods for detecting similarities and clustering documents.
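
The two external metrics can be made concrete with a short Python sketch. It uses the commonly cited definitions of Purity and of the class-weighted clustering F-Measure; the summary does not spell out the exact formulas used in the paper, so treating these definitions as the ones intended is an assumption. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def purity(clusters, classes):
    """Fraction of documents that fall in the majority class of their cluster."""
    by_cluster = {}
    for c, k in zip(clusters, classes):
        by_cluster.setdefault(c, Counter())[k] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(classes)


def f_measure(clusters, classes):
    """Class-size-weighted average of each class's best F1 over all clusters."""
    n = len(classes)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    joint = Counter(zip(classes, clusters))  # (class, cluster) -> overlap count
    total = 0.0
    for k, nk in class_sizes.items():
        best = 0.0
        for c, nc in cluster_sizes.items():
            overlap = joint[(k, c)]
            if overlap == 0:
                continue
            precision = overlap / nc
            recall = overlap / nk
            best = max(best, 2 * precision * recall / (precision + recall))
        total += nk / n * best
    return total


# Toy example: six documents, two predicted clusters, two reference classes.
predicted = [0, 0, 0, 1, 1, 1]
reference = ["a", "a", "b", "b", "b", "b"]
p = purity(predicted, reference)
f = f_measure(predicted, reference)
```

Both metrics equal 1.0 when the predicted clusters match the reference classes exactly, and both drop as clusters mix documents from different classes.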

Related Work Summary
Implements Requirements and Performing
Mapping Each Documents to a Time Series
Triangulate Points Corresponding Documents
Clustering Documents Based on Their Triangulation
Clustering Evaluation’s Parameters and Notifications
Experimental Results
Conclusions and Future Works
