Abstract
Exchanging data in XML format has become increasingly popular and widespread because XML documents are simple to maintain and transfer. Accelerating search within such documents therefore improves the efficiency of search engines. In this paper, we propose a technique for detecting similarity in the structure of XML documents, and we then cluster the documents using the Delaunay Triangulation method. The technique is based on the idea of representing the structure of an XML document as a time series in which each occurrence of a tag corresponds to a given impulse. This lets us use the Discrete Fourier Transform as a simple method to analyze these signals in the frequency domain and to build similarity matrices from a distance measure, in order to group the documents into clusters. We use Delaunay Triangulation as the clustering method for the d-dimensional points that represent the XML documents. The results show significant efficiency and accuracy compared with common methods.
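The encoding and comparison described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not fix: each distinct tag name is given an arbitrary integer code (the `codes` dictionary is hypothetical), one impulse is emitted per start tag in document order, series of different lengths are zero-padded, and the distance is the Euclidean distance between DFT magnitude spectra. The paper's actual encoding and distance measure may differ.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def tag_series(xml_text, codes):
    """Emit one impulse per start tag, in document order.

    `codes` maps tag name -> integer and is extended on the fly.
    This encoding is an assumption made for illustration only.
    """
    root = ET.fromstring(xml_text)
    series = []
    for elem in root.iter():  # pre-order walk = order of start tags
        code = codes.setdefault(elem.tag, len(codes) + 1)
        series.append(float(code))
    return series


def dft_magnitudes(series, n):
    """Naive O(n^2) DFT of `series` zero-padded to length n; returns |X_k|."""
    padded = series + [0.0] * (n - len(series))
    magnitudes = []
    for k in range(n):
        x_k = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(padded))
        magnitudes.append(abs(x_k))
    return magnitudes


def structural_distance(a, b):
    """Euclidean distance between the magnitude spectra of two tag series."""
    n = max(len(a), len(b))
    fa, fb = dft_magnitudes(a, n), dft_magnitudes(b, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

Under these assumptions, two documents with the same tag sequence but different text content have distance zero, because only the tag structure enters the series.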
Highlights
The method is based on the structure of XML documents: the tags and the positions of elements in the XML tree hierarchy are taken into account
The main contribution of our approach consists of the following steps: 1) mapping each document to a time series; 2) applying the Discrete Fourier Transform (DFT) to transform each time series from the time domain to the frequency domain; 3) mapping the signal of each document to a point in d-dimensional space; 4) triangulating the points that represent the documents; 5) clustering the documents based on their triangulation
We use two external metrics, F-Measure and Purity, to evaluate our method
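Steps 4 and 5 can be sketched as follows. This is a simplified illustration and not the paper's algorithm: it works in 2 dimensions rather than d, it builds the Delaunay triangulation by brute force (a triangle is kept when no other point lies inside its circumcircle), and it assumes one common clustering rule, namely removing edges longer than `factor` times the mean edge length and taking connected components. The function names and the `factor` parameter are hypothetical.

```python
import math
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_edges(points):
    """Brute-force 2-D Delaunay: keep triangles whose circumcircle is empty."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-9
               for m, (px, py) in enumerate(points) if m not in (i, j, k)):
            edges.update({(i, j), (i, k), (j, k)})
    return edges


def cluster(points, factor=1.0):
    """Drop edges longer than factor * mean edge length (assumed rule),
    then label the connected components of the remaining graph."""
    edges = delaunay_edges(points)
    length = {e: math.dist(points[e[0]], points[e[1]]) for e in edges}
    cutoff = factor * sum(length.values()) / len(length)
    parent = list(range(len(points)))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), dist in length.items():
        if dist <= cutoff:
            parent[find(i)] = find(j)
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]
```

The brute-force triangulation costs O(n^4) and is only meant to make the empty-circumcircle criterion explicit; a production implementation would use an incremental or divide-and-conquer algorithm that also handles d dimensions.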
Summary
The method is based on the structure of XML documents: the tags and the positions of elements in the XML tree hierarchy are taken into account. We use two external metrics, F-Measure and Purity, to evaluate our method. More information about this method is given in [1]. The documents used to evaluate the method come from a standard corpus, of which a part is used. This corpus has its own clustering metric, which we compare against our external metrics. The rest of the paper is organized as follows: in Section 2, we present some information about common methods for detecting similarities and clustering documents.
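For reference, the two external metrics can be computed as in the sketch below. It uses the definitions commonly found in the clustering literature: Purity is the fraction of documents that belong to the majority class of their cluster, and F-Measure is the class-size-weighted average of the best F1 score each class achieves over all clusters. The exact variants used in the paper are not stated here, so these definitions are an assumption.

```python
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of items that fall in the majority class of their cluster."""
    by_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    majority = sum(max(c.values()) for c in by_cluster.values())
    return majority / len(labels_true)


def f_measure(labels_true, labels_pred):
    """Class-size-weighted best F1 of each class over all clusters."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_clu
            recall = n_ij / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total
```

Both metrics lie in the interval (0, 1] and equal 1 when the clustering reproduces the reference classes exactly.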