XML CLUSTERING FRAMEWORK BASED ON DOCUMENT CONTENT AND STRUCTURE IN A HETEROGENEOUS DIGITAL LIBRARY

Nafisse Samadi, Sri Devi Ravana

doi:10.22452/mjcs.vol36no2.2

Abstract

As the volume of textual information published in digital libraries grows, efficient retrieval methods are required. Textual documents in a digital library vary in structure and content. When such documents are organized in XML, they can be represented at hierarchical levels of granularity, which allows focused retrieval to improve precision. In this way, the contextual elements of each document can be retrieved from a known structure. One way to retrieve these elements is to cluster them by a combination of Content and Structural similarities. To this end, a novel two-level clustering framework based on Content and Structure is proposed. The framework decomposes a document into meaningful structural units and analyzes all of its rich text within its own structure. The framework was evaluated on a heterogeneous XML document collection with a variety of data sources, structures, and content, assembled to represent a sample of a real digital library. The collection was built so that all of our objectives could be tested. The clustering results were evaluated by the Entropy criterion. Finally, Content and Structure clustering was compared with conventional Content Only clustering to demonstrate the benefit of considering structural features in the retrieval process. The total Entropy of the two-level Content and Structural clustering is almost twice as good as that of the Content Only approach. Consequently, the proposed framework can improve Information Retrieval systems in two ways: i) by considering the structural aspect of text-rich documents in the retrieval process, and ii) by replacing document-level retrieval with element-level retrieval.
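The abstract names the Entropy criterion without stating its formula. As a rough illustration only, the sketch below uses one common definition: the entropy of the true class labels inside each cluster, weighted by cluster size and summed, where lower is better. The log base, the function names, and the toy labels are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(labels):
    """Shannon entropy (base 2) of the class labels inside one cluster."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def total_entropy(cluster_ids, class_labels):
    """Size-weighted sum of per-cluster entropies; 0 means every cluster is pure."""
    clusters = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        clusters[cid].append(label)
    n = len(class_labels)
    return sum(len(members) / n * cluster_entropy(members)
               for members in clusters.values())


# Hypothetical toy data: six elements drawn from three known classes.
class_labels = ["book", "book", "article", "article", "thesis", "thesis"]
pure_clustering = [0, 0, 1, 1, 2, 2]   # each cluster holds a single class
mixed_clustering = [0, 1, 0, 1, 0, 1]  # each cluster mixes all three classes

print(total_entropy(pure_clustering, class_labels))
print(total_entropy(mixed_clustering, class_labels))
```

Under this definition a clustering that separates the classes perfectly scores zero, and one that spreads every class evenly across clusters scores the maximum, so a lower total indicates a better result.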

More From: Malaysian Journal of Computer Science

Similar Papers

Digital Library Protection Using Patent of Retrieval Process
Hideyasu Sasaki ... Yasushi Kiyoki
18 Jan 2011

Digital Library Protection Using Patent of Retrieval Process
Hideyasu Sasaki ... Yasushi Kiyoki
01 Jan 2008

Event-based retrieval from digital libraries containing data streams
...
01 Jan 2003

Significance of clustering and classification applications in digital and physical libraries
Ioannis Triantafyllou ... Alexandros Koulouris
01 Jan 2015
