XStruct: Efficient Schema Extraction from Multiple and Large XML Documents

J. Hegewald, M. Weis, F. Naumann

doi:10.1109/icdew.2006.166

Abstract

XML is the de facto standard format for data exchange on the Web. While it is fairly simple to generate XML data, it is a complex task to design a schema and then guarantee that the generated data is valid according to that schema. As a consequence, much XML data does not have a schema or is not accompanied by its schema. To gain the benefits of having a schema (efficient querying and storage of XML data, semantic verification, data integration, etc.), this schema must be extracted. In this paper we present XStruct, an automatic technique for XML Schema extraction. Based on ideas of [5], XStruct extracts a schema for XML data by applying several heuristics to deduce regular expressions that are 1-unambiguous and describe each element’s contents correctly, but generalized to a reasonable degree. Our approach has several advantages over known techniques: XStruct scales to very large documents (beyond 1 GB) in both time and memory consumption; it is able to extract a general, complete, correct, minimal, and understandable schema for multiple documents; and it detects datatypes and attributes. Experiments confirm these features and properties.
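To make the idea of schema extraction more concrete, the following is a minimal sketch of the kind of evidence such a tool might gather before any generalization takes place. It is not XStruct's algorithm: the heuristics that deduce 1-unambiguous regular expressions are not reproduced here. The function names `collect_structure` and `infer_type`, the three-way datatype guess, and the sample documents are all assumptions made for illustration. The sketch streams each document with Python's standard `xml.etree.ElementTree.iterparse` and records, per element name, the observed child sequences, attribute names, and a rough datatype for text-only content.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, deliberately coarse lexical patterns; a real tool would
# cover many more XML Schema datatypes.
INTEGER = re.compile(r"[+-]?\d+")
DECIMAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def infer_type(text):
    """Guess a simple XML Schema datatype name for a text value."""
    if INTEGER.fullmatch(text):
        return "xs:integer"
    if DECIMAL.fullmatch(text):
        return "xs:decimal"
    return "xs:string"


def collect_structure(sources):
    """Stream several XML sources and collect per-element observations.

    Returns three dicts keyed by element name: the set of observed child
    sequences (as tuples), the set of attribute names, and the set of
    guessed datatypes for elements that contain only text.
    """
    child_seqs = defaultdict(set)
    attributes = defaultdict(set)
    datatypes = defaultdict(set)
    for source in sources:
        stack = []  # one list of child names per currently open element
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                if stack:
                    stack[-1].append(elem.tag)
                stack.append([])
            else:
                children = stack.pop()
                child_seqs[elem.tag].add(tuple(children))
                attributes[elem.tag].update(elem.attrib)
                text = (elem.text or "").strip()
                if not children and text:
                    datatypes[elem.tag].add(infer_type(text))
                elem.clear()  # drop the element's content once it is processed
    return child_seqs, attributes, datatypes


# Two small made-up documents standing in for a document collection.
docs = [
    b"<lib><book id='1'><title>A</title><year>2001</year></book></lib>",
    b"<lib><book><title>B</title></book>"
    b"<book><title>C</title><year>1999</year></book></lib>",
]
child_seqs, attributes, datatypes = collect_structure(io.BytesIO(d) for d in docs)
```

In this sketch, `book` is observed with the child sequences `(title, year)` and `(title)`. A generalization step, which is omitted here, would have to turn such observations into a single content model such as `title, year?`.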
