With the explosive growth of heterogeneous XML sources, data inconsistency has become a serious problem that leads to ineffective business operations and poor decision-making. To address such inconsistency, XML functional dependencies (XFDs) have been proposed to constrain the data integrity of a source. Unfortunately, existing approaches to XFDs do not adequately address the data inconsistency that arises from the semantic and structural inconsistencies inherent in heterogeneous XML data sources. This paper proposes a novel approach, called SCAD, to discover anomalies from a given source, which is essential to address prevalent inconsistencies in XML data. Our contribution is twofold. First, we introduce a new type of path and value-based data constraint, called XML Conditional Structural Dependency (XCSD), whereby (i) the paths in an XCSD approximately represent groups of similar paths in sources, to express constraints on objects with diverse structures; and (ii) the values bound to particular elements express constraints with conditional semantics. XCSDs can capture data inconsistency disregarded by XFDs. Second, our proposed SCAD is used to discover XCSDs from a given source. Our approach exploits the semantics of data structures to detect similar paths in the sources, from which a data summary is constructed as input for the discovery process. This avoids returning redundant data rules caused by structural inconsistencies. During the discovery process, SCAD employs semantics hidden in the data values to discover XCSDs. To evaluate our proposed approach, experiments and case studies were conducted on synthetic datasets that contain structural diversity causing XML data inconsistency. The experimental results show that SCAD can discover more dependencies, and that the dependencies found convey more meaningful semantics than those conveyed by existing XFDs.