Data Cleaning Model for XML Datasets using Conditional Dependencies

Mohammed Ragheb Hakawati, Yasmin Yacob, Mustafa M. Khalifa Jabiry, Eiad Syaf Alhudiani, Rafikha Aliana A Raof

doi:10.24018/ejece.2020.4.1.163

Abstract

Data cleaning is an essential phase for enhancing overall data quality and has been applied for decades across different data models, although most work has handled relational datasets, the dominant data model. However, alongside the relational model, XML is among the data models most commonly used for storing, retrieving, and querying valuable data. In this paper, we introduce a model for detecting and repairing XML data inconsistencies using a set of conditional dependencies. Inconsistencies are detected by joining the existing data source with a set of pattern tableaux that represent the conditional dependencies; the violating values are then updated to match the proper patterns using a set of SQL statements. This research is the final phase of a cleaning model for XML datasets: the XML document is first mapped to a set of related tables, a set of conditional dependencies (functional and inclusion) is then discovered, and finally the algorithms presented here are applied as the closing step of quality enhancement.
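
The detect-then-repair step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the table names (`customer`, `tableau`), the columns (`cc`, `ac`, `city`), the sample rows, and the `_` wildcard convention are all assumptions made for illustration, and only constant right-hand-side patterns of a conditional functional dependency are handled. The sketch assumes the XML document has already been mapped to a relational table.

```python
import sqlite3


def build_demo():
    """Create a hypothetical relation and a one-row pattern tableau."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, cc TEXT, ac TEXT, city TEXT);
        -- Pattern tableau for the assumed dependency (cc, ac -> city);
        -- '_' is used here as a wildcard that matches any value.
        CREATE TABLE tableau (cc TEXT, ac TEXT, city TEXT);
        INSERT INTO customer VALUES
            (1, '44', '131', 'Edinburgh'),
            (2, '44', '131', 'London'),
            (3, '01', '908', 'Murray Hill');
        INSERT INTO tableau VALUES ('44', '131', 'Edinburgh');
    """)
    return conn


def detect_and_repair(conn):
    """Return ids of rows violating a constant pattern, then repair them."""
    # Detection: join the data with the tableau on the left-hand-side
    # attributes and keep rows whose right-hand side differs from the pattern.
    detect_sql = """
        SELECT c.id
        FROM customer AS c
        JOIN tableau AS t
          ON (t.cc = '_' OR t.cc = c.cc)
         AND (t.ac = '_' OR t.ac = c.ac)
        WHERE t.city <> '_' AND c.city <> t.city
        ORDER BY c.id
    """
    violations = [row[0] for row in conn.execute(detect_sql)]

    # Repair: overwrite the violating value with the pattern constant.
    repair_sql = """
        UPDATE customer
        SET city = (SELECT t.city FROM tableau AS t
                    WHERE (t.cc = '_' OR t.cc = customer.cc)
                      AND (t.ac = '_' OR t.ac = customer.ac)
                      AND t.city <> '_' AND customer.city <> t.city
                    LIMIT 1)
        WHERE EXISTS (SELECT 1 FROM tableau AS t
                      WHERE (t.cc = '_' OR t.cc = customer.cc)
                        AND (t.ac = '_' OR t.ac = customer.ac)
                        AND t.city <> '_' AND customer.city <> t.city)
    """
    conn.execute(repair_sql)
    return violations


if __name__ == "__main__":
    demo = build_demo()
    print(detect_and_repair(demo))
```

Patterns with a wildcard on the right-hand side require comparing pairs of tuples that agree on the left-hand side, which this sketch does not attempt.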

Highlights

Data is becoming the lifeblood of companies as various database systems, such as Decision Support Systems, Customer Relationship Management, Big Data industry projects, and Internet of Things systems, are being used; valuable information and expertise can be obtained from a large amount of data.
The authors of [4] provide an in-depth analysis to answer the question “Is the quality of XML documents found on the web sufficient to apply XML technologies like XQuery, XPath, and XSLT?” The results show that 58% of the existing documents on the web are in XML file format, and one-third of these documents are accompanied by a valid XML Schema Definition (XSD) or Document Type Definition (DTD).
Regarding the running time of pattern discovery, the scalability results show that the elapsed time is almost constant and depends on the number of tree tuples (table tuples).

Summary

Introduction

Data is becoming the lifeblood of companies as various database systems, such as Decision Support Systems, Customer Relationship Management, Big Data industry projects, and Internet of Things systems, are being used; valuable information and expertise can be obtained from a large amount of data. According to studies and reports published by V12-Data in 2015, the cost of bad data could be considerably higher than 12% in lost revenue. The vast majority of organizations (86%) acknowledged that their data might be wrong in some way, while 44% of businesses and organizations reported missing or incomplete data as the most frequent issue alongside obsolete information [2]. The impacts of poor data quality can be operational (causing customer and employee dissatisfaction and increased costs), tactical (affecting decision making and causing mistrust), and strategic (affecting the organization’s overall strategy). Any system or enterprise that relies heavily on data is prone to problems if the data being handled does not possess the expected quality [3].
