Abstract

ETL (extract, transform, and load) processes extract data from external sources, transform it to satisfy integration and cleanliness requirements, and load it into the data warehouse. In data mining, there is particular interest in using metrics to build efficient classification algorithms. One such approach uses metrics on partitions, based on the Shannon entropy, to study the degree of concentration of values. In this paper we show how this idea can be used to verify the consistency of data loaded into the data warehouse by ETL processes. We calculate the Shannon entropy and the Gini index on partitions induced by attribute sets and show that these values can signal a possible problem in the data extraction process. We also show that the choice of the attribute set determining the partition can have a significant impact on the effectiveness of the method.
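
To make the idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes the standard definitions: for a partition whose blocks hold fractions p_i of the rows, the Shannon entropy is -sum(p_i * log2(p_i)) and the Gini index is 1 - sum(p_i^2). The function names, the toy rows, and the attribute names `region` and `channel` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def partition_block_sizes(rows, attributes):
    """Sizes of the blocks of the partition induced by `attributes`.

    Two rows fall in the same block when they agree on every listed attribute.
    """
    keys = (tuple(row[a] for a in attributes) for row in rows)
    return list(Counter(keys).values())


def shannon_entropy(sizes):
    """Shannon entropy (base 2) of a partition given its block sizes."""
    total = sum(sizes)
    return -sum((s / total) * log2(s / total) for s in sizes)


def gini_index(sizes):
    """Gini index of a partition given its block sizes."""
    total = sum(sizes)
    return 1.0 - sum((s / total) ** 2 for s in sizes)


# Hypothetical loads: the second one lost every "south" row during extraction.
reference_load = [
    {"region": "north", "channel": "web"},
    {"region": "north", "channel": "store"},
    {"region": "south", "channel": "web"},
    {"region": "south", "channel": "store"},
]
suspect_load = [
    {"region": "north", "channel": "web"},
    {"region": "north", "channel": "store"},
    {"region": "north", "channel": "web"},
    {"region": "north", "channel": "store"},
]

for attributes in (["region"], ["channel"]):
    for name, rows in (("reference", reference_load), ("suspect", suspect_load)):
        sizes = partition_block_sizes(rows, attributes)
        print(attributes, name, shannon_entropy(sizes), gini_index(sizes))
```

In this toy data, partitioning by `region` separates the two loads (entropy 1.0 and Gini index 0.5 for the reference load, both 0 for the suspect load), whereas partitioning by `channel` gives identical values for both. This mirrors the point that the choice of attribute set affects how well the check works.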
