Abstract

Data sources (DSs) integrated in a data warehouse frequently change their structures (schemas). As a consequence, an already deployed ETL workflow often stops executing and yields errors. Since the number of ETL workflows in big companies may reach tens of thousands and structural changes of DSs are frequent, automatic repair of an ETL workflow after such changes is of high practical importance. We developed a framework, called E-ETL, for handling the evolution of an ETL layer. In the framework, an ETL workflow is repaired semi-automatically or automatically (depending on the case) after structural changes in DSs, so that it works with the changed DSs. E-ETL supports two repair methods: (1) user-defined rules and (2) Case-Based Reasoning. In this paper, we present how Case-Based Reasoning may be applied to repairing ETL workflows. In particular, we contribute an algorithm for selecting the most suitable case for a given ETL evolution problem. The algorithm applies a technique for reducing cases in order to make them more universal and capable of solving more problems. The algorithm has been implemented in the E-ETL prototype and evaluated experimentally. The obtained results are also discussed in this paper.

Highlights

  • A data warehouse (DW) system has been developed in order to provide a framework for the integration of heterogeneous, distributed, and autonomous data storage systems deployed in a company

  • In ETL process repair, a case can be described as Case = (DSCs, Ms), where DSCs is a set of data source changes (each one a DSC) and Ms is a recipe for repairing an ETL workflow

  • Example 3: Let us consider data source Di, composed of multiple tables containing the same set of columns

  • The Case-Based Reasoning method for repairing ETL workflows is based on the Library of Repair Cases (LRC)
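
The Case = (DSCs, Ms) structure and a lookup in a library of repair cases can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `(kind, element)` encoding of a change, the textual repair steps, the table and column names, and above all the selection criterion (take the case whose changes all occur in the problem and that covers the most of them). It is a placeholder and does not reproduce the similarity-based selection algorithm that the paper contributes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence, Tuple

# Hypothetical encoding: one data source change is (kind, affected element).
DSC = Tuple[str, str]


@dataclass(frozen=True)
class Case:
    dscs: FrozenSet[DSC]   # DSCs: the set of data source changes
    ms: Tuple[str, ...]    # Ms: the repair recipe for the ETL workflow


def select_case(library: Sequence[Case], problem: FrozenSet[DSC]) -> Optional[Case]:
    """Return the case whose DSCs all occur in the problem and cover the most of it.

    Placeholder criterion (an assumption), not the paper's similarity measure.
    """
    applicable = [c for c in library if c.dscs and c.dscs <= problem]
    return max(applicable, key=lambda c: len(c.dscs), default=None)


# A tiny, made-up library of repair cases.
library = [
    Case(
        dscs=frozenset({("rename_column", "orders.total")}),
        ms=("update the column reference in the extract step",),
    ),
    Case(
        dscs=frozenset({("rename_column", "orders.total"),
                        ("drop_column", "orders.note")}),
        ms=("update the column reference in the extract step",
            "remove the dropped column from the load mapping"),
    ),
]

# A new evolution problem: the changes detected in the data source.
problem = frozenset({
    ("rename_column", "orders.total"),
    ("drop_column", "orders.note"),
    ("add_column", "orders.currency"),
})

best = select_case(library, problem)
```

Here `best` is the second case, because both of its changes appear in the problem. The unmatched `add_column` change stays unhandled, which is one reason a real selection step needs a richer notion of similarity and applicability than subset coverage.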

Summary

Introduction

A data warehouse (DW) system has been developed in order to provide a framework for the integration of heterogeneous, distributed, and autonomous data storage systems (typically databases) deployed in a company. We have designed and developed a prototype ETL framework, called E-ETL (Wojciechowski 2011, 2013a, b), that is able to repair its workflows automatically (in simple cases) or semi-automatically (in more complex cases) as a result of structural changes in data sources. To this end, in Wojciechowski (2015) we proposed an initial version of a repair algorithm for an ETL workflow.

Case-based reasoning for ETL repair
ETL workflow representation
DS changes and ETL repairs
Repair cases
Library of repair cases in E-ETL
E-ETL framework
Library scope
Case detection in ETL process
Completeness
Minimality
Redundant DSCs
Non-repairing modifications
Case detection algorithm
Choosing the right case
Measuring similarity of cases
Similarity of data sources
Applicability
Similarity of semantics
Searching algorithm
Storing the library of repair cases
Case reduction
DSCs reduction
Use case
Performance evaluation
Searching the best case for a variable number of DSCs
Searching the best case for a variable size of a data source
Searching the best case for a variable number of activities in an ETL process
Searching the best case for a variable size of an ETL process
Related work
Summary