Abstract

In this paper, we discuss the state of the art and current trends in designing and optimizing ETL workflows. We explain the existing techniques for: (1) constructing a conceptual and a logical model of an ETL workflow, (2) its corresponding physical implementation, and (3) its optimization, illustrated by examples. The discussed techniques are analyzed w.r.t. their advantages, disadvantages, and challenges in the context of metrics such as autonomous behavior, support for quality metrics, and support for ETL activities as user-defined functions. We draw conclusions on still open research and technological issues in the field of ETL. Finally, we propose a theoretical ETL framework for ETL optimization.

Highlights

  • A data warehouse (DW) integrates multiple heterogeneous and distributed data sources (DSs) in order to provide a centralized and unified access to data with the end goal of decision support [1]

  • The DW stores monthly data on the available quantity of parts per supplier in the European format, which means that data coming from S2 need to be converted into the European format and data coming from S1 need to be rolled up to the month level in order to be accepted by the DW (a sketch of both transformations follows this list)

  • ETL activities include, among others, deduplication and loading; all these activities can be represented and executed in many distinctive ways, e.g., using relational operators or user-defined functions, which results in a complex design of an ETL workflow
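
As a rough illustration of the two transformations mentioned in the second highlight, the sketch below converts assumed US-style dates from S2 into the European format and rolls up assumed daily quantities from S1 to the month level. The source layouts, column order, and date formats are assumptions made for this example only; they are not fixed by the highlights above.

```python
from collections import defaultdict
from datetime import datetime

def us_to_european_date(us_date: str) -> str:
    # Convert an assumed US-style date (MM/DD/YYYY) into the European style (DD.MM.YYYY).
    return datetime.strptime(us_date, "%m/%d/%Y").strftime("%d.%m.%Y")

def rollup_to_month(daily_rows):
    # Aggregate assumed daily (supplier, part, date, quantity) rows into month-level totals.
    totals = defaultdict(int)
    for supplier, part, day, qty in daily_rows:
        month = day[:7]                      # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(supplier, part, month)] += qty
    return dict(totals)

# S1 is assumed to deliver daily availability -> roll it up to the month level.
s1 = [("SUP1", "P10", "2000-01-05", 40), ("SUP1", "P10", "2000-01-20", 60)]
print(rollup_to_month(s1))                   # {('SUP1', 'P10', '2000-01'): 100}

# S2 is assumed to deliver monthly data with US-style dates -> convert the format.
s2 = [("SUP2", "P10", "01/31/2000", 250)]
print([(s, p, us_to_european_date(d), q) for s, p, d, q in s2])
```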


Summary

Introduction

A data warehouse (DW) integrates multiple heterogeneous and distributed data sources (DSs) in order to provide a centralized and unified access to data with the end goal of decision support [1]. The tasks executed in an ETL workflow include, among others: (1) extracting and filtering data from data sources, (2) transforming data into a common data model, (3) cleaning data in order to remove errors and null values, (4) standardizing values, (5) integrating cleaned data into one common consistent data set, (6) removing duplicates, (7) sorting and computing summaries, and (8) loading data into a DW. These tasks can be implemented by means of SQL commands, predefined components, or user-defined functions (UDFs) written in multiple programming languages. The ETL research community has proposed several methods for designing a conceptual model of an ETL workflow, from which a semantically equivalent logical model, a physical implementation, and an optimized run-time version are derived. The goal of this paper is to study and understand the existing techniques for constructing a conceptual and a logical model of an ETL workflow, its corresponding physical implementation, and its optimization, as well as to evaluate them on the basis of the metrics that we propose.
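
The eight tasks listed above can be pictured as a chain of small, composable steps. The following sketch strings toy versions of them together in Python; all function names, the dictionary-based "warehouse", and the toy source rows are hypothetical and serve only to make the task chain concrete (a real ETL workflow would use SQL commands, predefined components, or UDFs, as noted above).

```python
from collections import defaultdict

def extract(source):
    # (1)-(2) Extract rows from a source and filter out rows without a part identifier.
    return [row for row in source if row.get("part_id") is not None]

def clean(rows):
    # (3) Drop rows with null quantities and (4) standardize supplier names.
    out = []
    for row in rows:
        if row.get("quantity") is None:
            continue
        out.append(dict(row, supplier=row["supplier"].strip().upper()))
    return out

def integrate_and_deduplicate(*row_sets):
    # (5) Integrate the cleaned sources into one data set and (6) remove duplicates.
    seen, merged = set(), []
    for rows in row_sets:
        for row in rows:
            key = (row["part_id"], row["supplier"], row["month"])
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged

def summarize(rows):
    # (7) Sort and compute monthly per-supplier quantity summaries.
    totals = defaultdict(int)
    for row in rows:
        totals[(row["supplier"], row["month"])] += row["quantity"]
    return dict(sorted(totals.items()))

def load(summary, warehouse):
    # (8) Load the summarized data into the target store (a plain dict stands in for the DW).
    warehouse.update(summary)

# Toy sources: a duplicate row across S1/S2 and a null quantity in S2.
s1 = [{"part_id": 1, "supplier": " acme ", "month": "2000-01", "quantity": 10}]
s2 = [{"part_id": 1, "supplier": "ACME", "month": "2000-01", "quantity": 10},
      {"part_id": 2, "supplier": "Bolt Co", "month": "2000-01", "quantity": None}]

dw = {}
load(summarize(integrate_and_deduplicate(clean(extract(s1)), clean(extract(s2)))), dw)
print(dw)   # {('ACME', '2000-01'): 10}
```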

Running example
Conceptual model
Graph-based conceptual model
UML-based conceptual model
Ontology-based conceptual model
BPMN-based conceptual model
Summary
Logical model
Graph-based logical model
From conceptual to logical model
Implementation based on reusable templates
Implementation based on BPEL
Optimization of an ETL workflow
Dependency graph for optimizing an ETL workflow
Parallelism for optimizing an ETL workflow
Parallelism in traditional data flow
Parallelism in an ETL workflow
Quality metrics for ETL workflows
Statistics for workflow optimization
Commercial ETL tools
Conclusions
ETL workflow development: summary
ETL workflow optimization: summary
Open Issues
Extendible theoretical ETL Framework
Findings
Future work: our approach to ETL optimization