Abstract

ETL (Extract, Transform, Load) is the industry-standard term for the process of extracting data, transforming it, and loading it into a data warehouse (DW). The ETL process is the most resource-demanding part of a DW implementation and typically has to be evolved and maintained for the lifetime of the DW. To facilitate the development and maintenance of ETL processes, many ETL tools have been developed that feature graphical user interfaces (GUIs) and various built-in functionalities (parallelism, logging, rich transformation libraries, documentation generation, etc.). The downside of such GUI ETL tools is that development relies heavily on mouse operations rather than on writing code, which feels unnatural to some developers, especially when there are many similar, repetitive tasks. In this paper we present an alternative approach: ETLator, an ETL framework based on the Python scripting language, in which ETL tasks are defined by writing Python code. ETLator implements various typical ETL transformations and allows the user to define complex ETL tasks with multiple sources and parallel tasks simply and efficiently while leveraging the full flexibility of Python. ETLator also provides logging and can document ETL tasks by generating data flow images. Using a test case, we show that ETLator simplifies ETL development and rivals the GUI approach.
