Abstract

Service-oriented computing and ETL (Extract-Transform-Load) technology have recently received significant attention as a means of building data warehouses and integrating data. To improve the scheduling and execution efficiency of service-based ETL processes, this paper proposes a distributed scheduling and execution framework for ETL processes, together with a corresponding method. First, ETL processes are assigned different weights so that core business data is loaded efficiently. Second, the scheduler selects executors according to their performance and load, then allocates ETL process execution requests using the greedy balance (GB) algorithm so that load is balanced across the executors. Third, the executor parses each ETL process into ETL services, then selects one or more executors to deploy and execute each ETL service according to a locality-aware strategy, which considers the amount of data involved and the network distance between the nodes the service involves; this reduces network overhead and improves execution efficiency. Finally, the effectiveness of the proposed method is verified by experimental comparison.
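The scheduling and placement steps described above can be illustrated with a minimal Python sketch. The abstract does not specify the GB algorithm or the locality-aware cost function, so everything here is an assumption made for illustration: the names `Executor`, `EtlRequest`, `greedy_balance`, and `pick_by_locality` are hypothetical, "greedy balance" is approximated by a generic least-relative-load heuristic, and locality cost is taken to be data volume multiplied by network distance.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    name: str
    capacity: float               # hypothetical performance score
    load: float = 0.0             # work already assigned
    assigned: list = field(default_factory=list)


@dataclass
class EtlRequest:
    name: str
    weight: int                   # assumption: higher weight = more important
    cost: float                   # assumption: estimated amount of work


def greedy_balance(requests, executors):
    """Assign each request to the executor whose relative load stays lowest.

    Requests are visited in descending weight order so that core business
    processes are placed first (an assumed reading of the weighting step).
    """
    for req in sorted(requests, key=lambda r: (-r.weight, -r.cost)):
        target = min(executors, key=lambda e: (e.load + req.cost) / e.capacity)
        target.load += req.cost
        target.assigned.append(req.name)
    return {e.name: list(e.assigned) for e in executors}


def pick_by_locality(data_by_node, distance, candidates):
    """Pick the candidate node with the lowest assumed transfer cost.

    data_by_node: node name -> amount of data the service reads from it
    distance:     distance[a][b] -> network distance between nodes a and b
    Cost is modeled as sum(data * distance), which is an assumption.
    """
    def transfer_cost(node):
        return sum(size * distance[node][src] for src, size in data_by_node.items())

    return min(candidates, key=transfer_cost)


if __name__ == "__main__":
    executors = [Executor("A", capacity=2.0), Executor("B", capacity=1.0)]
    requests = [
        EtlRequest("log", weight=1, cost=1.0),
        EtlRequest("core", weight=3, cost=4.0),
        EtlRequest("report", weight=1, cost=2.0),
    ]
    print(greedy_balance(requests, executors))

    distance = {"n1": {"n1": 0, "n2": 1}, "n2": {"n1": 1, "n2": 0}}
    print(pick_by_locality({"n1": 100, "n2": 10}, distance, ["n1", "n2"]))
```

In this sketch the heavily weighted request is placed first and lands on the higher-capacity executor, and the service is deployed on the node that already holds most of its input data. The paper's actual algorithms may differ in both respects.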
