Abstract

As big data technology matures, it is used in a growing range of computer engineering projects, and using a data warehouse to exchange and integrate data from different sources is the first step for the work that follows. The core of the entire big data platform is its data. ETL (extract, transform, load) technology is the basis for building a data warehouse [1]. Kettle is a popular open source ETL tool, and such tools can greatly improve production efficiency. ETL task scheduling is the part of Kettle that matters most for performance. At present, ETL task scheduling in Kettle has the following problems. First, as the number of ETL tasks and the volume of data keep growing, single-machine scheduling leaves many tasks unable to run on time, or at all. Second, existing research on Kettle clusters concentrates on deployment rather than on ETL task scheduling algorithms. Finally, Kettle schedules tasks round-robin by default and assumes that all nodes in the cluster have the same performance. As a result, nodes with fewer resources can end up running more tasks while nodes with more resources run fewer, and the resulting load imbalance lengthens task running time and causes delays. This paper studies ETL task scheduling in Kettle and proposes two algorithms to balance load across the scheduling cluster: an initial weighted task scheduling (IWTS) algorithm and a task scheduling algorithm based on threshold optimization ant colony algorithm (TSTOACA).
The initial weighted task scheduling algorithm assigns tasks according to node resource weights when scheduling begins, taking both load balance and smoothness into account. The threshold-based optimized ant colony algorithm is applied when resource usage across the whole cluster exceeds a threshold; it uses an optimized ant colony algorithm to search dynamically for a globally optimal task scheduling strategy that balances load. Compared with Kettle's default scheduling, the adaptive load balancing algorithm proposed in this paper improves efficiency by about 8% to 12%.
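
To illustrate what weight-based scheduling with "smoothness" can look like, here is a minimal sketch of smooth weighted round-robin, a well-known technique that hands out tasks in proportion to node weights while interleaving the picks instead of sending a burst to the heaviest node. It is not the paper's IWTS algorithm, whose details the abstract does not give. The function name, node names, and weights are all hypothetical, and the weights are assumed to be precomputed from node resources.

```python
from collections import Counter


def smooth_weighted_round_robin(weights, n_tasks):
    """Return a list of node names, one per task.

    weights: dict of node name -> positive integer weight
             (assumed here to reflect node resources; hypothetical values).
    """
    current = {node: 0 for node in weights}  # running score per node
    total = sum(weights.values())
    assignments = []
    for _ in range(n_tasks):
        # Every node gains its own weight on each round.
        for node, weight in weights.items():
            current[node] += weight
        # The node with the highest running score receives the task
        # (ties go to the node listed first) ...
        chosen = max(current, key=current.get)
        # ... and is pushed back by the total weight so others get a turn.
        current[chosen] -= total
        assignments.append(chosen)
    return assignments


# Hypothetical cluster: node-a is assumed to have five times the resources.
weights = {"node-a": 5, "node-b": 1, "node-c": 1}
schedule = smooth_weighted_round_robin(weights, 7)
counts = Counter(schedule)
print(schedule)
print(counts)
```

Over one full cycle of seven tasks, each node receives a share equal to its weight, whereas plain round-robin would give every node the same number of tasks regardless of capacity.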
