Abstract

This study aims to increase ETL process efficiency and reduce processing time by applying the Change Data Capture (CDC) method in a distributed system using the Hadoop Distributed File System (HDFS) and Apache Spark, in the data warehouse of the Learning Analytics system of Universitas Indonesia. Usually, an increase in the number of records in the data source results in an increase in ETL processing time for the data warehouse system. This condition results from an inefficient ETL process using the full-load method, in which ETL has to process the same number of records as the data sources contain. The proposed ETL model design, applying the CDC method with HDFS and Apache Spark, reduces the amount of data handled in each ETL run. Consequently, the process becomes more efficient, and ETL processing time is reduced by approximately 53% on average.
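The core idea behind CDC, as contrasted with full load above, can be sketched as follows. This is a minimal, illustrative pure-Python example of snapshot-based change capture, not the paper's HDFS/Spark implementation; the function and data names are assumptions for illustration.

```python
# Sketch of Change Data Capture (CDC): instead of reloading every
# source record (full load), compare the new snapshot against the
# previous one and forward only the changed rows to the ETL process.

def capture_changes(previous, current):
    """Return (inserts, updates, deletes) between two snapshots.

    Both snapshots are dicts mapping a primary key to a record.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = {k: previous[k] for k in previous if k not in current}
    return inserts, updates, deletes


if __name__ == "__main__":
    # Hypothetical warehouse snapshots: key -> (student, score).
    old = {1: ("alice", 90), 2: ("bob", 75), 3: ("carol", 60)}
    new = {1: ("alice", 95), 2: ("bob", 75), 4: ("dave", 80)}
    ins, upd, dele = capture_changes(old, new)
    # Only the 3 changed rows need ETL work, not all 4 source rows.
    print(ins)   # {4: ('dave', 80)}
    print(upd)   # {1: ('alice', 95)}
    print(dele)  # {3: ('carol', 60)}
```

Under full load, every source record would be reprocessed on each run; with CDC, only the delta (inserts, updates, deletes) flows through ETL, which is what yields the reported reduction in processing time.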
