Abstract

Continuously extracting and integrating changed data from heterogeneous systems through an appropriate extraction model is key to data sharing and integration, and to building an incremental data warehouse for analysis. The traditional timestamp-based change data capture method is vulnerable to anomalies during the extraction process, which can cause extraction failures and reduce extraction efficiency. To address these problems, this paper improves the traditional timestamp-based incremental capture model and proposes VTWM, an incremental data extraction model based on variable time windows, built on the idea of deliberately extracting a small number of duplicate records and then removing the duplicates. The model reduces the impact of anomalies on data extraction, improves the reliability of traditional ETL extraction processes, and increases data extraction efficiency.
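The core idea described in the abstract, widening the extraction window so that a few already-loaded records are re-extracted and then collapsed by primary key, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the table names, column names, and overlap value are assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timedelta

# Minimal sketch of "extract a small overlap, then de-duplicate".
# All table and column names here are illustrative assumptions.
OVERLAP = timedelta(seconds=5)  # the time window is widened backwards by this much

def extract_window(conn, last_extracted_at):
    """Pull every source row stamped at or after the widened window start."""
    window_start = (last_extracted_at - OVERLAP).isoformat(sep=" ")
    return conn.execute(
        "SELECT id, payload, updated_at FROM source WHERE updated_at >= ?",
        (window_start,),
    ).fetchall()

def load_deduplicated(conn, rows):
    """Upsert by primary key so re-extracted duplicates collapse to one row."""
    conn.executemany(
        "INSERT INTO target (id, payload, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload, "
        "updated_at = excluded.updated_at",
        rows,
    )
    conn.commit()

# Demo: two extraction runs with overlapping windows still yield one row per id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, payload TEXT, updated_at TEXT)")
conn.execute(
    "CREATE TABLE target (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)
base = datetime(2024, 1, 1, 12, 0, 0)
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(i, f"row-{i}", (base + timedelta(seconds=i)).isoformat(sep=" "))
     for i in range(10)],
)
load_deduplicated(conn, extract_window(conn, base))                          # first run
load_deduplicated(conn, extract_window(conn, base + timedelta(seconds=6)))   # overlapping re-run
count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(count)  # 10 distinct ids despite the second run re-reading earlier rows
```

Because the target upserts on the primary key, re-reading rows inside the overlap is harmless, which is what makes the widened window tolerant of extraction anomalies.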

Highlights

  • In enterprises and government departments, systems built at different times by different development organizations often result in multiple heterogeneous information systems running simultaneously on different hardware and software platforms

  • There are three main approaches to change data capture: the log-based approach [9,10,11,12,13], the trigger-based approach [2,15], and the timestamp-based approach [3,5,16]. A full-table comparison incremental extraction method based on database transaction log files, called the L-C incremental extraction method, is proposed in [10]

  • Incremental timestamp-based data extraction is achieved by maintaining an additional database table to store the time of the last data extraction [5]
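The auxiliary-table pattern mentioned in the last highlight, keeping the time of the most recent extraction in a separate control table, might look like the following sketch. The control table's name, columns, and the job key are hypothetical, not taken from [5].

```python
import sqlite3

# Hypothetical ETL control table recording each job's last extraction time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_control (job TEXT PRIMARY KEY, last_extracted_at TEXT)"
)

def get_last_extracted(conn, job):
    """Read the stored timestamp; fall back to the epoch on the first run."""
    row = conn.execute(
        "SELECT last_extracted_at FROM etl_control WHERE job = ?", (job,)
    ).fetchone()
    return row[0] if row else "1970-01-01 00:00:00"

def set_last_extracted(conn, job, ts):
    """Record the high-water mark after a successful extraction."""
    conn.execute(
        "INSERT INTO etl_control (job, last_extracted_at) VALUES (?, ?) "
        "ON CONFLICT(job) DO UPDATE SET "
        "last_extracted_at = excluded.last_extracted_at",
        (job, ts),
    )
    conn.commit()

set_last_extracted(conn, "orders", "2024-01-01 12:00:00")
print(get_last_extracted(conn, "orders"))   # 2024-01-01 12:00:00
print(get_last_extracted(conn, "invoices")) # 1970-01-01 00:00:00 (never run)
```

Each extraction run reads this high-water mark, selects source rows stamped after it, and writes the new mark back only once the load has committed.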


Summary

Introduction

In enterprises and government departments, systems built at different times by different development organizations often result in multiple heterogeneous information systems running simultaneously on different hardware and software platforms. The main contribution of the VTWM model proposed in this paper is that it alleviates two problems of the traditional model: database rollbacks caused by exceptions during the extraction operation, and the declining efficiency of the de-duplication operation as the data volume of the target table grows. It reduces the impact of anomalies on data extraction while preserving extraction efficiency under the premise of ensuring reliability.

Related Work
Traditional incremental timestamp-based data extraction model
Definitions
Data deduplication
Implementation
Experiment and Analysis
Experimental environment
Comparison and analysis of reliability
Comparison and analysis of time performance
Conclusion

