An Efficient Data Access Approach With Queue and Stack in Optimized Hybrid Join

Omer Aziz, Erum Mehmood, Tayyaba Anees

doi:10.1109/access.2021.3064202

Open Access

https://doi.org/10.1109/access.2021.3064202

Abstract

As rapid decision making in business organizations gains popularity, the complexity and adaptability of the extract, transform, and load (ETL) process of near-real-time data warehousing have increased dramatically. The most important task in a near-real-time data warehouse is to feed in new data from different data sources on a near-real-time basis. However, this new data is not in the format of the data warehouse; therefore, it must be transformed into the required format using transformation algorithms, which are an essential part of the ETL process. A semi-stream join algorithm is required to implement this transformation, and for this purpose the HYBRIDJOIN (hybrid join) algorithm has been presented in the literature. However, a major design issue with this algorithm is that it uses a single buffer to load the disk partitions; therefore, the algorithm has to wait until the next disk partition overwrites the existing partition in the disk buffer. Because loading a disk partition into the disk buffer is the major component of the algorithm's overall processing cost, this leaves the performance of the algorithm sub-optimal. Moreover, existing approaches consider only the oldest key join attributes when finding matches with master data and maintaining the queue of key join attributes. However, performance can be improved if the most recent and the oldest attributes are processed in parallel. This article addresses these limitations of HYBRIDJOIN by presenting two new optimized algorithms: Parallel-Hybrid Join (P-HYBRIDJOIN) and Hybrid Join with Queue and Stack (QaS-HYBRIDJOIN). The proposed algorithms aim to reduce the major processing cost, which is disk I/O, and to increase the number of matching stream tuples. Both algorithms perform significantly better than existing approaches in terms of throughput and number of matching tuples. The performance analysis and cost model for the proposed algorithms show the best performance using intermittent stream data under limited resources.
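To make the queue-and-stack idea concrete, the following is a minimal Python sketch of a semi-stream join in which the key used to select the next disk partition is taken alternately from the oldest end (queue) and the newest end (stack) of the pending join keys. It is an illustration under stated assumptions, not the authors' QaS-HYBRIDJOIN: the function name `semi_stream_join`, the in-memory `master` dictionary standing in for disk-resident master data, the `partition_size` parameter, the strict alternation policy, and the processing of a finite stream in one batch are all hypothetical simplifications. The sketch is sequential, so it does not model the parallel processing or buffering described in the abstract, and it makes no claim about throughput.

```python
from bisect import bisect_left
from collections import deque


def semi_stream_join(stream, master, partition_size=2, use_stack=True):
    """Join (key, payload) stream tuples against master data.

    Hypothetical simplification: `master` is an in-memory dict that stands
    in for disk-resident master data, and a "partition" is a slice of
    `partition_size` consecutive keys in sorted key order.
    Returns (joined_tuples, number_of_partition_loads).
    """
    sorted_keys = sorted(master)
    pending = deque()  # join keys in arrival order: left = oldest, right = newest
    waiting = {}       # join key -> payloads of stream tuples still unmatched

    for key, payload in stream:
        if key not in waiting:
            pending.append(key)
        waiting.setdefault(key, []).append(payload)

    output = []
    loads = 0
    take_oldest = True
    while pending:
        # Queue access reads the oldest key; stack access reads the newest.
        key = pending.popleft() if take_oldest else pending.pop()
        if use_stack:
            take_oldest = not take_oldest
        if key not in waiting:
            continue  # already matched by a partition loaded earlier

        # "Load" the partition that starts at the chosen key.
        loads += 1
        start = bisect_left(sorted_keys, key)
        partition = sorted_keys[start:start + partition_size]

        # Probe the waiting stream tuples with every master key in the partition.
        for master_key in partition:
            for payload in waiting.pop(master_key, []):
                output.append((master_key, payload, master[master_key]))

        # A key with no master record can never match, so drop it.
        waiting.pop(key, None)

    return output, loads


master = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
stream = [(1, "x"), (5, "y"), (2, "z"), (9, "w")]
joined, loads = semi_stream_join(stream, master)
print(joined, loads)
```

Setting `use_stack=False` makes the sketch read only from the oldest end, which corresponds to the queue-only behaviour that the abstract attributes to existing approaches. Both modes produce the same set of joined tuples; only the order in which partitions are loaded differs.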

Highlights

The business world has become a global village, with many companies competing with one another to improve their resource management and business intelligence on the basis of real-time data warehouses (RTDW) [1]–[3]
Decision support systems depend on RTDW, Enterprise Service Bus (ESB) [4] applications, and big data [5]: a repository of large, complex data that can be analyzed for decision making
Business intelligence depends on the latest data warehouse technology, namely the near-real-time data warehouse [8]

Summary

Introduction

The business world has become a global village, with many companies competing with one another to improve their resource management and business intelligence on the basis of real-time data warehouses (RTDW) [1]–[3]. Decision support systems depend on RTDW, Enterprise Service Bus (ESB) [4] applications, and big data [5]: a repository of large, complex data that can be analyzed for decision making. It is difficult for common business applications to process such data [6] on a real-time basis. The main challenge in real-time big data is processing data from different sources, which can cause unexpected and unknown faults in the information system if not handled properly [7]. The first is the Data Sources, which host the data production systems that populate the data warehouse; the second is an intermediate Data Processing Area (DPA), where the cleaning and transformation of the data take place; and the last is the Data Warehouse (DW) to load the transformed

Methods

Results

Conclusion