A partition-based approach to support streaming updates over persistent data in an active datawarehouse

Abhirup Chakraborty, Ajit Singh

doi:10.1109/ipdps.2009.5161064

Abstract

Active warehousing has emerged to meet high user demand for fresh, up-to-date information. Online refreshment of source updates introduces processing and disk overheads in the implementation of warehouse transformations. This paper considers a frequently occurring operator in active warehousing that computes the join between a fast, time-varying or bursty update stream S and a persistent disk relation R, using limited memory. Such a join operation is the crux of a number of common transformations (e.g., surrogate key assignment and duplicate detection) in an active data warehouse. We propose a partition-based join algorithm that minimizes the processing overhead, the disk overhead, and the delay in output tuples. The proposed algorithm exploits the spatio-temporal locality within the update stream, reduces the delay in output tuples by exploiting hot-spots in the range or domain of the joining attributes, and at the same time shares the I/O cost of accessing disk data of relation R over a volume of tuples from update stream S. We present experimental results showing the effectiveness of the proposed algorithm.
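
The idea of sharing the I/O cost of relation R over a volume of stream tuples can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it assumes a simple hash partitioning of the join key, a fixed per-partition batch size, and an in-memory list standing in for the disk relation. The class name `PartitionedStreamJoin` and all of its parameters are hypothetical, and the sketch does not model hot-spot handling or memory limits.

```python
from collections import defaultdict


class PartitionedStreamJoin:
    """Toy stream-relation join that batches stream tuples per partition.

    Illustrative assumption only: hash partitioning and a fixed batch size.
    """

    def __init__(self, relation, num_partitions=4, batch_size=3):
        # `relation` stands in for the disk relation R: a list of (key, payload).
        # Each partition models one contiguous disk segment of R.
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self.r_partitions = defaultdict(list)
        for key, payload in relation:
            self.r_partitions[self._pid(key)].append((key, payload))
        self.buffers = defaultdict(list)  # pending stream tuples per partition
        self.partition_reads = 0          # number of simulated reads of R

    def _pid(self, key):
        return hash(key) % self.num_partitions

    def _flush(self, pid):
        # One simulated read of partition `pid` serves every buffered tuple,
        # so the read cost is amortized over the whole batch.
        self.partition_reads += 1
        index = defaultdict(list)
        for key, payload in self.r_partitions[pid]:
            index[key].append(payload)
        out = [(k, s, r) for k, s in self.buffers[pid] for r in index.get(k, [])]
        self.buffers[pid] = []
        return out

    def insert(self, key, s_payload):
        """Buffer one stream tuple; return join results if its batch is full."""
        pid = self._pid(key)
        self.buffers[pid].append((key, s_payload))
        if len(self.buffers[pid]) >= self.batch_size:
            return self._flush(pid)
        return []

    def drain(self):
        """Flush every non-empty buffer, e.g. at end of stream."""
        out = []
        for pid in list(self.buffers):
            if self.buffers[pid]:
                out.extend(self._flush(pid))
        return out
```

In this sketch a larger `batch_size` lowers the number of reads of R per stream tuple but makes tuples wait longer in the buffer, which is the trade-off between disk overhead and output delay that the abstract refers to.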

Full Text