Abstract

Large data warehouses store interdependent tables that are updated independently in response to business logic changes or the late arrival of critical data. To keep the warehouse consistent, changes to upstream tables must be propagated to downstream tables in a timely fashion. However, a naive change propagation algorithm can trigger many unnecessary updates or recalculations of downstream tables, which drives up the cost of data warehouse management. In this paper, we describe a solution that ensures the eventual consistency of the data warehouse while avoiding unnecessary table updates. We also show that the optimal trade-off between reducing computational cost and meeting data freshness constraints can be found by solving a dynamic programming problem. The proposed solution is currently in production managing the YouTube Data Warehouse, where it has reduced update requests by 25% by eliminating non-trivial duplicates; because each of these requests would otherwise have been executed as a large batch job over big data, eliminating them yields a proportionate reduction in computing resources. A key advantage of our approach is that it can be used in a heterogeneous, distributed data warehouse environment where the operator software may not have complete control over the query processors, because it relies only on dependency information for tables and can operate on the post-state of data sources.
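The abstract does not spell out the dynamic programming formulation. As a minimal illustrative sketch, one way the cost/freshness trade-off can be cast as a dynamic program is to batch pending upstream changes into as few downstream refreshes as possible, subject to each change being propagated within a freshness deadline. The function and parameter names below (min_refreshes, arrival_times, freshness_deadline) and the batching model itself are assumptions for illustration, not the paper's actual algorithm:

    from bisect import bisect_right
    from functools import lru_cache

    def min_refreshes(arrival_times, freshness_deadline):
        """Fewest downstream refreshes such that every upstream change
        is propagated within `freshness_deadline` of its arrival.

        Model (assumed): a refresh at time s incorporates every change
        that arrived at or before s, so a change arriving at t[i] is
        satisfied by any refresh scheduled in [t[i], t[i] + deadline].
        """
        t = sorted(arrival_times)
        n = len(t)

        @lru_cache(maxsize=None)
        def solve(i):
            # Fewest refreshes needed to cover changes i..n-1.
            if i >= n:
                return 0
            # Defer the next refresh as late as change i allows; it then
            # also absorbs every later change arriving before that moment.
            refresh_at = t[i] + freshness_deadline
            nxt = bisect_right(t, refresh_at, lo=i)
            return 1 + solve(nxt)

        return solve(0)

    # Example: changes at hours 0, 1, 2, and 9 with a 3-hour freshness
    # bound need only two refreshes (at hours 3 and 12) instead of four.
    print(min_refreshes([0, 1, 2, 9], 3))  # -> 2

Deferring each refresh to the latest time its earliest uncovered change allows is what lets duplicate update requests be merged, which is the source of the cost savings the abstract reports.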
