Abstract
Maintaining statistics on multidimensional data distributions is crucial for predicting the run-time and result size of queries and data analysis tasks with acceptable accuracy. Applications of such predictions include traditional query optimization, priority management and resource scheduling for data mining tasks, and querying heterogeneous Web data sources of varying information quality. To this end, a plethora of techniques has been proposed for maintaining a compact data “synopsis” on a single table, ranging from variants of histograms to methods based on wavelets and other transforms. However, the fundamental question of how to reconcile the synopses for large information sources with many tables has remained largely unexplored. This paper develops a general framework for reconciling the synopses on many tables, which may come from different information sources. It shows how to compute an optimal combination of synopses for a given workload and a limited amount of available memory. Because the exact solution has high computational complexity, efficient heuristics are presented for limiting the search space of synopsis combinations. Experiments demonstrate the practicality of the approach and the accuracy of the proposed heuristics.