Abstract
Massive data management plays an increasingly important role in data analytics because data access is a major bottleneck. Data skipping is a promising technique for reducing the number of data accesses: it partitions data into pages and accesses only the pages that contain data to be retrieved by a query. Effective data partitioning is therefore required to minimize the number of page accesses. However, obtaining an optimal data partitioning for a given query pattern and data distribution is an NP-hard problem. We propose a framework that uses a multidimensional indexing technique based on a space-filling curve. A space-filling curve determines which portions of the data are stored in the same page, so the problem can be interpreted as selecting a curve that lays out the data accessed by a query in a way that minimizes the number of page accesses. To solve this problem, we analyzed how different space-filling curves affect the number of page accesses. We found that it is critical for a curve to fit the query pattern and to be robust against any data distribution. We propose a cost model for measuring how well a space-filling curve fits a given query pattern and tolerates data skew. We also propose a method for designing a query-aware and skew-tolerant curve for a given query pattern. We prototyped our framework using the resulting query-aware and skew-tolerant curve. In experiments on a skewed data set, we confirmed that our framework can reduce the number of page accesses by an order of magnitude for data warehousing (DWH) and geographic information systems (GIS) applications with real-world data.
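To make the link between curve choice and page accesses concrete, the following is a minimal sketch rather than the framework described above. It assumes two standard stand-in curves (a Z-order curve and a row-major scan) in place of the proposed query-aware curve, a toy 8 x 8 grid of points, and a fixed page size of 4. All function names are hypothetical. Points are sorted along a curve, cut into pages, and a query is charged one access for every page that holds at least one matching point.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]
Rect = Tuple[int, int, int, int]  # (x_lo, y_lo, x_hi, y_hi), bounds inclusive


def z_order_key(p: Point, bits: int = 16) -> int:
    """Z-order (Morton) key: interleave the bits of x and y."""
    x, y = p
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key


def row_major_key(p: Point, bits: int = 16) -> int:
    """Row-major key: order by y first, then by x within a row."""
    x, y = p
    return (y << bits) | x


def build_pages(points: List[Point],
                key: Callable[[Point], int],
                page_size: int) -> List[List[Point]]:
    """Sort points along the curve and cut the order into fixed-size pages."""
    ordered = sorted(points, key=key)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


def pages_accessed(pages: List[List[Point]], query: Rect) -> int:
    """Count pages that contain at least one point inside the query rectangle."""
    x_lo, y_lo, x_hi, y_hi = query
    return sum(
        any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in page)
        for page in pages
    )


if __name__ == "__main__":
    points = [(x, y) for x in range(8) for y in range(8)]
    z_pages = build_pages(points, z_order_key, page_size=4)
    row_pages = build_pages(points, row_major_key, page_size=4)

    square = (0, 0, 1, 1)  # small square query
    stripe = (0, 0, 7, 0)  # full-width, single-row query

    print(pages_accessed(z_pages, square), pages_accessed(row_pages, square))  # 1 2
    print(pages_accessed(z_pages, stripe), pages_accessed(row_pages, stripe))  # 4 2
```

In this toy setting the Z-order layout needs fewer pages for the square query, while the row-major layout needs fewer for the single-row query, which illustrates why a curve has to fit the query pattern. The example uses uniform data, so it says nothing about skew tolerance.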