Abstract

Database partitioning is a fundamental but challenging task in distributed databases: it selects specific columns as the partitioning key for each table and uses that key to allocate the table data across compute nodes so as to maximize performance. However, the problem is NP-hard, and existing distributed databases require users to specify the partitioning keys manually, which can degrade performance. Although reinforcement-learning-based methods have been proposed, they have two main limitations. First, they do not capture complex data distributions and query access patterns, and thus incur high cross-node computation cost when answering a query. Second, they require an expensive step that repeatedly partitions the data across compute nodes in order to train a learned key-selection model, which wastes time and resources. To address these limitations, we propose Grep, a practical learned database partitioning system. We first adopt a graph model to encode data and query features, where vertices are columns, edges are query relations, and the weights of columns are computed based on the localized graph structures (e.g., data diversity, joined columns). We then use graph neural networks to embed the partitioning factors into vectors that capture the data and query correlations. Next, we propose a key-selection model that selects appropriate partitioning keys based on the graph model. Finally, we propose an evaluation model that estimates partitioning performance without actually partitioning the database. We have implemented Grep in a commercial distributed database, and experiments show the effectiveness of our system (e.g., 68% higher throughput for 30K queries in a real banking scenario).
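
To make the graph model concrete, the following is a minimal sketch and not Grep's actual implementation. It assumes columns are named "table.column", that each query relation is an equi-join with an observed frequency, and that "data diversity" can be approximated by a distinct-value ratio. All function names and the sample workload are hypothetical, and a greedy highest-weight choice stands in for the learned GNN embedding and key-selection model described above.

```python
from collections import defaultdict


def build_column_graph(joins):
    """Build an undirected graph: vertices are columns, edges are join relations.

    joins: iterable of (col_a, col_b, frequency) tuples (hypothetical input format).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for a, b, freq in joins:
        graph[a][b] += freq
        graph[b][a] += freq
    return graph


def column_weights(graph, distinct_ratio):
    """Weight each column from its local structure only (an assumed heuristic):
    total join frequency on incident edges times a distinct-value ratio."""
    return {
        col: sum(neighbors.values()) * distinct_ratio.get(col, 1.0)
        for col, neighbors in graph.items()
    }


def pick_keys(weights):
    """Greedy stand-in for a learned key-selection model:
    choose the highest-weight column of each table."""
    best = {}
    for col, weight in weights.items():
        table = col.split(".", 1)[0]
        if table not in best or weight > best[table][1]:
            best[table] = (col, weight)
    return {table: col for table, (col, _) in best.items()}


# Hypothetical workload: join pairs with how often they appear in queries.
joins = [
    ("orders.customer_id", "customers.id", 120),
    ("orders.product_id", "products.id", 40),
    ("customers.region", "regions.name", 5),
]
# Hypothetical distinct-value ratios (distinct values / row count).
distinct_ratio = {
    "orders.customer_id": 0.3,
    "orders.product_id": 0.1,
    "customers.id": 1.0,
    "customers.region": 0.01,
    "products.id": 1.0,
    "regions.name": 1.0,
}

graph = build_column_graph(joins)
weights = column_weights(graph, distinct_ratio)
keys = pick_keys(weights)
print(keys)
```

In this toy workload, orders and customers are both keyed on the columns of their most frequent join, so that join could be answered without moving rows between nodes. A learned model would additionally weigh skew and multi-hop query correlations, which this heuristic ignores.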
