Abstract

We study implementation schemes for GroupBy, an operation widely used in distributed systems and databases. GroupBy partitions a set of out-of-order records into groups. Because of the massive data sizes involved, many I/O-efficient grouping schemes that exploit external memory have been proposed. In this paper, we observe that group sizes in many real datasets follow a power-law distribution, and that the performance of a grouping scheme varies widely with group size: the indexing–filling approach favors data with large groups, while the partitioned hash approach favors data with small groups. Based on this observation, we propose a hybrid approach, PowerHash, which invokes different grouping schemes for different data. Group sizes are estimated approximately with a count-min sketch, so that large groups and small groups can be distinguished from each other. Under a given memory budget, our results show that PowerHash can improve performance by up to six times over existing GroupBy implementations.
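
To make the size-estimation step concrete, here is a minimal sketch of how a count-min sketch could separate records of large groups from records of small groups. It is an illustration under stated assumptions, not the PowerHash implementation: the names `CountMinSketch` and `split_by_group_size`, the `width` and `depth` defaults, the BLAKE2-based hashing, and the fixed `threshold` are all hypothetical choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counters; estimates never fall below the true count."""

    def __init__(self, width=1024, depth=4):
        # width and depth are illustrative defaults, not values from the paper.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slot(self, key, row):
        # One hash function per row, obtained by salting BLAKE2b with the row index.
        digest = hashlib.blake2b(
            repr(key).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key):
        for r in range(self.depth):
            self.rows[r][self._slot(key, r)] += 1

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum is the tightest bound.
        return min(self.rows[r][self._slot(key, r)] for r in range(self.depth))


def split_by_group_size(records, key_of, threshold, sketch=None):
    """Two passes: count keys in the sketch, then route each record by estimated size."""
    records = list(records)
    if sketch is None:
        sketch = CountMinSketch()
    for rec in records:
        sketch.add(key_of(rec))
    big, small = [], []
    for rec in records:
        target = big if sketch.estimate(key_of(rec)) >= threshold else small
        target.append(rec)
    return big, small
```

In a hybrid scheme of the kind the abstract describes, the `big` records would then go to the grouping scheme that favors large groups and the `small` records to the one that favors small groups. Because a count-min sketch can overestimate but never underestimate, a small group may occasionally be routed as large, but a large group is never routed as small.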
