Efficient String Sort with Multi-Character Encoding and Adaptive Sampling

Wen Jin,Aoying Zhou,Weining Qian

doi:10.1145/3448016.3457319

Abstract

Sorting plays a fundamental role in computer science. It has far reaching applications in database operations and data science tasks. An important class of sorting keys are strings and among all string sorting methods, radix sort is a simple but effective algorithm. Many works have been studied to accelerate radix string sort. One typical approach is to process multiple characters in each sorting pass. However, this approach incurs the crucial issue of the radix being too large. To address the problem, we introduce a novel multi-character encoding based method that can significantly reduce the radix. This new encoding scheme takes advantage of the sparse alphabet space usage as well as the sparsity of distinguishing prefixes of the inputs which are commonly seen in real-world datasets. Combining the effective encoding scheme with an adaptive sampling process to generate the encoding efficiently, our proposed sorting algorithm essentially blends radix sort with sample sort and achieves substantial improvement over other sorting approaches. The results on both real datasets and synthetic datasets show that our method yields an average 4.85× performance improvement over C++ STL sort[21], 1.47× improvement over the state-of-the-art Radix Sort on strings implementation[19] and 2.55× over the multikey quicksort[6]. Preliminary tests in a multi-core environment also show it is competitive or better than the most recent parallel string sorting algorithm pS5[8] which demonstrates the scalability of our method.

Full Text