Abstract

Sorting is a fundamental component of many applications, including web indexing engines, geographic information systems, data mining, and database systems. However, even the existing optimal algorithms incur huge costs when the input becomes very large. This paper therefore considers approximate sorting for big data, whose goal is to produce an approximately sorted result while using less CPU and I/O cost. We study approximate sorting, and applications of approximately sorted results, in the I/O model. The quality of an approximate sorting result is usually measured by a distance metric on the permutation space. However, the existing metrics on the permutation space are not applicable to external approximate sorting algorithms. We therefore propose a new kind of metric, named the External metric, which ignores the errors and dislocations that occur within each I/O block. The External version of Spearman's footrule metric is named the External Spearman's footrule metric (ESP metric for short) and is analyzed in this paper. Furthermore, to facilitate a better evaluation of approximately sorted results, we propose a new metric, named errors, which directly states the number of dislocated elements; the corresponding External errors metric is also considered. Then, based on the rate-distortion relationship under these two metrics, we prove lower bounds on both metrics for the external approximate sorting problem with t I/O operations. We propose a k-pass external approximate sorting algorithm, named EASORT, and prove that EASORT is asymptotically optimal under the ESP metric. Finally, we consider applications of approximate sorting results. We propose an index for the approximately sorted result and analyze single and range queries on the approximately sorted result using this index. We also discuss the sort-merge join of two relations in which one or both relations are approximately sorted.
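
To make the four metrics named in the abstract concrete, the following is a minimal Python sketch. The abstract does not give the formal definitions, so the code rests on stated assumptions: keys are distinct, Spearman's footrule is taken as the sum of absolute displacements from the fully sorted order, errors is taken as the count of elements not in their sorted position, and the External variants are modeled by comparing block indices (position divided by a hypothetical `block_size`) so that dislocations inside one I/O block contribute nothing. The function names are illustrative and do not come from the paper.

```python
def _sorted_rank(seq):
    # Map each (distinct) key to its position in the fully sorted order.
    return {v: i for i, v in enumerate(sorted(seq))}


def footrule(seq):
    # Spearman's footrule: sum of absolute displacements from sorted positions.
    rank = _sorted_rank(seq)
    return sum(abs(i - rank[v]) for i, v in enumerate(seq))


def errors(seq):
    # Assumed reading of "errors": number of elements not at their sorted position.
    rank = _sorted_rank(seq)
    return sum(1 for i, v in enumerate(seq) if i != rank[v])


def external_footrule(seq, block_size):
    # Assumed block-level footrule: displacement measured in block indices,
    # so any dislocation inside a single block costs zero.
    rank = _sorted_rank(seq)
    return sum(abs(i // block_size - rank[v] // block_size)
               for i, v in enumerate(seq))


def external_errors(seq, block_size):
    # Assumed block-level errors: number of elements stored in the wrong block.
    rank = _sorted_rank(seq)
    return sum(1 for i, v in enumerate(seq)
               if i // block_size != rank[v] // block_size)


if __name__ == "__main__":
    within_block = [2, 1, 3, 4, 6, 5, 8, 7]   # only swaps inside blocks of size 2
    across_block = [1, 3, 2, 4]               # one swap across a block boundary
    print(footrule(within_block), external_footrule(within_block, 2))
    print(errors(across_block), external_errors(across_block, 2))
```

Under these assumptions, a sequence whose only disorder lies inside individual blocks has a nonzero footrule distance but zero External distance, which matches the abstract's statement that the External metric ignores dislocations within each I/O block.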
