Abstract

In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting $K$ strings of total length $N$ is $\Theta(K \log_2 K + N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size $M$ and the block transfer size is $B$, it would be natural to guess that the I/O complexity of sorting strings is $\Theta\left(\frac{K}{B} \log_{M/B} \frac{K}{B} + \frac{N}{B}\right)$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is $\Theta\left(\frac{N_1}{B} \log_{M/B} \frac{N_1}{B} + K_2 \log_{M/B} K_2 + \frac{N}{B}\right)$, where $N_1$ is the total length of all strings shorter than $B$ and $K_2$ is the number of strings longer than $B$. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.

* Department of Computer Science, Duke University, Durham, NC 27708–0129, USA. Email: large@cs.duke.edu. Supported in part by the U.S. Army Research Office under grant DAAH04–96–1–0013 and by the ESPRIT Long Term Research Programme under project 20244 (ALCOM–IT). Part of this work was done while at BRICS, Dept. of Computer Science, University of Aarhus, Denmark, and while visiting Università di Firenze.
† Dipartimento di Informatica, Università di Pisa, Pisa, Italy. Email: ferragin@di.unipi.it. Supported in part by MURST of Italy.
‡ Dipartimento di Sistemi e Informatica, Università di Firenze, Firenze, Italy. Email: grossi@dsi2.dsi.unifi.it. Part of this work was done while visiting BRICS, University of Aarhus, Denmark.
§ Department of Computer Science, Duke University, Durham, NC 27708–0129, USA. Email: jsv@cs.duke.edu. Supported in part by the U.S. Army Research Office under grants DAAH04–93–G–0076 and DAAH04–96–1–0013 and by the National Science Foundation under grant CCR–9522047.
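
To get a feel for how the two I/O expressions in the abstract differ, they can be evaluated numerically. The following is only a rough sketch and not part of the paper: the function names are hypothetical, the constants hidden by the asymptotic notation are ignored, the logarithm is clamped below at 1 (a common convention, assumed here), and a string of length exactly B is counted as long (also an assumption, since the abstract only says "shorter" and "longer").

```python
import math


def sort_term(x, M, B):
    """x * log_{M/B}(x), with the log clamped below at 1 (assumed convention)."""
    if x <= 0:
        return 0.0
    return x * max(1.0, math.log(x, M / B))


def guessed_bound(K, N, M, B):
    """The 'natural guess': (K/B) log_{M/B}(K/B) + N/B, constants ignored."""
    return sort_term(K / B, M, B) + N / B


def comparison_model_bound(lengths, M, B):
    """(N1/B) log_{M/B}(N1/B) + K2 log_{M/B}(K2) + N/B, constants ignored.

    N1 = total length of strings shorter than B.
    K2 = number of remaining strings (length >= B is an assumption).
    """
    N = sum(lengths)
    N1 = sum(length for length in lengths if length < B)
    K2 = sum(1 for length in lengths if length >= B)
    return sort_term(N1 / B, M, B) + sort_term(K2, M, B) + N / B


if __name__ == "__main__":
    M, B = 2**20, 2**10  # hypothetical memory and block sizes

    # Many short strings: the N1 term is charged per block of characters.
    short = [4] * 2**18
    print(guessed_bound(len(short), sum(short), M, B),
          comparison_model_bound(short, M, B))

    # Fewer long strings: the K2 term is charged per string, not per block.
    long_ = [2048] * 2**10
    print(guessed_bound(len(long_), sum(long_), M, B),
          comparison_model_bound(long_, M, B))
```

With long strings the K2 term has no 1/B factor, which is one way to see why the cost depends on string length relative to the block size.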
