External sorting is very expensive, so the problem of sorting more data items than there are processing elements (PEs) is important. Assume that each PE has m memory locations for storing data items, in addition to some working registers. We adapt Batcher's bitonic sort [1] to process k * N data items, where k ≤ m, on a two-dimensional mesh-connected parallel computer with N PEs. The performance of two algorithms that use different data folding schemes is analyzed and compared.

On a two-dimensional mesh, the location of each PE is uniquely represented by an ordered pair (i, j) of integers. The PE at coordinates (i, j) is generally connected to four other PEs, at (i + 1, j), (i − 1, j), (i, j + 1), and (i, j − 1). Assuming no 'wrap-around' connections, the PEs on the boundary are connected to only two or three other PEs. In this paper we consider an SIMD computer organized as a square, mesh-connected array of N PEs.

A sorting problem on a parallel computer can be defined as reordering data items according to a pre-determined PE indexing. The PE index function can be thought of as a one-to-one mapping from the coordinate space {0, 1, ..., √N − 1} × {0, 1, ..., √N − 1} onto the index space {0, 1, ..., N − 1}. Three PE index functions are discussed in [6]: (1) row-major indexing, (2) shuffled row-major indexing, and (3) snake-like row-major indexing. Examples of these index functions are shown in Fig. 1 for a 4 × 4 mesh.

Batcher's bitonic sort, based on pair-wise compare-and-interchange operations, can be informally represented as a sorting network [4]. As shown in Fig. 2, it requires a total of log N merging stages to sort a sequence of N elements by merging sub-sequences of lengths 1, 2, 4, and so on. (We write log for log₂ throughout this paper.) Two adaptations of the bitonic sort, given in [6,5], require O(√N) time to sort N data items using N PEs on a mesh-connected parallel computer. One maps the bitonic sort relatively straightforwardly onto a computer with the shuffled row-major PE indexing [6]. The other adapts the bitonic sort, with some additional complications, to a computer with the row-major PE indexing [5].

Knuth [4] pointed out that any sorting network for N data items can be generalized to sort k * N data items if the comparison operations are replaced by k-way merge operations. One implementation of Knuth's generalization, using merge-splitting operations, was proposed for several sorting algorithms by
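
As an illustration of the three PE index functions named above, the following Python sketch maps mesh coordinates (i, j) to an index. The function names and the `side` parameter (the mesh side length, √N) are hypothetical and chosen only for this example. The shuffled variant assumes the common definition that interleaves the bits of i and j, and it assumes `side` is a power of two.

```python
def row_major(i, j, side):
    """Index PEs row by row, left to right."""
    return i * side + j


def snake_row_major(i, j, side):
    """Row-major, but odd-numbered rows run right to left."""
    return i * side + (j if i % 2 == 0 else side - 1 - j)


def shuffled_row_major(i, j, side):
    """Interleave the bits of i and j (assumes side is a power of two).

    Bit b of j goes to position 2b and bit b of i to position 2b + 1,
    so the row-major bit string i1 i0 j1 j0 becomes i1 j1 i0 j0.
    """
    bits = side.bit_length() - 1
    index = 0
    for b in range(bits):
        index |= ((j >> b) & 1) << (2 * b)
        index |= ((i >> b) & 1) << (2 * b + 1)
    return index


def index_grid(index_fn, side):
    """Return the side x side table of indices for one index function."""
    return [[index_fn(i, j, side) for j in range(side)] for i in range(side)]


if __name__ == "__main__":
    for fn in (row_major, shuffled_row_major, snake_row_major):
        print(fn.__name__)
        for row in index_grid(fn, 4):
            print(" ".join(f"{v:2d}" for v in row))
```

Each function is a one-to-one mapping from the coordinate space onto {0, 1, ..., N − 1}, which is the only property the definition of the sorting problem requires.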
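
The idea of replacing each comparison with a merge-split can be sketched sequentially as follows. This is a minimal illustration, not the mesh algorithm of the paper: it ignores data folding and inter-PE routing, it models each PE as a Python list holding k items, and it assumes the number of blocks is a power of two. The names `merge_split` and `block_bitonic_sort` are hypothetical.

```python
from heapq import merge


def merge_split(a, b):
    """Merge two sorted blocks of equal length k.

    Returns (the k smallest items, the k largest items), both sorted.
    This plays the role of a compare-and-interchange on whole blocks.
    """
    k = len(a)
    merged = list(merge(a, b))
    return merged[:k], merged[k:]


def block_bitonic_sort(blocks):
    """Sort n blocks of k items each with a bitonic network on blocks.

    Assumes n = len(blocks) is a power of two and all blocks have the
    same length. Returns a new list of blocks whose concatenation is
    sorted in ascending order.
    """
    n = len(blocks)
    # Each simulated PE first sorts its own k items locally.
    blocks = [sorted(block) for block in blocks]
    size = 2
    while size <= n:                  # one merging stage per doubling
        stride = size // 2
        while stride >= 1:            # steps within a merging stage
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    low, high = merge_split(blocks[i], blocks[partner])
                    if (i & size) == 0:   # ascending sub-sequence
                        blocks[i], blocks[partner] = low, high
                    else:                 # descending sub-sequence
                        blocks[i], blocks[partner] = high, low
            stride //= 2
        size *= 2
    return blocks


if __name__ == "__main__":
    data = [[9, 2, 7], [4, 4, 1], [8, 0, 3], [6, 5, 2]]  # n = 4, k = 3
    print(block_bitonic_sort(data))
```

With k = 1 the sketch reduces to the ordinary bitonic sort on N items; for larger k the network of block comparisons is unchanged and only the cost of each step grows, since a step now merges 2k items instead of comparing two.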