Systolic arrays were first introduced by Kung (see, e.g., [2] and [3]) as devices composed of processors of a few different types, which are regularly and locally connected. These processors are activated synchronously by a single clock, which is the only global communication between them. This paper is a continuation of the work presented in [1], where 'folding' was proposed as a technique for the design of systolic arrays. Here we first study the power of folding as a general geometric transformation. Then, as an application, we show that two congruent sequences on a 'regular' grid can be identified by a limited number of foldings (cf. Theorem 3.5). This result can be used in the design of systolic arrays.
Because of the importance of this motivation, we start with an example of a systolic array which has already been briefly described in [1]. Consider Kung and Leiserson's hex-connected processor array for matrix multiplication [3, pp. 276-280], modified in such a way that it applies to dense matrices. Fig. 1 illustrates the case where the dimension $n$ of the matrix is equal to 3: the left-to-right and the right-to-left flows correspond to the two matrices $A$ and $B$ to be multiplied, and the bottom-to-top flow to the product $C = A \times B$. Each node represents an inner product step processor, i.e., a processor computing one step of a scalar product: $s \leftarrow s + ab$ (see Fig. 2). Assume we want to use this array to compute the different powers of a matrix $A$, which basically amounts to computing $A$, $A \times A = A^2$, ..., $A^k \times A^k = A^{2k}$, .... One solution is to iteratively feed the different outputs of step $k$, coming out in the upper part of the array, to both the left and right inputs of step $k + 1$, i.e., to connect each $\gamma_i$ to $\alpha_i$ and $\beta_i$. However, in doing this we would create non-local connections and break the regularity of the layout.
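The feedback scheme just described can be sketched in a few lines of Python. This is only a sequential simulation under stated assumptions: it reproduces the arithmetic (each product is built from inner product steps, and each output is fed back as both inputs of the next step), not the timing or the hexagonal layout of the array. The function names `inner_product_step`, `multiply` and `repeated_squares` are hypothetical and do not come from the paper.

```python
def inner_product_step(s, a, b):
    """One inner product step: s <- s + a*b."""
    return s + a * b


def multiply(A, B):
    """Dense n x n product C = A x B, built only from inner product steps."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0
            for k in range(n):
                s = inner_product_step(s, A[i][k], B[k][j])
            C[i][j] = s
    return C


def repeated_squares(A, steps):
    """Return [A, A^2, A^4, ...]: each output feeds both inputs of the next step."""
    powers = [A]
    for _ in range(steps):
        current = powers[-1]
        # The same matrix enters on the "left" and on the "right" input.
        powers.append(multiply(current, current))
    return powers
```

For instance, `repeated_squares([[1, 1], [0, 1]], 3)` yields four matrices, each one the square of the previous one.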
Instead, we can first fold the array (as one would fold a sheet of paper) along axis 1, the right-hand side coming on top of the left-hand side. The new array computes the same functions as the original one, very much in the same way as a Turing machine with a one-way infinite working tape can simulate a Turing machine with a two-way infinite working tape, by folding the latter. Then, we can fold the new array along axis 2 (the left-hand side on top of the right-hand side) and again along axis 3 (the right-hand side on top of the left-hand side). The processors $\alpha_i$, $\beta_i$, $\gamma_i$ will eventually occupy the same place, and we must connect them to each other, thus introducing only local connections. As a result, the regularity of the initial layout is preserved. The price to pay for it is that the new array consists of up to $8 = 2^3$ more
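To make the folding operation more concrete, here is a minimal sketch on a one-dimensional strip of cells, in the spirit of the Turing tape analogy above. It is a deliberate simplification and not the hexagonal geometry of Fig. 1: the strip, the fold positions and the function name `fold` are all assumptions chosen for illustration. Each fold mirrors one side of the strip onto the other and stacks the moved cells on top, so three successive folds can bring up to $2^3 = 8$ original cells to the same place.

```python
def fold(cells, twice_axis, side):
    """Fold a 1-D strip along the axis x = twice_axis / 2.

    cells maps a position to the stack (bottom to top) of original labels there.
    side="right" brings the cells right of the axis on top of the left part;
    side="left" does the opposite. The axis is given doubled so that it can
    lie between two integer positions without using floats.
    """
    folded = {}
    moved = []
    for x, stack in cells.items():
        moves = 2 * x > twice_axis if side == "right" else 2 * x < twice_axis
        if moves:
            # Mirror the position; flipping a stack reverses its order.
            moved.append((twice_axis - x, stack[::-1]))
        else:
            folded.setdefault(x, []).extend(stack)
    # Moved cells land on top of whatever stayed in place.
    for x, stack in moved:
        folded.setdefault(x, []).extend(stack)
    return folded


# Hypothetical strip of 8 cells, folded three times on alternating sides.
strip = {x: [x] for x in range(8)}
strip = fold(strip, 7, "right")   # axis between 3 and 4
strip = fold(strip, 3, "left")    # axis between 1 and 2
strip = fold(strip, 5, "right")   # axis between 2 and 3
# All 8 original cells now share a single position.
```

Cells that end up on the same position are exactly those that can be linked by local connections after folding, which is the property the construction above relies on.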