In the billion-transistor era, only a few architectural approaches propose new ways to improve the execution of conventional sequential instruction streams. Many legacy applications could profit from processors that speed up sequential applications beyond the performance of current superscalar processors. The Grid arithmetic logic unit (ALU) Processor (GAP) accelerates conventional sequential instruction streams without the need for recompilation. The GAP comprises a processor front-end similar to that of a superscalar processor, extended by a configuration unit, and a two-dimensional array of functional units that forms the execution unit. The configuration unit maps instruction sequences dynamically into the array so that they form the dataflow graph of the sequence. This study evaluates the performance of the GAP architecture with different array dimensions, as well as its performance with a simplified interconnection network. With a complete crossbar interconnect between two array rows, GAP outperforms an out-of-order superscalar processor by up to a factor of 2. Reducing the interconnection network to the minimum costs at most 10% in performance, and only for one particular configuration and a single benchmark. In general, the slowdown is less than 2% for the minimum interconnect (two buses) and about 0.02% if three interconnection buses are used.
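The idea of mapping an instruction sequence into a two-dimensional array so that it forms the sequence's dataflow graph can be illustrated with a minimal sketch. It is not the GAP configuration unit's actual algorithm: the `Instr` type, the `map_to_grid` function, the register names, and the placement policy (earliest row below all producers, first free column) are all assumptions made for illustration, and the sketch models only true read-after-write dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: one destination register, any number of sources."""
    op: str
    dst: str
    srcs: tuple


def map_to_grid(instrs, rows, cols):
    """Place instructions in program order so each sits below all of its producers.

    Returns the grid and the number of instructions that were placed.
    Assumed policy, not taken from the GAP design: first free column per row.
    """
    grid = [[None] * cols for _ in range(rows)]
    producer_row = {}  # register name -> row of the latest instruction writing it
    placed = 0
    for ins in instrs:
        # Earliest row permitted by data dependencies (row 0 if no producer is mapped).
        row = max((producer_row[s] + 1 for s in ins.srcs if s in producer_row),
                  default=0)
        # Move down past rows that have no free functional unit.
        while row < rows and None not in grid[row]:
            row += 1
        if row >= rows:
            break  # array exhausted; the remaining instructions are left unmapped
        col = grid[row].index(None)
        grid[row][col] = ins
        producer_row[ins.dst] = row
        placed += 1
    return grid, placed


if __name__ == "__main__":
    program = [
        Instr("add", "r1", ("r2", "r3")),
        Instr("sub", "r4", ("r1", "r5")),
        Instr("mul", "r6", ("r2", "r2")),
        Instr("or", "r7", ("r4", "r6")),
    ]
    grid, placed = map_to_grid(program, rows=4, cols=2)
    for r, row in enumerate(grid):
        print(r, [i.op if i else "." for i in row])
```

In this sketch, independent instructions (`add` and `mul`) share a row, while dependent ones (`sub`, then `or`) land in successively lower rows, so data moves from one row to the rows below it. That is one way to see why the array dimensions and the interconnect between rows are the parameters the evaluation varies.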