Unroll-and-jam using uniformly generated sets

S Carr, Yiping Guan

doi:10.1109/micro.1997.645832

Abstract

Modern architectural trends in instruction-level parallelism (ILP) are significantly increasing the computational power of microprocessors. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Because of this, compilers need to be concerned not only with finding ILP to utilize machine resources effectively, but also with ensuring that the resulting code has a high degree of cache locality. One compiler transformation that is essential for meeting both objectives is unroll-and-jam, or outer-loop unrolling. Previous work has either used a dependence-based model to compute unroll amounts, significantly increasing the size of the dependence graph, or applied a more brute-force technique. In this paper, we present an algorithm that uses a linear-algebra-based technique to compute unroll amounts. This technique results in an 84% reduction over dependence-based techniques in the total number of dependences needed in our benchmark suite. Additionally, there is no loss in optimization performance over previous techniques, and the solution is more elegant.
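As a rough illustration of the transformation named above, the following sketch applies unroll-and-jam by hand to a small loop nest. It is not the paper's algorithm: the function names are hypothetical, and the unroll amount of 2 is fixed arbitrarily here rather than computed by any technique. The outer loop is unrolled by 2 and the two resulting inner-loop bodies are fused ("jammed"), so each `b[i]` is read once and reused by both copies.

```python
def loop_nest(a, b):
    """Original nest: every a[j] accumulates every b[i]."""
    a = list(a)  # work on a copy so the caller's list is unchanged
    for j in range(len(a)):
        for i in range(len(b)):
            a[j] = a[j] + b[i]
    return a


def loop_nest_unroll_and_jam(a, b):
    """Same computation with the outer j loop unrolled by 2 (assumed
    factor) and the two copies of the inner loop jammed together."""
    a = list(a)
    n = len(a)
    main = n - n % 2  # largest even trip count handled by the unrolled loop
    for j in range(0, main, 2):
        for i in range(len(b)):
            bi = b[i]  # one read of b[i], reused by both unrolled copies
            a[j] = a[j] + bi
            a[j + 1] = a[j + 1] + bi
    # Cleanup loop for the leftover iteration when n is odd.
    for j in range(main, n):
        for i in range(len(b)):
            a[j] = a[j] + b[i]
    return a
```

In the jammed version the inner loop body holds two independent additions, which exposes more ILP, and the reuse of `b[i]` halves the reads of `b`. These are the two effects the abstract attributes to unroll-and-jam.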

Full Text