Recursive algorithms, like quick-sort and recursive data structures, like trees, play a central role in programming. In the context of scientific computing, recursive algorithms and memory layouts are shown here to provide excellent cache and Translation Lookaside Buffer (TLB) locality independently of the platform. We show how, for the first time, generic programming and object-oriented programming allow us to abstract a multitude of dense-matrix memory layouts: from conventional row-major and column-major layouts over Z- and И-Morton orders to block-wise combinations of them. All are provided by a single class that is based on our new matrix abstraction. The algorithmic recursion is supported in generic fashion by classes modelling the new recursator, an analogue of the Standard Template Library iterator. Although this concept supports recursion in general, we focus again on matrix operations. Results are presented for matrix multiplication, on both conventional and tiled representations using both homogeneous and heterogeneous matrix representations. Reaching about 60% peak performance in portable C++ code establishes competitive performance without explicit prefetching and other platform-specific tuning. Comparisons with the manufacturers' libraries show superior locality. These new techniques are embedded in the Matrix Template Library, Version 4 (MTL4).
Read full abstract