Abstract
We developed a methodology to improve the cache behavior and overall performance of sparse linear algebra kernels used in graph analytics. Large-scale graph processing typically performs poorly because it cannot effectively use processor caches, resulting in high latency for memory accesses. This is particularly true in sparse linear algebra formulations of graph algorithms, which use well-known kernels such as sparse matrix-vector multiply and sparse matrix transposition. The proposed methodology partitions the inner loops of linear algebra kernels and executes the iterations over these partitions in an independently ordered manner. This produces more cache-friendly access patterns, which reduce cache misses and improve performance. We first illustrate the technique through an example and then generalize it to multiple scenarios. Our evaluation shows that the execution time of key computational kernels is reduced by 2- to 5-fold when large graphs are considered. Most of the improvement in execution time comes from a significant reduction in cycles-per-instruction (CPI), even as the path length of the executed code increases.
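The abstract does not detail the partitioning scheme, but the general idea of partitioning a kernel's inner loop for cache locality can be sketched with column blocking of CSR sparse matrix-vector multiply. The following is a minimal illustrative sketch of that generic technique, not necessarily the authors' exact method: the columns are split into cache-sized bands so that reads of the dense vector `x` stay within a small window during each pass.

```python
# Illustrative sketch of cache blocking for CSR SpMV (y = A @ x).
# Not the paper's exact scheme: a generic column-banding example,
# where each band keeps accesses to x within a cache-sized window.

def spmv_blocked(indptr, indices, data, x, n_rows, block=4):
    """Compute y = A @ x for a CSR matrix (indptr, indices, data),
    visiting nonzeros one column band at a time."""
    y = [0.0] * n_rows
    n_cols = len(x)
    for start in range(0, n_cols, block):
        end = start + block
        for row in range(n_rows):
            for k in range(indptr[row], indptr[row + 1]):
                col = indices[k]
                if start <= col < end:  # only nonzeros in this band
                    y[row] += data[k] * x[col]
    return y

# Small 3x3 example: A = [[1,0,2],[0,3,0],[4,0,5]], x = [1,1,1]
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_blocked(indptr, indices, data, [1.0, 1.0, 1.0], 3, block=2))
# -> [3.0, 3.0, 9.0]
```

This naive sketch rescans every nonzero once per band; a practical implementation would pre-split the matrix into per-band substructures so each nonzero is touched exactly once while retaining the locality benefit.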