Abstract

Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific and engineering applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix, and various representations and optimization techniques have been proposed to minimize the memory bandwidth bottleneck arising from its irregular memory access pattern. Among recent representation techniques, tensor decomposition is a popular one for very large but sparse matrices. After sparse-tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize for multi-cores and even more demanding for massively parallel architectures such as GPUs. Computational neuroscience algorithms often involve sparse datasets while still performing long-running computations on them. The Linear Fascicle Evaluation (LiFE) application is a popular neuroscience algorithm used for pruning brain connectivity graphs; its datasets are represented using Sparse Tucker Decomposition (STD), a widely used tensor decomposition method. This decomposition leads to multiple indirect array references, making the computation very difficult to optimize on both multi-core and many-core systems. Recent implementations of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling. In this work, we first propose target-independent optimizations for the SpMV operations of LiFE decomposed using the STD technique, followed by target-dependent optimizations for CPU and GPU systems. The target-independent techniques include: (1) standard compiler optimizations to prevent unnecessary and redundant computations, (2) data restructuring techniques to minimize the effects of indirect array accesses, and (3) methods to partition computations among threads to obtain coarse-grained parallelism with low synchronization overhead. We then present target-dependent optimizations for CPUs: (1) efficient synchronization-free thread mapping and (2) utilizing BLAS calls to exploit tuned vendor libraries. Following that, we present various GPU-specific optimizations to map threads optimally at the granularity of warps, thread blocks, and the grid. Furthermore, to automate the CPU-based optimizations developed for this algorithm, we extend the PolyMage domain-specific language, embedded in Python. Our highly optimized and parallelized CPU implementation obtains a speedup of 6.3× over a naive parallel CPU implementation running on a 16-core Intel Xeon Silver (Skylake-based) system. Our optimized GPU implementation achieves a speedup of 5.2× over a reference optimized GPU code version on NVIDIA's GeForce RTX 2080 Ti GPU, and a speedup of 9.7× over our highly optimized and parallelized CPU implementation.
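To make the access pattern concrete, below is a minimal C sketch, not the paper's actual kernel, of the kind of doubly indirect SpMV that arises after a sparse tensor decomposition: each nonzero carries index arrays that are dereferenced to gather the input vector and scatter the output. All names here (spmv_indirect, voxel_idx, fiber_idx, vals, w, y) are hypothetical placeholders, and the data restructuring noted in the comments is one plausible instance of the techniques summarized above.

    /*
     * Sketch of an indirect-access SpMV over a coordinate-style list of
     * nonzeros: y[voxel_idx[k]] += vals[k] * w[fiber_idx[k]].
     * The gather on w and the scatter on y defeat hardware prefetching;
     * restructuring the data so that nonzeros are sorted by voxel_idx
     * makes the writes mostly sequential and lets a coarse-grained
     * partition over voxels run in parallel without atomics.
     */
    #include <stdio.h>

    static void spmv_indirect(int nnz, const int *voxel_idx,
                              const int *fiber_idx, const double *vals,
                              const double *w, double *y)
    {
        for (int k = 0; k < nnz; ++k)
            y[voxel_idx[k]] += vals[k] * w[fiber_idx[k]];
    }

    int main(void)
    {
        /* Tiny synthetic instance: 4 nonzeros, 3 voxels, 2 fibers. */
        int voxel_idx[] = {0, 0, 1, 2};
        int fiber_idx[] = {0, 1, 1, 0};
        double vals[]   = {0.5, 1.0, 2.0, 0.25};
        double w[]      = {1.0, 2.0};
        double y[3]     = {0.0, 0.0, 0.0};

        spmv_indirect(4, voxel_idx, fiber_idx, vals, w, y);
        for (int v = 0; v < 3; ++v)
            printf("y[%d] = %g\n", v, y[v]);
        return 0;
    }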
