Abstract
Sparse matrix–vector multiplication (SpMV) is an important kernel that is widely used in science and engineering applications. Characteristics of SpMV, such as high memory intensity and irregular memory access patterns, cause its performance to be bounded by the limited bandwidth between memory and processing units. Processing in memory (PIM) is an emerging architecture that overcomes this bandwidth bottleneck by shortening the distance between processing elements (PEs) and memory. In this paper, we propose a PIM-style SpMV accelerator based on high-bandwidth memory (HBM). To make full use of the high bandwidth provided by HBM, we design a highly parallel PE array and implement a high-frequency pipeline inside each PE to hide the latency of reading matrix elements from HBM. For each PE, we integrate an L1 cache to exploit the data locality in the vector. We propose two data layout strategies: a row merging algorithm that exploits inter-row data locality, and a row assignment algorithm that balances the workload among PEs. Our design is implemented on a field programmable gate array (FPGA) card with 8 GB of HBM2 memory. Compared to a baseline central processing unit (CPU) SpMV implementation, our accelerator obtains a 5.24x performance speedup on average.