Abstract
Owing to the widespread advancement of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix–vector multiplication in addition to the conventional matrix–matrix multiplication. However, current AI processor architectures are optimized for general matrix–matrix multiplication (GEMM), which causes significant throughput degradation when processing general matrix–vector multiplication (GEMV). In this study, we propose a port-folding GEMV (PF-GEMV) scheme that employs multiformat and low-precision techniques while reusing an outer-product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8×8 processor, resulting in a 7.5× increase in throughput compared with that of the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, a 7× speedup is achieved in single-batch inference.
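The utilization gap the abstract describes follows from the outer-product dataflow itself: each cycle feeds one column of the left matrix and one row of the right matrix to an M×N grid of multiply–accumulate units, so when the right operand degenerates to a single vector, only one output column of the grid does useful work. The sketch below is a minimal NumPy illustration of that baseline dataflow, assuming an int8 MAC array with wide (int32) accumulation; the function name `outer_product_gemm` and the 8×8/int8 parameters are chosen here to mirror the configuration in the abstract, and this is not the paper's PF-GEMV implementation.

```python
import numpy as np

def outer_product_gemm(A, B):
    """GEMM computed as a sum of rank-1 (outer-product) updates,
    the dataflow an outer-product-based GEMM processor maps to hardware."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.int32)  # wide accumulator, as in int8 MAC arrays
    for k in range(K):
        # Step k feeds column k of A and row k of B to an M x N grid of
        # multiply-accumulate units; for full GEMM all M*N units are busy.
        C += np.outer(A[:, k].astype(np.int32), B[k, :].astype(np.int32))
    return C

# GEMV is the N = 1 case: B collapses to a single column, so an 8 x 8
# outer-product array keeps only one of its 8 output columns busy per cycle,
# which is the utilization loss that PF-GEMV targets.
A = np.random.randint(-128, 128, size=(8, 8), dtype=np.int8)
x = np.random.randint(-128, 128, size=(8, 1), dtype=np.int8)
assert np.array_equal(outer_product_gemm(A, x),
                      A.astype(np.int32) @ x.astype(np.int32))
```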