TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory

Jaehyun Park,Eojin Lee,Minsoo Rhu,Sungmin Yun,Jung Ho Ahn,Byeongho Kim

doi:10.1145/3466752.3480080

Abstract

Personalized recommendation systems are gaining significant traction due to their industrial importance. An important building block of recommendation systems consists of the embedding layers, which exhibit a highly memory-intensive characteristic. A fundamental primitive of embedding layers is the embedding vector gathers followed by vector reductions, exhibiting low arithmetic intensity and becoming bottlenecked by the memory throughput. To tackle such a challenge, recent proposals employ a near-data processing (NDP) solution at the DRAM rank-level, achieving impressive performance speedups. We observe that prior rank-level-parallelism-based NDP solutions leave significant performance potential on the table as they do not fully reap the abundant transfer throughput inherent in DRAM datapaths. We propose TRiM, an NDP architecture for accelerating recommendation systems. Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with “in-DRAM” reduction units at the DDR4/5 rank/bank-group/bank level. We modify the interface of DRAM to provide commands effectively to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to a 7.7 × and 3.9 × speedup and reduces by 55% and 50% the energy consumption of the embedding vector gather and reduction over the baseline and the state-of-the-art NDP architecture with minimal area overhead equivalent to 2.66% of DRAM chips.

Full Text