Abstract

Deep learning (DL) model training must overcome the memory bottleneck to continue scaling. Processing-in-memory approaches are a viable solution because they move computation near or into memory, substantially reducing data movement. However, deploying applications on such hardware requires end-to-end software support for efficient computation mapping and scheduling as well as extensible code generation, and existing work has not considered DL training workloads. In this paper, we propose XLA-NDP, a compiler and runtime solution for NDPX, a near-data processing (NDP) architecture integrated with an existing DL training framework. XLA-NDP offloads NDPX kernels and schedules them to overlap with GPU kernels, maximizing parallelism based on GPU and NDPX cost estimates, while providing a template-based code generator with low-level optimizations. Experiments show that XLA-NDP achieves up to a 1.41x speedup (1.24x on average) over a GPU baseline when training four DL models.

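To make the cost-based overlap scheduling idea concrete, the sketch below shows one plausible greedy policy: each kernel is placed on NDPX only if its estimated NDPX finish time does not extend the critical path beyond the GPU-only alternative, so NDPX work runs concurrently with GPU work. This is a minimal illustration written for this summary; the kernel names, cost fields, and the greedy rule are hypothetical and are not taken from the XLA-NDP paper.

    # Hypothetical sketch of cost-based GPU/NDPX offload scheduling.
    # Kernel names, cost values, and the greedy rule are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Kernel:
        name: str
        gpu_cost: float   # estimated runtime on the GPU (ms)
        ndpx_cost: float  # estimated runtime on NDPX (ms)

    def schedule(kernels):
        """Greedily place each kernel on the GPU or NDPX so that NDPX work
        overlaps with GPU work without growing the estimated makespan."""
        gpu_time, ndpx_time = 0.0, 0.0
        placement = {}
        for k in kernels:
            # Offload only if NDPX can finish the kernel no later than
            # the GPU would, given the work already queued on each side.
            if ndpx_time + k.ndpx_cost <= gpu_time + k.gpu_cost:
                placement[k.name] = "NDPX"
                ndpx_time += k.ndpx_cost
            else:
                placement[k.name] = "GPU"
                gpu_time += k.gpu_cost
        return placement, max(gpu_time, ndpx_time)

    if __name__ == "__main__":
        kernels = [Kernel("matmul", 2.0, 6.0),
                   Kernel("embedding_update", 1.5, 1.0),
                   Kernel("optimizer_step", 1.0, 0.8)]
        plan, makespan = schedule(kernels)
        print(plan, makespan)

In this toy example the compute-heavy matmul stays on the GPU, while the memory-bound update kernels are offloaded and overlap with it, which is the kind of parallelism the abstract describes.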