Booster: An Accelerator for Gradient Boosting Decision Trees Training and Inference

Mingxuan He,Mithuna Thottethodi,T N Vijaykumar

doi:10.1109/ipdps53621.2022.00106

Abstract

Recent breakthroughs in machine learning (ML) have sparked hardware innovation for efficient execution of the emerging ML workloads. For instance, due to recent refine-ments and high-performance implementations, well-established gradient boosting decision tree (GBT) models (e.g., XGBoost) have demonstrated their dominance in commercially-important contexts, such as table-based datasets (e.g., relational databases and spreadsheets). Unfortunately, GBT training and inference are time-consuming (e.g., several hours of training for large datasets). Despite their importance, GBTs have not been targeted for hardware acceleration as much as neural networks. We propose Booster, a novel accelerator for GBTs based on their unique characteristics. We observe that the dominant steps of GBT training and inference (accounting for 90-98% of time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., histograms and shallow trees) - i.e., GBT is on-chip memory bandwidth-bound. Unfortunately, existing multicores and GPUs do not support massively-parallel data structure accesses that are irregular and data-dependent. By employing a scalable sea-of-small-SRAMs approach and an SRAM bandwidth-preserving mapping of data record fields to the SRAMs called group-by-field mapping, Booster achieves significantly more parallelism (e.g., 3200-way parallelism) than multicores and GPUs. In addition, Booster employs a redun-dant data representation that significantly lowers the memory bandwidth demand. Our simulations reveal that Booster achieves 11.4x and 6.4x speedups for training, and 45x and 22x (21x and 11x) speedups for offline (online) inference, over an ideal 32-core multicore and an ideal GPU, respectively. Based on ASIC synthesis of FPGA-validated RTL using 45 nm technology, we estimate a Booster chip to occupy 60 mm <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> of area and dissipate 23 W when operating at 1-G Hz clock speed.

Full Text