Decision forests, particularly Gradient Boosted Decision Trees (GBDT), are popular for their high prediction performance and computational efficiency, which makes them suitable for embedded systems with constraints on circuit size and available energy. In this study, we propose a new lightweight GBDT inference acceleration mechanism through hardware and algorithm co-design. First, we present LoADPack, a hardware-friendly GBDT algorithm that enhances memory access locality. LoADPack unifies some nodes and aligns memory access patterns, yielding trees in which the features and thresholds used across the entire ensemble are regular regardless of the branching direction. Furthermore, we present DF-BETA, a resource-efficient accelerator for the LoADPack algorithm. DF-BETA uses MSB-first bit-serial computation to enable early termination of comparisons between 32-bit floating-point numbers, optimizing the operation that determines the branch direction. The hardware complexity and the speed of computation termination vary with the granularity of bit-serial computation. Therefore, we conduct a design space exploration of DF-BETA to identify the optimal configuration. Our findings reveal that using 4-bit-serial comparators minimizes circuit size while achieving the highest throughput. Compared to running unconstrained GBDT on a typical accelerator with 32-bit bit-parallel comparators, our accelerator achieves 1.6 times higher throughput on average while maintaining comparable accuracy.
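To illustrate the idea of MSB-first serial comparison with early termination, the following is a minimal software sketch and not the DF-BETA circuit itself. It assumes the common technique of remapping each 32-bit float to an unsigned key whose integer order matches the floating-point order, then comparing the keys a few bits at a time from the most significant end. The function names `float_to_ordered_key` and `serial_less_equal`, the `digit_bits` parameter, and the `feature <= threshold` branch convention are all hypothetical choices made for this example. NaN and signed-zero handling are ignored.

```python
import struct


def float_to_ordered_key(x: float) -> int:
    """Map a float32 value to a 32-bit unsigned key that sorts like the float.

    Assumption for this sketch: NaN and the -0.0 / +0.0 distinction are ignored.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    if bits & 0x80000000:
        # Negative value: flip every bit so larger magnitudes sort lower.
        return bits ^ 0xFFFFFFFF
    # Non-negative value: set the sign bit so it sorts above all negatives.
    return bits | 0x80000000


def serial_less_equal(feature: float, threshold: float, digit_bits: int = 4):
    """Compare MSB-first, digit_bits at a time, stopping at the first difference.

    Returns (feature <= threshold, number of digits examined).
    digit_bits is assumed to divide 32 evenly.
    """
    a = float_to_ordered_key(feature)
    b = float_to_ordered_key(threshold)
    mask = (1 << digit_bits) - 1
    steps = 0
    for shift in range(32 - digit_bits, -1, -digit_bits):
        steps += 1
        digit_a = (a >> shift) & mask
        digit_b = (b >> shift) & mask
        if digit_a != digit_b:
            # The most significant differing digit decides the comparison.
            return digit_a < digit_b, steps
    # All digits matched, so the two values are equal.
    return True, steps


if __name__ == "__main__":
    for f, t in [(1.5, 2.0), (-1.0, -2.0), (0.5, 0.5)]:
        print(f, t, serial_less_equal(f, t, digit_bits=4))
```

In this model, comparing 1.5 with 2.0 is decided after the first 4-bit digit, whereas two equal values require all eight digits. A larger `digit_bits` lowers the worst-case number of steps but widens the comparison done per step, which mirrors the trade-off between hardware complexity and termination speed described above.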