Abstract

Gradient Boosted Decision Trees (GBDT) is a practical machine learning method that has been widely used in various application fields such as recommendation systems. Optimizing the performance of GBDT on heterogeneous many-core processors exposes several challenges, such as designing an efficient parallelization scheme and mitigating the latency of irregular memory accesses. In this paper, we propose swGBDT, an efficient GBDT implementation on the Sunway processor. In swGBDT, we divide the 64 CPEs in a core group into multiple roles, such as loaders, savers and workers, in order to hide the latency of irregular global memory accesses. In addition, we partition the data at two granularities, blocks and tiles, to better utilize the LDM on each CPE for data caching. Moreover, we utilize register communication for collaboration among CPEs. Our evaluation with representative datasets shows that swGBDT achieves 4.6\(\times \) and 2\(\times \) performance speedup on average compared to the serial implementation on the MPE and parallel XGBoost on CPEs, respectively.

Highlights

  • In recent years, machine learning has gained great popularity as a powerful technique in the field of big data analysis

  • We compare the performance of our swGBDT with a serial implementation on the Management Processing Element (MPE) and a parallel XGBoost [3] on the Computation Processing Elements (CPEs)

  • The serial implementation is a naive implementation of our Gradient Boosted Decision Tree (GBDT) algorithm without using the CPEs

Summary

Introduction

In recent years, machine learning has gained great popularity as a powerful technique in the field of big data analysis. Gradient Boosted Decision Tree (GBDT) [6] is a widely used machine learning technique for analyzing massive data with various features and sophisticated dependencies [17]. GBDT is an ensemble machine learning model that requires training multiple decision trees sequentially. The Sunway SW26010 processor organizes its cores into core groups (CGs), each containing one Management Processing Element (MPE) and 64 Computing Processing Elements (CPEs). The MPE, whose structure is similar to that of mainstream processors, is in charge of task scheduling, while the CPEs are designed for high computing throughput, each with a 16 KB L1 instruction cache and a 64 KB programmable Local Data Memory (LDM). There are two methods for accessing main memory in the CG from the LDM of a CPE: DMA and global load/store (gld/gst). DMA provides much higher bandwidth than gld/gst for contiguous memory accesses. The SW26010 architecture also provides an efficient and reliable register communication mechanism between CPEs within the same row or column, which has even higher bandwidth than DMA.
