Abstract

Gradient boosting decision tree (GBDT) is widely used because of its state-of-the-art performance in academia, industry, and data science competitions. The efficiency of the model is limited by the overwhelming training cost as data volumes surge. A common solution is data reduction by sampling the training data. Popular implementations of GBDT such as XGBoost and LightGBM both support cutting the search space by using only a random subset of features chosen without any prior knowledge, which is ineffective and may cause the model to fail to converge when sampling a high-dimensional feature space at a small sampling rate. To mitigate this problem, we propose a heuristic sampling algorithm, LGBM-CBFS, which samples features based on available prior knowledge in the form of “importance scores” to improve the performance and effectiveness of GBDT. Experimental results indicate that LGBM-CBFS obtains a higher level of model accuracy than uniform sampling without introducing unacceptable time cost in sparse high-dimensional scenarios.

Highlights

  • Gradient boosting decision tree [1] achieves state-of-the-art performance in machine learning and data mining applications ranging from classification and regression to ranking

  • The rapid growth of data volume leads to expensive training cost. The major computation cost of training the GBDT model comes from building a decision tree in each gradient boosting round, in which finding the optimal split point is the most time-consuming operation because it requires scanning the entire training data for each feature (see the sketch after these highlights)

  • We integrate our LGBM-CBFS algorithm into LightGBM and conduct a series of experiments to verify the correctness and effectiveness of the approach. The experimental results show that our algorithm significantly improves effectiveness when sampling features on high-dimensional sparse data, and it works better than uniform sampling without additional heavy computational cost
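
To make the cost argument in the highlights concrete, here is a minimal Python sketch of exact greedy split finding with a standard second-order gain (the pre-histogram baseline that XGBoost and LightGBM accelerate). The data layout, the `reg_lambda` constant, and the synthetic demo at the end are illustrative assumptions, not library internals.

```python
# Minimal sketch of exact greedy split finding for one tree node.
# X, grad, hess layouts and reg_lambda are illustrative assumptions,
# not LightGBM internals: grad/hess are the per-sample first and
# second derivatives of the loss at the current boosting round.
import numpy as np

def best_split(X, grad, hess, reg_lambda=1.0):
    """Return (feature, threshold) with the highest second-order gain
    (XGBoost-style G^2 / (H + lambda) score, constant factors dropped).

    The nested loops are the bottleneck this paper targets: every
    candidate split rescans the sample axis, giving roughly
    O(n_features * n_samples) work per node on top of a sort
    per feature.
    """
    n_samples, n_features = X.shape
    G, H = grad.sum(), hess.sum()
    parent = G * G / (H + reg_lambda)
    best = (None, None, 0.0)            # (feature, threshold, gain)
    for j in range(n_features):         # scan every feature ...
        order = np.argsort(X[:, j])
        g_l = h_l = 0.0
        for r in range(n_samples - 1):  # ... and every sample
            i = order[r]
            g_l += grad[i]
            h_l += hess[i]
            if X[i, j] == X[order[r + 1], j]:
                continue                # no split between equal values
            g_r, h_r = G - g_l, H - h_l
            gain = (g_l**2 / (h_l + reg_lambda)
                    + g_r**2 / (h_r + reg_lambda) - parent)
            if gain > best[2]:
                best = (j, (X[i, j] + X[order[r + 1], j]) / 2, gain)
    return best[0], best[1]

# Tiny demo with synthetic gradients (squared error: constant Hessian).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
grad = rng.normal(size=500)
hess = np.ones(500)
print(best_split(X, grad, hess))
```

The two nested loops are the point: every node rescans all samples for every feature, which is exactly the per-round cost that feature sampling attacks.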


Summary

Introduction

Gradient boosting decision tree [1] achieves state-of-the-art performance in machine learning and data mining applications ranging from classification and regression to ranking. Two open-source projects, XGBoost [2] and LightGBM [3], have been widely adopted for their superiority in machine learning and data analytic applications; both exploit histogram-based data partitioning to accelerate decision tree building. Even so, both still suffer from performance degradation in large-scale learning tasks under limited time and economic budgets. The representative open-source implementations of GBDT such as XGBoost and LightGBM use uniform sampling to choose a subset of features, which is a simple and effective way to reduce the dimensionality of the feature space at low time cost. Scores-based nonuniform sampling [12, 13] can obtain further performance improvements by picking a representative subset of features according to a metric that indicates the importance of each feature, at the cost of the additional time needed to compute the scores. We integrate our LGBM-CBFS algorithm into LightGBM and conduct a series of experiments to verify the correctness and effectiveness of the approach. The experimental results show that our algorithm significantly improves effectiveness when sampling features on high-dimensional sparse data, and it works better than uniform sampling without additional heavy computational cost.
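
As a concrete illustration of the difference between the two schemes discussed above, the following minimal Python sketch contrasts uniform feature sampling with importance-score-proportional sampling on synthetic sparse data. The function names, the synthetic scores, and the without-replacement draw are illustrative assumptions, not the paper's exact LGBM-CBFS procedure.

```python
# Minimal sketch contrasting the two feature-sampling schemes.
# The synthetic importance scores and function names are assumptions
# for illustration, not the paper's exact LGBM-CBFS procedure.
import numpy as np

rng = np.random.default_rng(0)

def uniform_feature_sample(n_features, rate):
    """Uniform sampling as in stock XGBoost/LightGBM feature
    subsampling: every feature is equally likely to be kept."""
    k = max(1, int(n_features * rate))
    return rng.choice(n_features, size=k, replace=False)

def scores_based_feature_sample(scores, rate):
    """Nonuniform sampling: each feature is drawn with probability
    proportional to its importance score, so informative features
    survive even at small sampling rates."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(len(scores) * rate))
    return rng.choice(len(scores), size=k, replace=False,
                      p=scores / scores.sum())

# Sparse high-dimensional setting: 10,000 features, 50 informative.
n_features, rate = 10_000, 0.01
scores = np.full(n_features, 1e-3)    # near-zero scores for noise
scores[:50] = 1.0                     # high scores for signal
uni = uniform_feature_sample(n_features, rate)
cbs = scores_based_feature_sample(scores, rate)
print("informative features kept (uniform):     ", np.sum(uni < 50))
print("informative features kept (scores-based):", np.sum(cbs < 50))
```

At a 1% sampling rate over 10,000 features, the uniform draw keeps on average well under one of the 50 informative features, while the scores-based draw retains most of them; this is the intuition behind why importance-guided sampling can converge where uniform sampling may not.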

Review of GBDT
Sampling Schemes in Ensemble Learning
Contribution-Based Feature Sampling Algorithm
Importance Scores in High-Dimensional Data
LGBM-CBFS: LightGBM with Contribution-Based Feature Sampling
Experiments
Accuracy Evaluation
Efficiency Evaluation
Findings
Conclusions
