Abstract

Gradient boosting decision tree (GBDT) is widely used because of its state-of-the-art performance in academia, industry, and data science competitions. The efficiency of the model is limited by the overwhelming training cost as data volumes surge. A common solution is data reduction by sampling the training data. Popular implementations of GBDT such as XGBoost and LightGBM both support cutting the search space by using only a random subset of features chosen without any prior knowledge, which is ineffective and may cause the model to fail to converge when sampling a high-dimensional feature space at a small sampling rate. To mitigate this problem, we propose a heuristic sampling algorithm, LGBM-CBFS, which samples features based on available prior knowledge in the form of “importance scores” to improve the performance and effectiveness of GBDT. Experimental results indicate that LGBM-CBFS obtains a higher level of model accuracy than uniform sampling without introducing unacceptable time cost in sparse high-dimensional scenarios.

Highlights

  • Gradient boosting decision tree [1] achieves state-of-the-art performance in machine learning and data mining applications ranging from classification and regression to ranking

  • The rapid growth of data volume leads to expensive training cost. The major computation cost of training the GBDT model comes from building a decision tree in each gradient boosting round, in which finding the optimal split point is the most time-consuming operation because it requires scanning the entire training data for each feature (see the sketch after these highlights)

  • We integrate our LGBM-CBFS algorithm into LightGBM and conduct a series of experiments to verify the correctness and effectiveness of the approach. The experimental results show that our algorithm significantly improves effectiveness when sampling features on high-dimensional sparse data, and it works better than uniform sampling without additional heavy computational cost
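
To make the cost argument in the highlights concrete, here is a minimal Python sketch of exact greedy split finding with a standard second-order gain (the pre-histogram baseline that XGBoost and LightGBM accelerate). The data layout, the `reg_lambda` constant, and the synthetic demo at the end are illustrative assumptions, not library internals.

```python
# Minimal sketch of exact greedy split finding for one tree node.
# X, grad, hess layouts and reg_lambda are illustrative assumptions,
# not LightGBM internals: grad/hess are the per-sample first and
# second derivatives of the loss at the current boosting round.
import numpy as np

def best_split(X, grad, hess, reg_lambda=1.0):
    """Return (feature, threshold) with the highest second-order gain
    (XGBoost-style G^2 / (H + lambda) score, constant factors dropped).

    The nested loops are the bottleneck this paper targets: every
    candidate split rescans the sample axis, giving roughly
    O(n_features * n_samples) work per node on top of a sort
    per feature.
    """
    n_samples, n_features = X.shape
    G, H = grad.sum(), hess.sum()
    parent = G * G / (H + reg_lambda)
    best = (None, None, 0.0)            # (feature, threshold, gain)
    for j in range(n_features):         # scan every feature ...
        order = np.argsort(X[:, j])
        g_l = h_l = 0.0
        for r in range(n_samples - 1):  # ... and every sample
            i = order[r]
            g_l += grad[i]
            h_l += hess[i]
            if X[i, j] == X[order[r + 1], j]:
                continue                # no split between equal values
            g_r, h_r = G - g_l, H - h_l
            gain = (g_l**2 / (h_l + reg_lambda)
                    + g_r**2 / (h_r + reg_lambda) - parent)
            if gain > best[2]:
                best = (j, (X[i, j] + X[order[r + 1], j]) / 2, gain)
    return best[0], best[1]

# Tiny demo with synthetic gradients (squared error: constant Hessian).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
grad = rng.normal(size=500)
hess = np.ones(500)
print(best_split(X, grad, hess))
```

The two nested loops are the point: every node rescans all samples for every feature, which is exactly the per-round cost that feature sampling attacks.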


Summary

Introduction

Gradient boosting decision tree [1] achieves state-of-the-art performance in machine learning and data mining applications ranging from classification and regression to ranking. Two open-source projects, XGBoost [2] and LightGBM [3], have been widely adopted for their superiority in machine learning and data analytic applications; both exploit histogram-based data partitioning to accelerate decision tree building. Even so, both still suffer from performance degradation in large-scale learning tasks under limited time and economic budgets. The representative open-source implementations of GBDT such as XGBoost and LightGBM use uniform sampling to choose a subset of features, which is a simple and effective way to reduce the dimensionality of the feature space at low time cost. Scores-based nonuniform sampling [12, 13] can obtain further performance improvements by picking a representative subset of features according to a metric that indicates the importance of each feature, at the cost of the additional time needed to compute the scores. We integrate our LGBM-CBFS algorithm into LightGBM and conduct a series of experiments to verify the correctness and effectiveness of the approach. The experimental results show that our algorithm significantly improves effectiveness when sampling features on high-dimensional sparse data, and it works better than uniform sampling without additional heavy computational cost.
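
As a concrete illustration of the difference between the two schemes discussed above, the following minimal Python sketch contrasts uniform feature sampling with importance-score-proportional sampling on synthetic sparse data. The function names, the synthetic scores, and the without-replacement draw are illustrative assumptions, not the paper's exact LGBM-CBFS procedure.

```python
# Minimal sketch contrasting the two feature-sampling schemes.
# The synthetic importance scores and function names are assumptions
# for illustration, not the paper's exact LGBM-CBFS procedure.
import numpy as np

rng = np.random.default_rng(0)

def uniform_feature_sample(n_features, rate):
    """Uniform sampling as in stock XGBoost/LightGBM feature
    subsampling: every feature is equally likely to be kept."""
    k = max(1, int(n_features * rate))
    return rng.choice(n_features, size=k, replace=False)

def scores_based_feature_sample(scores, rate):
    """Nonuniform sampling: each feature is drawn with probability
    proportional to its importance score, so informative features
    survive even at small sampling rates."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(len(scores) * rate))
    return rng.choice(len(scores), size=k, replace=False,
                      p=scores / scores.sum())

# Sparse high-dimensional setting: 10,000 features, 50 informative.
n_features, rate = 10_000, 0.01
scores = np.full(n_features, 1e-3)    # near-zero scores for noise
scores[:50] = 1.0                     # high scores for signal
uni = uniform_feature_sample(n_features, rate)
cbs = scores_based_feature_sample(scores, rate)
print("informative features kept (uniform):     ", np.sum(uni < 50))
print("informative features kept (scores-based):", np.sum(cbs < 50))
```

At a 1% sampling rate over 10,000 features, the uniform draw keeps on average well under one of the 50 informative features, while the scores-based draw retains most of them; this is the intuition behind why importance-guided sampling can converge where uniform sampling may not.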

Review of GBDT
Sampling Schemes in Ensemble Learning
Contribution-Based Feature Sampling Algorithm
Importance Scores in High-Dimensional Data
LGBM-CBFS: LightGBM with Contribution-Based Feature Sampling
Experiments
Accuracy Evaluation
Efficiency Evaluation
Findings
Conclusions
