Abstract

This paper compares the dimensionality reduction effects of LightGBM and XGBoost-FA. Compared with XGBoost, LightGBM has a built-in dimensionality reduction effect through its Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) algorithms, while XGBoost coupled with the traditional dimensionality reduction tool Factor Analysis (XGBoost-FA) may also achieve dimensionality reduction. For the empirical comparison, the prediagnosis dataset from the 2018 Kaggle competition Acute Liver Failure is chosen as the research object, and pairwise comparisons are conducted among XGBoost, LightGBM, XGBoost-FA and LightGBM-FA. On the test set, the (accuracy, log loss, training time) vectors of these four prediagnostic models are (0.75014, 0.569707, 10.5 s), (0.75811, 0.576059, 15.1 s), (0.67786, 0.663924, 5.7 s) and (0.67274, 0.676019, 4.1 s), respectively. The training time of XGBoost-FA (external dimensionality reduction) is found to be shorter than that of LightGBM (built-in dimensionality reduction). Considering the (accuracy, training time) of (0.82, 3.1 s) published on Kaggle, that algorithm (denoted K2a) outperforms the four models above in both training time and accuracy; however, K2a removes more than 50% of the samples owing to missing values and only performs binary classification. For multi-class classification or data with many missing values, XGBoost-FA is suggested when shorter training time is required, while LightGBM is preferred when higher predictive accuracy is required; the two models complement each other. With XGBoost-FA or LightGBM employed in AI medical services, doctors become more productive in diagnosis and treatment thanks to greater data support and a lighter workload.
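As a purely illustrative sketch (not the authors' code), the pairwise comparison described above could be set up in Python roughly as follows. The objects X_train, X_test, y_train, y_test, the choice of 16 common factors, and all hyperparameters are assumptions for illustration, not the paper's exact configuration.

    # Hedged sketch of the pairwise comparison among XGBoost, LightGBM,
    # XGBoost-FA and LightGBM-FA. Assumes numeric feature matrices X_train,
    # X_test and labels y_train, y_test already exist; hyperparameters are
    # illustrative, not the paper's settings.
    import time
    import xgboost as xgb
    import lightgbm as lgb
    from sklearn.decomposition import FactorAnalysis
    from sklearn.metrics import accuracy_score, log_loss

    def evaluate(model, X_tr, X_te, y_tr, y_te):
        start = time.time()
        model.fit(X_tr, y_tr)
        train_time = time.time() - start                # training time in seconds
        acc = accuracy_score(y_te, model.predict(X_te))
        ll = log_loss(y_te, model.predict_proba(X_te))
        return acc, ll, train_time

    results = {
        "XGBoost":  evaluate(xgb.XGBClassifier(n_estimators=200),  X_train, X_test, y_train, y_test),
        "LightGBM": evaluate(lgb.LGBMClassifier(n_estimators=200), X_train, X_test, y_train, y_test),
    }

    # External dimensionality reduction: fit factor analysis on the training
    # features only, then feed the common factors to both boosters.
    fa = FactorAnalysis(n_components=16, random_state=0).fit(X_train)
    X_train_fa, X_test_fa = fa.transform(X_train), fa.transform(X_test)
    results["XGBoost-FA"]  = evaluate(xgb.XGBClassifier(n_estimators=200),  X_train_fa, X_test_fa, y_train, y_test)
    results["LightGBM-FA"] = evaluate(lgb.LGBMClassifier(n_estimators=200), X_train_fa, X_test_fa, y_train, y_test)

    for name, (acc, ll, t) in results.items():
        print(f"{name}: accuracy={acc:.5f}, log loss={ll:.6f}, training time={t:.1f}s")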

Highlights

  • Based on the Gradient Boosting Decision Tree (GBDT), XGBoost and LightGBM have both been popular, cutting-edge ensemble boosting algorithms in machine learning in recent years

  • Zhang et al. compared XGBoost with Artificial Neural Networks (ANN) and Random Forest (RF) in terms of log loss and training time on an Acute Liver Failure dataset

  • Dimensionality reduction results of the Acute Liver Failure data based on factor analysis: the dataset contains as many as 29 feature variables, which factor analysis condenses into 16 common factors (see the sketch after this list)
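As a hedged illustration of this step (not the authors' exact procedure), factor analysis can condense the 29 numeric feature columns into 16 common factors, for example with scikit-learn; the DataFrame X and the varimax rotation are assumptions.

    # Illustrative factor-analysis step: reduce the 29 feature variables to
    # 16 common factors. `X` is assumed to be a pandas DataFrame holding the
    # cleaned, numeric Acute Liver Failure features (an assumption, not the
    # paper's exact preprocessing).
    import pandas as pd
    from sklearn.decomposition import FactorAnalysis

    fa = FactorAnalysis(n_components=16, rotation="varimax", random_state=0)
    factors = fa.fit_transform(X)                       # shape: (n_samples, 16)
    factor_df = pd.DataFrame(factors,
                             columns=[f"FACT{i+1}" for i in range(16)])
    loadings = pd.DataFrame(fa.components_.T,           # shape: (29, 16)
                            index=X.columns,
                            columns=factor_df.columns)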


Summary

INTRODUCTION

From Kaggle’s public datasets, this paper chooses one that contains 8785 samples of 8785 adult patients collected by the JPAC Health Diagnostics and Control Center in 2014-2015. Based on the characteristic variables of a sample, this paper can predict whether the patient suffers from acute liver failure, which helps doctors with prediagnosis before medical testing. The first node in the tree indicates that FACT1 is the optimal segmentation variable among the 16 common factors in the current data subset, and the corresponding optimal segmentation point is −1.164. Based on the prediagnosis model of this paper, the common factor FACT1 is an important indicator for identifying acute liver failure (Section VI); it is a fusion of the three original variables Weight, Waist and Obesity, each of which has a positive impact on it. There remains a lack of comprehensive and unified standards for issues such as defects in AI diagnosis and the judgment basis for medical negligence.
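A minimal sketch of how this split and the factor composition could be inspected on a fitted model follows. Here `xgb_fa` is assumed to be an XGBClassifier trained on a DataFrame whose columns are FACT1 ... FACT16, and `loadings` is the loading matrix from the factor-analysis sketch above; these names are hypothetical, and the split point of −1.164 is the paper's reported result, not something this sketch guarantees to reproduce.

    # Inspect the root split of the first boosted tree and the loadings of
    # the factor it uses. `xgb_fa` and `loadings` are assumed objects from
    # the sketches above, not the authors' code.
    trees = xgb_fa.get_booster().trees_to_dataframe()
    root = trees[(trees["Tree"] == 0) & (trees["Node"] == 0)]
    print(root[["Feature", "Split"]])    # the paper reports FACT1 split at -1.164

    # Original variables that load most heavily on FACT1 (Weight, Waist and
    # Obesity in the paper)
    print(loadings["FACT1"].sort_values(ascending=False).head(3))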

THE BASIC PRINCIPLE OF XGBoost
THE BASIC PRINCIPLE OF LightGBM
THE ANALYSIS IDEAS AND FRAMEWORK OF THIS PAPER
CONCLUSION
