Abstract

Recursive partitioning methods producing tree-like models are a long-standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree-building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. We propose a framework for splitting that uses leave-one-out (LOO) cross-validation (CV) to select the splitting variable, and then performs a regular split (in our case, following CART's approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulations and real-data analyses that our splitting approach significantly improves the performance of both single-tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which, under reasonable assumptions, does not substantially increase the overall computational complexity compared to CART for two-class classification.
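To make the proposed selection step concrete, the sketch below illustrates the basic idea for numeric features in two-class (0/1) classification: for each candidate variable, each observation is held out in turn, the best threshold split is fit on the remaining data, and the held-out observation is classified by the child node it falls into; the variable with the lowest LOO misclassification count is selected. All names here are hypothetical, and this naive version is deliberately simple; it omits the categorical-variable handling and the efficient algorithm that the paper describes, so its cost is roughly O(p n^2 log n) rather than near-CART complexity.

```python
import numpy as np

def split_predict(x_train, y_train, x_test):
    """Fit the best misclassification-minimizing threshold split on one
    numeric feature, then predict x_test by the majority class of the
    child node it falls into. Labels are assumed to be 0/1."""
    order = np.argsort(x_train)
    xs, ys = x_train[order], y_train[order]
    n = len(ys)
    ones = np.cumsum(ys)            # count of class-1 labels up to each position
    total = ones[-1]
    best_err, best_cut = n + 1, None
    for i in range(1, n):           # candidate cut between positions i-1 and i
        if xs[i] == xs[i - 1]:
            continue                # identical values cannot be separated
        l1 = ones[i - 1]            # class-1 count in the left child
        l0 = i - l1                 # class-0 count in the left child
        r1 = total - l1
        r0 = (n - i) - r1
        err = min(l1, l0) + min(r1, r0)   # majority-vote error in both children
        if err < best_err:
            best_err = err
            best_cut = 0.5 * (xs[i - 1] + xs[i])
            left_label, right_label = int(l1 > l0), int(r1 > r0)
    if best_cut is None:            # constant feature: fall back to majority vote
        return int(total > n - total)
    return left_label if x_test <= best_cut else right_label

def loo_select_variable(X, y):
    """Return the column index of X with the lowest LOO-CV
    misclassification count, i.e. the LOO-selected splitting variable."""
    n, p = X.shape
    loo_errors = np.zeros(p, dtype=int)
    for j in range(p):
        for i in range(n):
            mask = np.arange(n) != i            # leave observation i out
            pred = split_predict(X[mask, j], y[mask], X[i, j])
            loo_errors[j] += int(pred != y[i])
    return int(np.argmin(loo_errors))
```

Once the variable is selected this way, the node is then split on that variable with an ordinary CART-style search; the LOO step only governs which variable is eligible to split, which is what prevents high-cardinality categorical variables from being favored spuriously.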
