Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

Florian Pargent, Florian Pfisterer, Bernd Bischl, Janek Thomas

doi:10.1007/s00180-022-01207-6

Open Access

https://doi.org/10.1007/s00180-022-01207-6

Abstract

Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. High cardinality features, i.e. unordered categorical predictor variables with a high number of levels, are a common problem. We study techniques that yield numeric representations of categorical variables, which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm’s predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary classification, and multiclass classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditional, widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or that reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
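
As a rough illustration of what a regularized target encoding can look like, the sketch below replaces each level with a shrunken mean of the training targets for that level. It is only one simple variant (additive smoothing toward the global mean) and is an assumption made for illustration: the abstract does not say which form of regularization the study used. The function names and the smoothing weight `m` are hypothetical.

```python
from collections import defaultdict


def fit_smoothed_target_encoding(levels, targets, m=10.0):
    """Learn a level -> number mapping from training data.

    Each level is mapped to (sum_of_targets + m * global_mean) / (count + m),
    so rare levels are pulled toward the global mean. `m` is a hypothetical
    smoothing weight, not a parameter taken from the paper.
    """
    if len(levels) != len(targets):
        raise ValueError("levels and targets must have the same length")
    if len(levels) == 0:
        raise ValueError("training data must not be empty")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1

    mapping = {
        level: (sums[level] + m * global_mean) / (counts[level] + m)
        for level in counts
    }
    return mapping, global_mean


def transform(levels, mapping, global_mean):
    """Encode levels; levels unseen during training fall back to the global mean."""
    return [mapping.get(level, global_mean) for level in levels]


# Toy training data: one categorical feature and a binary target.
train_levels = ["a", "a", "a", "b", "c", "c"]
train_targets = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

mapping, global_mean = fit_smoothed_target_encoding(train_levels, train_targets, m=2.0)
encoded = transform(["a", "b", "z"], mapping, global_mean)
print(encoded)
```

In this toy example, level `b` occurs once with target 1.0, so an unregularized encoding would map it to 1.0; with `m=2.0` it is shrunk to 2/3. The mapping is fitted on training data only and then applied to new data, so that target values from evaluation data do not leak into the feature.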

Journal: Computational Statistics
Publication Date: Mar 4, 2022
Citations: 48
License type: open-access
