Abstract

Most datasets used in machine learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded before they can be used in supervised learning algorithms. Although several encoding techniques exist, the most commonly used ones, when applied inappropriately, do not necessarily preserve the patterns embedded in the data. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, we present a comparative study that measures how different encoding techniques affect the performance of machine learning models. We test 10 encoding methods with 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach based on synthetically created datasets, which allows us to know a priori the relationship between the independent and dependent variables and thus measure the impact of the encoding techniques more precisely. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is easier to control and faster when performing experiments on categorical encoders.
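As a minimal sketch of the kind of encoding choice the study examines (not taken from the paper; the toy categories are hypothetical), the snippet below contrasts two common numeric encodings of a categorical column. Ordinal encoding imposes an arbitrary order on the categories that the data may not actually have, while one-hot encoding avoids that order at the cost of extra dimensions:

```python
# Hypothetical categorical column (not from the paper's datasets).
colors = ["red", "green", "blue", "green"]

# Ordinal encoding: each category becomes an integer index, which
# imposes an arbitrary order (blue < green < red) on the values.
levels = sorted(set(colors))                 # ['blue', 'green', 'red']
ordinal = [levels.index(c) for c in colors]  # [2, 1, 0, 1]

# One-hot encoding: one binary indicator per category; no spurious
# order, but the representation is higher-dimensional.
one_hot = [[int(c == level) for level in levels] for c in colors]

print(ordinal)     # [2, 1, 0, 1]
print(one_hot[0])  # 'red' -> [0, 0, 1]
```

A distance-based model would treat ordinally encoded "red" and "blue" as far apart and "green" as between them, a pattern that exists only as an artifact of the encoding; this is the kind of encoder-induced effect the comparative study quantifies.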
