Improving classification performance of extreme gradient boosting on small-sized dataset to classify Turkish and Italian wines along with elemental profiling by inductively coupled plasma-mass spectrometry

Hande Alp, Orkun Alp

doi:10.1080/00387010.2021.2008977

Abstract

In this study, the classification performance of the extreme gradient boosting algorithm on a small dataset was improved by training it on a synthetic dataset generated with kernel density estimation. The concentrations of 29 elements in wine samples produced in Turkey (domestic) and Italy (imported) were determined by inductively coupled plasma-mass spectrometry, and the results were used to build the dataset. Classification of the wine samples was first assessed with extreme gradient boosting, which is known to overfit on small datasets, and this resulted in poor classification performance. To improve performance, a synthetic dataset was created and the algorithm was trained on it instead of the original dataset. With the proposed method, the accuracy of the model improved from 76.7% to 81.7%. The precision for Turkish and Italian wines increased from 78.4% to 84.1% and from 70.9% to 79.4%, respectively. The variable importance determined by the extreme gradient boosting algorithm showed that beryllium and cesium were significantly more important than the other elements, followed by tin, phosphorus, cobalt, lead, calcium, copper, zinc, and aluminum, which together make up the top 10 elements for classifying Turkish and Italian wine samples.
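The synthetic-data step described above can be illustrated with a minimal sketch. The abstract does not state the kernel, the bandwidth selection, or the software used, so the Gaussian product kernel, the per-feature rule-of-thumb bandwidth, the function names, and the toy values below are all assumptions made for illustration and are not the authors' implementation. Sampling from a Gaussian kernel density estimate amounts to picking an observed sample at random and adding Gaussian noise scaled by the bandwidth; doing this separately per class keeps the labels attached to the synthetic rows.

```python
import random
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth for a 1-D Gaussian kernel (assumed choice)."""
    n = len(values)
    return 1.06 * statistics.stdev(values) * n ** (-1 / 5)


def sample_kde(rows, n_samples, rng):
    """Draw synthetic rows from a Gaussian product-kernel KDE fitted to `rows`."""
    n_features = len(rows[0])
    bandwidths = [
        rule_of_thumb_bandwidth([row[j] for row in rows])
        for j in range(n_features)
    ]
    synthetic = []
    for _ in range(n_samples):
        centre = rng.choice(rows)  # pick one observed sample as the kernel centre
        synthetic.append(
            [rng.gauss(centre[j], bandwidths[j]) for j in range(n_features)]
        )
    return synthetic


# Hypothetical toy data: two element concentrations per wine, NOT measured values.
toy = {
    "Turkish": [[0.12, 3.4], [0.15, 3.1], [0.10, 3.9], [0.14, 3.6]],
    "Italian": [[0.30, 2.2], [0.27, 2.5], [0.33, 2.0], [0.29, 2.4]],
}

rng = random.Random(0)
synthetic_X, synthetic_y = [], []
for label, rows in toy.items():
    samples = sample_kde(rows, 50, rng)  # 50 synthetic wines per class
    synthetic_X.extend(samples)
    synthetic_y.extend([label] * len(samples))
```

In a workflow of this kind, `synthetic_X` and `synthetic_y` would then serve as the training set for the gradient boosting classifier. Because Gaussian noise is unbounded, draws can fall below zero for low-concentration elements, so a practical version might sample on a log scale or clip the values.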

Full Text