Comparison of multiclass classification techniques using dry bean dataset

Md Salauddin Khan, Tushar Deb Nath, Md Murad Hossain, Arnab Mukherjee, Hafiz Bin Hasnath, Tahera Manhaz Meem, Umama Khan

doi:10.1016/j.ijcce.2023.01.002

Abstract

Background: The application of classification methods through multivariate and machine learning techniques is of great significance in the agricultural sector. Classifying seed types and identifying seed quality are vital, because both strongly affect crop production. Dry beans show a wide range of genetic variation worldwide. Many previous studies have used various datasets to identify the varieties of dry beans; however, most of them focused on machine learning techniques with binary classification.

Objective: The aim of this study is to identify a reliable classifier that is least affected by noise and to establish an effective algorithm for dry bean classification. The paper focuses on outlier removal, oversampling with the Adaptive Synthetic (ADASYN) algorithm, and finding the classifier that achieves the highest accuracy.

Methods: The raw dataset for this study was obtained from the UCI Machine Learning Repository. It describes grains by 16 features, comprising 12 dimensions and 4 shape forms. The interquartile range (IQR) method, implemented in Python, was used to eliminate missing values from the dataset. Eight popular classifiers were applied to both the balanced and the imbalanced classes: Logistic Regression (LR), Naïve Bayes (NB), k-Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGB), Support Vector Machine (SVM), and Multilayer Perceptron (MLP). The authors used frequency tables, bar diagrams, boxplots, and analysis of variance for descriptive analysis and data preprocessing.

Results: The XGB classifier outperformed the other classifiers under both the imbalanced and the balanced distribution of dry beans across classes, achieving an accuracy (ACC) of 93.0% and 95.4%, respectively. On the balanced dataset, after application of the ADASYN algorithm, the KNN and RF techniques also performed well in terms of classification accuracy (ACC), sensitivity (SE), specificity (SP), and Cohen's kappa coefficient (Kappa). The most important attributes for classifying the dry beans were ShapeFactor2, Minor Axis Length, and ShapeFactor1, along with EquivDiameter, Roundness, and ConvexArea.

Conclusions: For the classification of dry beans, the XGB classifier performed well whether the class distribution was balanced or imbalanced, which makes it the primary approach for identifying the classes of seeds/beans regardless of class balance. If the classes of the target variable are well balanced, the KNN and RF algorithms may be applied along with the XGB technique for more accurate classification.
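As a rough illustration of two steps named in the abstract, the sketch below shows an IQR-based row filter and the computation of ACC, SE, SP, and Cohen's kappa from a multiclass confusion matrix. It is not the authors' code: the function names, the fence multiplier of 1.5, the quartile interpolation method, and the one-vs-rest definition of SE and SP are all assumptions made for the example, and only the Python standard library is used.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional choice and an assumption here;
    the abstract does not state the multiplier that was used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies inside the fences."""
    lower, upper = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


def multiclass_metrics(cm):
    """Compute ACC, per-class SE and SP, and Cohen's kappa.

    cm[i][j] is the number of samples of true class i predicted as class j.
    SE and SP are computed one-vs-rest for each class (an assumption).
    Every class is assumed to have at least one sample, and at least
    two classes must be present.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    true_totals = [sum(cm[i]) for i in range(k)]
    pred_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    acc = sum(cm[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the marginal totals.
    expected = sum(true_totals[i] * pred_totals[i] for i in range(k)) / n**2
    kappa = (acc - expected) / (1 - expected)

    sensitivity = [cm[i][i] / true_totals[i] for i in range(k)]
    # True negatives for class i: samples neither of class i nor predicted as i.
    specificity = [
        (n - true_totals[i] - pred_totals[i] + cm[i][i]) / (n - true_totals[i])
        for i in range(k)
    ]
    return {
        "acc": acc,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the dry bean dataset.
    rows = [{"Area": v} for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]]
    print(len(drop_outliers(rows, "Area")))
    print(multiclass_metrics([[4, 1], [1, 4]]))
```

In practice the filter would be applied to each numeric feature in turn before oversampling and model fitting, and the confusion matrix would come from held-out predictions of each classifier.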

Full Text