Abstract

This paper presents a new hybrid ensemble data mining model for predicting Salmonella presence in agricultural surface waters. The model combines a heterogeneous ensemble approach to feature selection with clustering, regression, and classification algorithms. The data set was collected from six agricultural ponds in Central Florida and consists of 540 instances with 23 features (26 Salmonella positive and 514 Salmonella negative). The model has three stages. First, a heterogeneous ensemble feature selection (HEFS) approach was applied to select the top features. Second, the k-means clustering algorithm was used to remove misclassified cases from the data set. Third, classification and regression algorithms, including Support Vector Machine (SVM), Naïve Bayes (NB), Artificial Neural Network (ANN), and Random Forest (RF), were applied to the preprocessed data set to predict Salmonella presence, with 20% of the data held out as the test set. These algorithms were combined into 10 different ensemble models through a soft voting approach, and the performance of the resulting hybrid ensemble models was evaluated. The ensemble ANN + RF model achieved the highest performance, outperforming all other single and ensemble models on Area Under the ROC Curve (AUC, 0.98) and prediction accuracy (94.9%). The findings support the validity of the hybrid ensemble model and encourage researchers to use it to predict Salmonella presence in agricultural surface waters.
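The soft voting step mentioned above can be illustrated with a minimal sketch. It assumes each base model outputs a probability pair [P(negative), P(positive)] per sample; the function names, the 0.5 decision threshold, and the probability values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def soft_vote(prob_by_model: Dict[str, Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average the class-probability vectors of several models, sample by sample."""
    models = list(prob_by_model.values())
    if not models:
        raise ValueError("at least one model is required")
    n_samples = len(models[0])
    averaged = []
    for i in range(n_samples):
        n_classes = len(models[0][i])
        # Unweighted mean of each class probability across all models.
        averaged.append(
            [sum(m[i][c] for m in models) / len(models) for c in range(n_classes)]
        )
    return averaged


def predict(averaged: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Label a sample positive (1) when its averaged P(positive) reaches the threshold."""
    return [1 if p[1] >= threshold else 0 for p in averaged]


# Hypothetical per-sample probabilities from two base models (illustrative only).
ann = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
rf = [[0.8, 0.2], [0.2, 0.8], [0.4, 0.6]]

avg = soft_vote({"ANN": ann, "RF": rf})
labels = predict(avg)
print(labels)
```

In the third sample the two models disagree, and the averaged probability decides the label. This is how soft voting differs from majority (hard) voting, which would have no tie-breaker with two models.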
