Abstract

Missing data and class imbalance hinder the accurate prediction of rare events such as dairy mastitis. Resampling and imputation are employed to handle these problems. These methods are often used arbitrarily, despite their profound impact on prediction due to changes caused to the data structure. We hypothesize that their use affects the performance of ML models fitted to automated milking systems (AMSs) data for mastitis prediction. We compare three imputations—simple imputer (SI), multiple imputer (MICE) and linear interpolation (LI)—and three resampling techniques: Synthetic Minority Oversampling Technique (SMOTE), Support Vector Machine SMOTE (SVMSMOTE) and SMOTE with Edited Nearest Neighbors (SMOTEEN). The classifiers were logistic regression (LR), multilayer perceptron (MLP), decision tree (DT) and random forest (RF). We evaluated them with various metrics and compared models with the kappa score. A complete case analysis fitted the RF (0.78) better than other models, for which SI performed best. The DT, RF, and MLP performed better with SVMSMOTE. The RF, DT and MLP had the overall best performance, contributed by imputation or resampling (SMOTE and SVMSMOTE). We recommend carefully selecting resampling and imputation techniques and comparing them with complete cases before deciding on the preprocessing approach used to test AMS data with ML models.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.