Utilization of synthetic minority oversampling technique for improving potato yield prediction using remote sensing data and machine learning algorithms with small sample size of yield data

Hamid Ebrahimy, Yi Wang, Zhou Zhang

doi:10.1016/j.isprsjprs.2023.05.015

Open Access

Abstract

In recent years, integrating machine learning (ML) algorithms with remote sensing data has become a common practice for potato yield prediction at different scales. Because the quantity and quality of training data strongly affect the applicability of ML algorithms, using them effectively can be challenging and expensive in some cases. In this paper, we used the synthetic minority oversampling technique (SMOTE) to generate synthetic data for potato yield prediction. We conducted several experiments at two study sites, CS1 and CS2. SMOTE was used to produce synthetic data at five multiplication rates (5, 10, 20, 40, and 80). Six ML algorithms were used for potato yield prediction: random forest regression (RFR), support vector regression (SVR), K-nearest neighbor (KNN), extreme gradient boosting (XGB), deep neural network (DNN), and stacked auto-encoder neural network (SAE). To train the ML algorithms, multiple sets of synthetically generated data were combined with the original data. The similarity between the synthetic and original data was evaluated with two metrics, Kullback-Leibler divergence (KLD) and Jensen-Shannon divergence (JSD), as well as PCA-based visualization. The performance of the ML algorithms in potato yield prediction was evaluated with the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²). Both the quantitative and visual evaluations showed close similarity between the synthetic and original data: the average JSD (KLD) was 0.00028 (0.0031) in CS1 and 0.161 (0.271) in CS2. The ML algorithms differed noticeably in how they responded to synthetic data. RFR, XGB, DNN, and SAE responded positively to the addition of synthetic data, whereas SVR and KNN responded negatively. DNN showed the strongest positive response, with an average RMSE change of −2.35 percentage points in CS1 and −24.54 percentage points in CS2. Although no single combination of ML algorithm and synthetic sample size gave the highest prediction performance in all settings, which is plausible given the inherent differences among the selected ML algorithms, the RFR algorithm trained on the original data combined with quintupled synthetic data was the most appropriate choice for potato yield prediction.
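
To make the workflow in the abstract concrete, the following is a minimal sketch of SMOTE-style oversampling followed by a KLD/JSD similarity check. It is not the authors' implementation. The abstract does not say how SMOTE was adapted to a continuous yield target, so the sketch assumes that each sample's features and yield are interpolated together as one vector. The function names (`smote_like_samples`, `histogram`, `kld`, `jsd`), the toy data, the histogram binning, and the use of the natural logarithm are all assumptions made for illustration.

```python
import math
import random


def smote_like_samples(rows, rate, k=5, seed=0):
    """Return rate * len(rows) synthetic rows by SMOTE-style interpolation.

    Assumption: each row holds the features AND the yield, so both are
    interpolated jointly. The abstract does not confirm this detail.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(rows) - 1)
    synthetic = []
    for i, base in enumerate(rows):
        # Rank all other rows by Euclidean distance and keep the k nearest.
        # Columns are left unscaled here; real data would usually be scaled.
        others = sorted(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist(base, rows[j]),
        )
        neighbours = others[:k]
        for _ in range(rate):
            neighbour = rows[rng.choice(neighbours)]
            gap = rng.random()  # position on the segment base -> neighbour
            synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def histogram(values, lo, hi, bins=5, eps=1e-10):
    """Normalised histogram on [lo, hi], smoothed by eps to avoid log(0)."""
    width = (hi - lo) or 1.0
    counts = [0.0] * bins
    for v in values:
        idx = int((v - lo) / width * bins)
        counts[min(max(idx, 0), bins - 1)] += 1.0  # clamp the maximum value
    total = sum(counts) + eps * bins
    return [(c + eps) / total for c in counts]


def kld(p, q):
    """Kullback-Leibler divergence KL(p || q), natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


# Toy [feature, yield] rows, invented purely for illustration.
original = [[0.42, 41.0], [0.55, 47.5], [0.61, 52.0],
            [0.48, 44.0], [0.70, 58.5], [0.66, 55.0]]
synthetic = smote_like_samples(original, rate=5, k=3)

# Compare the yield distributions of original and synthetic rows.
yields_orig = [r[1] for r in original]
yields_syn = [r[1] for r in synthetic]
lo = min(yields_orig + yields_syn)
hi = max(yields_orig + yields_syn)
p = histogram(yields_orig, lo, hi)
q = histogram(yields_syn, lo, hi)
print(len(synthetic), kld(p, q), jsd(p, q))
```

With `rate=5` the sketch produces five synthetic rows per original row, which corresponds to the smallest multiplication rate named in the abstract. Because each synthetic row is a convex combination of two original rows, it always stays inside the range of the original data, which is one reason low divergence values are expected for this kind of oversampling.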
