The effectiveness of supervised ML heavily depends on having a large, accurate, and diverse annotated dataset, which poses a challenge in applying ML for yield prediction. To address this issue, we developed a self-training random forest algorithm capable of automatically expanding the annotated dataset. Specifically, we trained a random forest regressor model using a small amount of annotated data. This model was then utilized to generate new annotations, thereby automatically extending the training dataset through self-training. Our experiments involved collecting data from over 30 winter wheat varieties during the 2019–2020 and 2021–2022 growing seasons. The testing results indicated that our model achieved an R2 of 0.84, RMSE of 627.94 kg/ha, and MAE of 516.94 kg/ha in the test dataset, while the validation dataset yielded an R2 of 0.81, RMSE of 692.96 kg/ha, and MAE of 550.62 kg/ha. In comparison, the standard random forest resulted in an R2 of 0.81, RMSE of 681.02 kg/ha, and MAE of 568.97 kg/ha in the test dataset, with validation results of an R2 of 0.79, RMSE of 736.24 kg/ha, and MAE of 585.85 kg/ha. Overall, these results demonstrate that our self-training random forest algorithm is a practical and effective solution for expanding annotated datasets, thereby enhancing the prediction accuracy of ML models in winter wheat yield forecasting.
Read full abstract