Accurately predicting the yields of different crop genotypes in response to weather variability is crucial for developing climate-resilient crop cultivars. Genotype-environment interactions introduce large variations in crop-climate responses and are hard to factor into breeding programs. Data-driven approaches, particularly those based on machine learning, can help guide breeding efforts by accounting for genotype-environment interactions when making yield predictions. Using a new yield dataset containing 93,028 records of soybean hybrids across 159 locations, 28 states, and 13 years, with 5,838 distinct genotypes and daily weather data over a 214-day growing season, we developed two convolutional neural network (CNN) models: one that integrates a CNN with fully connected neural networks (CNN model), and another that adds a long short-term memory (LSTM) layer after the CNN component (CNN-LSTM model). Using the Generalized Ensemble Method (GEM), we combined the two CNN-based models and optimized their weights to improve overall predictive performance. Because the dataset includes unique genotype information for each seed, we could investigate the potential of planting different genotypes depending on weather variables. We used the proposed GEM model to predict yields for all candidate genotypes in each location and weather setting, and thereby to identify the best-performing genotypes. To assess the GEM model, we evaluated it on unseen genotype-location combinations, simulating real-world scenarios in which new genotypes are introduced. The GEM ensemble achieved much better prediction accuracy than the CNN-LSTM model alone and slightly better accuracy than the CNN model alone, as measured by both RMSE and MAE on the validation and test sets. The proposed data-driven approach can be valuable for genotype selection in scenarios with limited testing years.
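The ensemble weighting described above can be illustrated with a minimal sketch. The abstract does not say how the GEM weights were optimized, so the closed-form solution below (weights that sum to one and minimize the mean squared error on a validation set) is an assumption based on the classical formulation of GEM, not the authors' procedure. The names `gem_weights`, `pred_cnn`, and `pred_cnn_lstm` are hypothetical, and the data are synthetic.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def gem_weights(preds, y):
    """Closed-form ensemble weights (assumed formulation, not from the source).

    preds: array of shape (n_models, n_samples) with base-model predictions.
    y:     array of shape (n_samples,) with observed yields.
    Returns weights that sum to 1 and minimize the ensemble MSE on (preds, y).
    The weights are not constrained to be non-negative.
    """
    errors = preds - y                       # per-model residuals
    C = errors @ errors.T / errors.shape[1]  # error cross-moment matrix
    w = np.linalg.solve(C, np.ones(C.shape[0]))
    return w / w.sum()


# Synthetic stand-ins for validation-set predictions of the two base models.
rng = np.random.default_rng(0)
y_val = rng.normal(50.0, 10.0, size=500)
pred_cnn = y_val + rng.normal(0.0, 3.0, size=500)
pred_cnn_lstm = y_val + rng.normal(0.0, 5.0, size=500)

val_preds = np.stack([pred_cnn, pred_cnn_lstm])
weights = gem_weights(val_preds, y_val)
ensemble = weights @ val_preds               # weighted combination

print("weights:", weights)
print("RMSE CNN:", rmse(y_val, pred_cnn))
print("RMSE CNN-LSTM:", rmse(y_val, pred_cnn_lstm))
print("RMSE ensemble:", rmse(y_val, ensemble))
```

Because each base model on its own corresponds to a valid weight vector, the ensemble fitted this way cannot do worse than either base model on the data used to fit the weights. Any gain on held-out data has to be checked separately.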
In addition, we explored the impact of incorporating state-level soil data alongside the weather, location, genotype, and year variables. Because the dataset lacks latitude and longitude information, we assigned the same soil variables to all locations within a state, which limited the spatial resolution of the soil data to the state level. Our findings suggested that integrating state-level soil variables did not substantially enhance the predictive capabilities of the models. We also performed a feature importance analysis based on the change in RMSE to identify the most influential predictors. Location showed the highest RMSE change, followed by genotype and year. Among weather variables, maximum direct normal irradiance (MDNI) and average precipitation (AP) displayed the highest RMSE changes, indicating their importance.
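One common way to measure feature importance as an RMSE change is permutation: shuffle one input column, re-predict, and record how much the RMSE rises. The abstract does not specify whether permutation or another perturbation was used, so the sketch below is an assumption. The function `rmse_change` and the toy model `toy_predict` are hypothetical and stand in for the trained network.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_change(predict, X, y, n_repeats=5, seed=0):
    """Mean RMSE increase when each column of X is shuffled (assumed method).

    predict: callable mapping an (n_samples, n_features) array to predictions.
    Returns an array with one RMSE change per feature column.
    """
    rng = np.random.default_rng(seed)
    baseline = rmse(y, predict(X))
    changes = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            changes[j] += rmse(y, predict(X_perm)) - baseline
    return changes / n_repeats


def toy_predict(X):
    """Hypothetical stand-in model: uses columns 0 and 1, ignores column 2."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]


# Synthetic data in which column 0 matters most and column 2 not at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = toy_predict(X)

changes = rmse_change(toy_predict, X, y)
for name, delta in zip(["feature_0", "feature_1", "feature_2"], changes):
    print(name, "RMSE change:", delta)
```

A larger RMSE change means the model relies more heavily on that input. In this toy setup the ranking follows the coefficients of the stand-in model, which is the same logic behind ranking location, genotype, year, and the weather variables.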