Modeling maize yield and agronomic efficiency using machine learning models: A comparative analysis
Abstract Machine learning (ML) is increasingly being used to enhance yield predictions and optimize agronomic practices in sub-Saharan Africa. Yet how well these models generalize across heterogeneous ecological contexts remains unresolved. This study, conducted in Ghana, evaluates the predictive performance of four ML models, namely random forest (RF), support vector machine (SVM), k-nearest neighbors (KNN), and extreme gradient boosting (XGBoost), for predicting maize yield and agronomic efficiency, defined as the increase in yield per unit of nutrient applied. It also compares the variable importance rankings identified by these models and how the top variables influence yield and agronomic efficiency. The analysis used 4496 georeferenced maize trial datasets from various agroecological zones across Ghana, incorporating 35 variables related to soil properties, climate, topography, crop management, and fertilizer application. Model performance was assessed using three cross-validation techniques: leave-one-out, leave-site-out, and leave-agroecological-zone-out. Accuracy was measured using mean error, root mean square error (RMSE), and the model efficiency coefficient. Under leave-one-out cross-validation, XGBoost consistently achieved the highest predictive accuracy, with the lowest RMSE for yield (639.5 kg ha⁻¹) and for agronomic efficiency of nitrogen (11.6 kg kg⁻¹), which is moderate given the high variability in on-farm nutrient response. RF also performed well, while KNN and SVM extrapolated poorly under the more stringent validation schemes. Nitrogen application rate, rainfall, and crop genotype were consistently identified as the most influential explanatory variables across all models, providing insight into key drivers of productivity. These findings demonstrate the potential of ML techniques to support agricultural planning and improve maize production in sub-Saharan Africa.
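The leave-site-out scheme mentioned above can be sketched in a few lines: all plots from one site are held out together, so each model is always evaluated on a location it never saw during training. This is a minimal pure-Python illustration, not the study's code; the site names and record fields are hypothetical.

```python
def leave_group_out_splits(records, group_key):
    """Yield (group, train, test) folds, holding out one whole group per fold."""
    groups = sorted({r[group_key] for r in records})
    for g in groups:
        train = [r for r in records if r[group_key] != g]
        test = [r for r in records if r[group_key] == g]
        yield g, train, test

# Hypothetical trial records; the same generator works for agroecological zones
# by swapping group_key="site" for group_key="zone".
trials = [
    {"site": "Ejura", "yield_kg_ha": 3200},
    {"site": "Ejura", "yield_kg_ha": 2900},
    {"site": "Wa", "yield_kg_ha": 1800},
    {"site": "Tamale", "yield_kg_ha": 2400},
]

for site, train, test in leave_group_out_splits(trials, "site"):
    print(site, len(train), len(test))
```

Leave-one-out is the limiting case in which every record is its own group, which is why it is the least stringent of the three schemes.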
- Preprint Article
- 10.5194/egusphere-egu25-9987
- Mar 18, 2025
Background: Agriculture is increasingly leveraging machine learning (ML) to enhance yield predictions and optimize agronomic practices. Maize, a staple crop in Ghana, offers a valuable case study for evaluating the effectiveness of diverse ML models in yield prediction and resource management.
Objective: This study aims to evaluate the predictive performance of four ML models, namely Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbours (KNN), and Extreme Gradient Boosting (XGBoost), for maize yield and agronomic efficiency prediction. It also compares variable importance across these models to identify key explanatory variables.
Methods: The study utilized 4,496 georeferenced maize trial datasets from various agroecological zones in Ghana. The 35 explanatory variables covered soil properties, climate, topography, crop management practices, and fertilizer application. Model performance was evaluated using leave-one-out, leave-site-out, and leave-agroecological-zone-out cross-validation. Mean Error (ME), Root Mean Squared Error (RMSE), and the Model Efficiency Coefficient (MEC) were used to compare model accuracy, while a permutation-based approach was employed to assess variable importance.
Results: XGBoost emerged as the most accurate model, achieving the lowest RMSE for yield (639.5 kg ha⁻¹) and for agronomic efficiency of nitrogen (AE-N; 11.6 kg kg⁻¹). RF demonstrated competitive performance, while KNN and SVM yielded inconsistent results under the more rigorous cross-validation conditions. Key explanatory variables identified across models included nitrogen fertilizer rate, rainfall, and crop genotype, underscoring their critical role in yield and agronomic efficiency outcomes.
Conclusion: XGBoost was the most robust and accurate model for maize yield and agronomic efficiency prediction, offering a reliable tool for data-driven agricultural planning in diverse agroecological settings.
The findings underscore the transformative role of advanced ML techniques in modern agriculture, particularly in optimizing staple crop production in sub-Saharan Africa.
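The permutation-based importance approach named in the Methods can be sketched simply: shuffle one column at a time and record how much the model's error grows. This is a generic illustration under stated assumptions (a fitted `predict` function that takes one feature row; here a toy model that ignores its second feature), not the study's implementation.

```python
import random

def rmse(y_true, y_pred):
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Importance of column j = mean increase in RMSE after shuffling column j."""
    rng = random.Random(seed)
    base = rmse(y, [predict(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            deltas.append(rmse(y, [predict(row) for row in Xp]) - base)
        scores.append(sum(deltas) / n_repeats)
    return scores

# Toy model that only uses feature 0, so shuffling feature 1 changes nothing.
predict = lambda row: 2.0 * row[0]
X = [[float(i), float(i % 3)] for i in range(30)]
y = [2.0 * row[0] for row in X]
imp = permutation_importance(predict, X, y)
print(imp)
```

A variable like nitrogen rate ranking highest under this scheme means shuffling it degrades predictions the most, which is how the study's "key explanatory variables" are read off.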
- Research Article
37
- 10.1016/j.fcr.2022.108640
- Oct 1, 2022
- Field Crops Research
Interpretable machine learning methods to explain on-farm yield variability of high productivity wheat in Northwest India
- Research Article
26
- 10.1016/j.isprsjprs.2023.05.015
- May 24, 2023
- ISPRS Journal of Photogrammetry and Remote Sensing
Utilization of synthetic minority oversampling technique for improving potato yield prediction using remote sensing data and machine learning algorithms with small sample size of yield data
- Research Article
32
- 10.1016/j.jnoncrysol.2021.121000
- Jun 27, 2021
- Journal of Non-Crystalline Solids
Prediction of glass forming ability in amorphous alloys based on different machine learning algorithms
- Conference Article
1
- 10.1109/ic3i56241.2022.10072476
- Dec 14, 2022
The real estate market is one of the least transparent sectors of the economy: prices change daily, and homes are often overvalued rather than fairly valued. Homebuyers use budget and market methods to find new homes, but a fundamental problem with the current approach is the inability to predict future market trends that lead to price spikes. It is therefore important for researchers to base house price proposals on empirical studies. To predict the price of a home accurately, customers need to carefully evaluate factors related to the home, which is very difficult. Machine learning (ML) offers a viable way to address this problem: ML models such as Linear Regression (LR), K-Nearest Neighbors (KNN), Random Forest (RF), and an ensemble of LR, KNN, and RF are used. Several error metrics are used to select the best model, including mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). The results show that a model combining LR, RF, and KNN yields the lowest errors; a successful regression model should have a minimal error value. This reduces the need to rely on realtors to determine a fair price for a home based on its key features.
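The ensemble and metric comparison described above can be illustrated with a minimal sketch: average the base models' predictions point by point and score them with the same metrics. The prediction values below are hypothetical stand-ins, not results from the paper.

```python
def mse(y, p): return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)
def rmse(y, p): return mse(y, p) ** 0.5
def mae(y, p): return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def ensemble_mean(*prediction_lists):
    """Average the predictions of several base models, point by point."""
    return [sum(vals) / len(vals) for vals in zip(*prediction_lists)]

y_true = [200.0, 310.0, 150.0]    # hypothetical house prices (in $1000s)
pred_lr = [190.0, 320.0, 160.0]   # hypothetical LR predictions
pred_knn = [210.0, 300.0, 140.0]  # hypothetical KNN predictions
pred_rf = [200.0, 315.0, 155.0]   # hypothetical RF predictions

pred_ens = ensemble_mean(pred_lr, pred_knn, pred_rf)
print(rmse(y_true, pred_ens), mae(y_true, pred_ens))
```

With these toy numbers the averaged ensemble beats each base model on RMSE, which mirrors the paper's finding that the LR + RF + KNN combination yields the lowest errors; averaging cancels errors that point in opposite directions.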
- Research Article
3
- 10.3390/info13110519
- Oct 31, 2022
- Information
Wireless network parameters such as transmitting power, antenna height, and cell radius are determined based on predicted path loss. The prediction is carried out using empirical or deterministic models. Deterministic models provide accurate predictions but are slow due to their computational complexity, and they require detailed environmental descriptions. While empirical models are less accurate, Machine Learning (ML) models provide fast predictions with accuracies comparable to those of deterministic models. Most empirical models are versatile, as they are valid for various frequencies, antenna heights, and sometimes environments, whereas most ML models are not. Therefore, developing a versatile ML model that surpasses empirical model accuracy entails collecting data from various scenarios with different environments and network parameters and using the data to develop the model. Combining datasets of different sizes could lead to lopsidedness in accuracy, such that the model accuracy for a particular scenario is low due to data imbalance. This is because model accuracy varies across regions of the dataset, and such variations are more intense when the dataset is generated from a fusion of datasets of different sizes. A dynamic regressor/ensemble selection technique is proposed to address this problem. In the proposed method, a regressor/ensemble is selected to predict a sample point based on the sample's proximity to a cluster assigned to that regressor/ensemble. K-means clustering was used to form the clusters, and the regressors considered are K-Nearest Neighbor (KNN), Extreme Learning Trees (ET), Random Forest (RF), Gradient Boosting (GB), and Extreme Gradient Boosting (XGBoost). The ensembles are any combinations of two, three, or four of the regressors. The sample points belonging to each cluster were selected from a validation set based on the regressor that made the prediction with the lowest absolute error per individual sample point.
Implementation of the proposed technique improved accuracy in a scenario described by only a few sample points in the training data. Accuracy improvements were also observed on datasets from other works relative to the accuracies reported there. The study also shows that using features extracted from satellite images to describe the environment was more appropriate than using a categorical clutter-height value.
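The routing step of the dynamic selection method described above reduces to a nearest-centroid lookup: a new sample is sent to whichever regressor (or ensemble) won the cluster its features fall closest to. This is a minimal sketch of that dispatch logic only; the centroids and the two toy regressors are hypothetical, and the cluster-winner assignment from the validation set is assumed to have already happened.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dynamic_predict(x, centroids, regressors):
    """Route sample x to the regressor whose cluster centroid is nearest."""
    nearest = min(range(len(centroids)), key=lambda i: euclidean(x, centroids[i]))
    return regressors[nearest](x)

# Two hypothetical K-means centroids, each with the model that won that
# cluster on the validation set (lowest absolute error per sample point).
centroids = [[0.0, 0.0], [10.0, 10.0]]
regressors = [
    lambda x: 1.0 * x[0],        # stand-in for the cluster-0 winner
    lambda x: 2.0 * x[0] + 5.0,  # stand-in for the cluster-1 winner
]
print(dynamic_predict([1.0, 1.0], centroids, regressors))
print(dynamic_predict([9.0, 11.0], centroids, regressors))
```

The appeal of the scheme is that no single model has to be best everywhere; each cluster of the combined dataset gets the model that handled it best on validation.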
- Research Article
26
- 10.3390/met12010050
- Dec 27, 2021
- Metals
The present work focuses on machine-learning-assisted prediction of the fatigue crack growth rate (FCGR) of Ti6Al4V (Ti64) processed through laser powder bed fusion (L-PBF) and post-processing. Machine learning techniques provide a flexible approach for capturing the complex mathematical interrelationships among the processing, structure, and properties of materials. In the present work, four machine learning (ML) algorithms, K-Nearest Neighbor (KNN), Decision Trees (DT), Random Forests (RF), and Extreme Gradient Boosting (XGB), are implemented to analyze the FCGR of the Ti64 alloy. After tuning the hyperparameters of these algorithms, the trained models were found to predict unseen data about as well as the training data. The four ML models are compared with each other over both the training and testing phases, based on their mean squared error and R2 scores. Extreme Gradient Boosting performed best for the FCGR predictions, providing the lowest mean squared errors and the highest R2 scores compared to the other models.
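The model comparison described above rests on two ingredients that are easy to show concretely: a regressor evaluated on held-out points and the R2 score used to rank the models. This sketch implements a minimal 1-D KNN regressor (one of the four algorithms compared) and scores it with R2; the toy data is hypothetical and noiseless, not the Ti64 FCGR data.

```python
def knn_regress(x, X_train, y_train, k=3):
    """Predict the mean target of the k nearest (1-D) training points."""
    order = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
    return sum(y_train[i] for i in order[:k]) / k

def r2_score(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot; 1.0 is a perfect fit, 0.0 matches the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy data following y = 2x, so a small-k KNN can interpolate it exactly.
X_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
X_test = [1.5, 3.5]
y_test = [3.0, 7.0]
preds = [knn_regress(x, X_train, y_train, k=2) for x in X_test]
print(preds, r2_score(y_test, preds))
```

Computing the same R2 on both the training and the testing split, as the paper does, is what reveals whether a tuned model generalizes or merely memorizes.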
- Research Article
34
- 10.1016/j.chemosphere.2022.135265
- Jun 9, 2022
- Chemosphere
Mapping of groundwater productivity potential with machine learning algorithms: A case study in the provincial capital of Baluchistan, Pakistan
- Research Article
63
- 10.1016/j.rser.2023.113967
- Oct 21, 2023
- Renewable and Sustainable Energy Reviews
Analyzing electric vehicle battery health performance using supervised machine learning
- Research Article
4
- 10.1016/j.imed.2023.08.002
- Apr 30, 2024
- Intelligent Medicine
A clinical decision support system using rough set theory and machine learning for disease prediction
- Research Article
15
- 10.1016/j.heliyon.2024.e37065
- Aug 28, 2024
- Heliyon
Maize (Zea mays) is an important staple crop for food security in Sub-Saharan Africa. However, there is a need to increase production to feed a growing population. In Ghana, this is mainly done by increasing acreage, with adverse environmental consequences, rather than by increasing yield per unit area. Accurate prediction of maize yields and nutrient use efficiency in production is critical to making informed decisions toward economic and ecological sustainability. We trained the random forest machine learning algorithm to predict maize yield and agronomic efficiency in Ghana using soil, climate, environment, and management factors, including fertilizer application. We calibrated and evaluated the performance of the random forest algorithm using a 5 × 10-fold nested cross-validation approach. Data from 482 maize field trials consisting of 3136 georeferenced treatment plots conducted in Ghana from 1991 to 2020 were used to train the algorithm, identify important predictor variables, and quantify the uncertainties associated with the random forest predictions. The mean error, root mean squared error, model efficiency coefficient, and 90% prediction interval coverage probability were calculated. The results obtained on test data demonstrate good prediction performance for yield (MEC = 0.81) and moderate performance for agronomic efficiency (MEC = 0.63, 0.55 and 0.54 for AE-N, AE-P and AE-K, respectively). We found that climatic variables were less important predictors than soil variables for yield prediction, but temperature was of key importance to yield prediction and rainfall to agronomic efficiency. The developed random forest models provide a better understanding of the drivers of maize yield and agronomic efficiency in a tropical climate and insights toward improving fertilizer recommendations for sustainable maize production and food security in Sub-Saharan Africa.
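Two quantities used throughout this entry are worth making concrete: agronomic efficiency, defined as the yield gain per unit of nutrient applied, and the model efficiency coefficient (MEC), commonly computed as 1 − SSE/SST. The sketch below uses hypothetical plot numbers, not values from the trials.

```python
def agronomic_efficiency(yield_fertilized, yield_control, nutrient_rate):
    """AE (kg kg-1): extra grain per kg of nutrient applied."""
    return (yield_fertilized - yield_control) / nutrient_rate

def mec(y_obs, y_pred):
    """Model efficiency coefficient: 1 - SSE/SST.
    1.0 = perfect; 0.0 = no better than predicting the observed mean."""
    mean_obs = sum(y_obs) / len(y_obs)
    sse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    sst = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - sse / sst

# Hypothetical plot: 4000 kg/ha with 90 kg N/ha vs 2200 kg/ha unfertilized.
ae_n = agronomic_efficiency(4000.0, 2200.0, 90.0)
print(ae_n)  # 20.0 kg grain per kg N
```

Because AE is a ratio of two noisy yield measurements, it is intrinsically more variable than yield itself, which helps explain why the reported MEC values for AE-N, AE-P, and AE-K are lower than the MEC for yield.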
- Research Article
25
- 10.1016/j.jrmge.2022.08.001
- Oct 1, 2022
- Journal of Rock Mechanics and Geotechnical Engineering
A generic framework for geotechnical subsurface modeling with machine learning
- Research Article
8
- 10.18488/76.v9i2.3065
- Jul 18, 2022
- Review of Computer Engineering Research
Digital enterprises that use various Internet of Things deployment prototypes, such as cloud, mobile, and edge equipment, are experiencing unprecedented traffic volume and dynamicity. Data center networks (DCN) face various issues due to the transient and random nature of the traffic created by services and apps. The primary objective of this paper is to predict network traffic using machine learning (ML) models before network performance starts degrading, since over the last decade ML has had a tremendous impact on handling massive amounts of data. Given the increase in complexity and traffic, we implemented four ML models, K-Nearest Neighbor (KNN), Random Forest (RF), Gradient Boosting (GB), and Decision Tree (DT), with tuned sub-parameters to predict the network traffic. We created a matching ML environment based on a sequential database and provide a comparison table of mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R2) for each prototype. The simulation results show that GB is the best-suited model for predicting network traffic, with performance metrics such as an MSE of 0.001 and an RMSE of 0.030. The Orange tool was used to build and simulate the predictive models.
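Before regressors like KNN or GB can be applied to a sequential traffic database, the series has to be recast as supervised (features, target) pairs. A standard way to do this, sketched below on hypothetical traffic volumes (the paper does not detail its exact feature construction), is a sliding window of lagged observations.

```python
def make_lag_features(series, n_lags):
    """Turn a time series into supervised pairs: each target is predicted
    from the n_lags observations that immediately precede it."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return X, y

traffic = [10, 12, 15, 14, 18, 21, 19]  # hypothetical per-interval volumes
X, y = make_lag_features(traffic, n_lags=3)
print(X[0], y[0])  # first pair: three past values -> next value
```

Any of the four regressors can then be fit on `X` and `y`, and the MSE/RMSE/MAE/R2 comparison table follows from scoring their predictions on held-out pairs.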
- Research Article
- 10.55606/jeei.v5i3.5742
- Oct 30, 2025
- Journal of Engineering, Electrical and Informatics
Thyroid illness is one of the most prevalent medical problems and has a direct impact on a person's physical and emotional well-being. The ML models in this study are trained on the extensive 2017–2020 NHANES data, which covers a wide variety of 6,992 people and XX characteristics. The goal of this study is to improve the early identification and classification of vulnerable people. The machine learning techniques used include K-Nearest Neighbor (KNN), Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Extreme Gradient Boosting (EGB), LightGBM (LGBM), Multi-Layer Perceptron (MLP), and Gradient Boosting. Evaluation of these algorithms revealed that RF, EGB, and LGBM exhibited exceptional accuracy, reaching an impressive 0.90. Among them, RF demonstrated the highest precision at 0.98, showcasing its ability to correctly identify individuals at risk with a high degree of confidence. Moreover, the study identified KNN as the algorithm with the highest recall, 0.73, highlighting its effectiveness in capturing a substantial proportion of true positive cases. EGB achieved the highest F1-score, showing a proportionate balance between precision and recall, and also displayed the highest Area Under the Curve (AUC) at 0.82, underscoring its robust predictive capabilities. This research underscores the pivotal role of ML algorithms in predicting and classifying thyroid disease risk, offering valuable insights for early intervention and personalized healthcare strategies. The high accuracy, precision, and recall values observed with RF, EGB, and LGBM suggest their potential as powerful tools for improving diagnostic capabilities for thyroid disease, contributing to more effective and timely patient care. As machine learning advances, integrating these techniques into healthcare frameworks holds promise for enhancing our understanding and management of thyroid disorders.
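The four metrics this entry reports (accuracy, precision, recall, F1) are all simple functions of the confusion-matrix counts, which is worth seeing explicitly since it explains how RF can lead on precision while KNN leads on recall. The counts below are hypothetical, chosen only to reproduce a recall of 0.73; they are not from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)           # of those flagged at-risk, how many are
    recall = tp / (tp + fn)              # of the truly at-risk, how many we caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for an at-risk / not-at-risk thyroid screen:
acc, prec, rec, f1 = classification_metrics(tp=73, fp=10, fn=27, tn=90)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

Because precision and recall trade off against each other (fewer flags raise precision but lower recall), the F1 score, their harmonic mean, is the single number the study uses to credit EGB with the best balance.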
- Research Article
1
- 10.1016/j.prostr.2024.02.044
- Jan 1, 2024
- Procedia Structural Integrity
Fatigue Life and Crack Growth Rate Prediction of Additively Manufactured 17-4 PH Stainless Steel using Machine Learning