Machine Learning Models for Predicting the Occurrence of Respiratory Diseases Using Climatic and Air-Pollution Factors

Abstract

Objectives: Because climatic and air-pollution factors are known to influence the occurrence of respiratory diseases, we used these factors to develop machine learning models for predicting the occurrence of respiratory diseases.

Methods: We obtained the daily number of respiratory disease patients in Seoul. We used climatic and air-pollution factors to predict the daily number of patients treated for respiratory diseases per 10,000 inhabitants. We applied the relief-based feature selection algorithm to evaluate the importance of each feature. We developed two prediction models, one with gradient boosting and one with Gaussian process regression (GPR). We employed the holdout cross-validation method, in which 75% of the data was used to train the model and the remaining 25% was used to test the trained model. We determined the estimated number of respiratory disease patients by applying the developed prediction models to the test set. To evaluate the performance of each model, we calculated the coefficient of determination (R²) and the root mean square error (RMSE) between the original and estimated numbers of respiratory disease patients. We used the Shapley Additive exPlanations (SHAP) approach to interpret the estimated output of each machine learning model.

Results: Features with negative weights in the relief-based algorithm were excluded. When applying gradient boosting to unseen test data, R² and RMSE were 0.68 and 13.8, respectively. For GPR, R² and RMSE were 0.67 and 13.9, respectively. SHAP analysis showed that reductions in average temperature, daylight duration, average humidity, sulfur dioxide (SO2), total solar insolation amount, and temperature difference increased the number of respiratory disease patients, whereas increases in atmospheric pressure, carbon monoxide (CO), and particulate matter ≤2.5 μm in aerodynamic diameter (PM2.5) increased the number of respiratory disease patients.

Conclusion: We successfully developed models for predicting the occurrence of respiratory diseases using climatic and air-pollution factors. These models could evolve into public warning systems.
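The evaluation protocol described in the abstract (a 75%/25% holdout split, then R² and RMSE on the held-out part) can be sketched as follows. This is a minimal illustration, not the authors' code: the data are synthetic, the single `temperature` feature and the straight-line model are stand-ins for the paper's gradient boosting and GPR models, and all function names are hypothetical.

```python
import math
import random


def holdout_split(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows once, then cut them into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Least-squares fit of y = intercept + slope * x (stand-in model only)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error between observed and estimated values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic stand-in data (NOT the Seoul data): colder days get more patients.
rng = random.Random(42)
rows = []
for _ in range(400):
    temperature = rng.uniform(-10.0, 30.0)
    patients = 120.0 - 2.0 * temperature + rng.gauss(0.0, 10.0)
    rows.append((temperature, patients))

train, test = holdout_split(rows)       # 75% train, 25% test
intercept, slope = fit_line(train)      # fit on the training part only
y_true = [y for _, y in test]
y_pred = [intercept + slope * x for x, _ in test]
print(f"test R2 = {r2_score(y_true, y_pred):.2f}, test RMSE = {rmse(y_true, y_pred):.1f}")
```

The point of the split is that both metrics are computed only on rows the model never saw during fitting, which is what "unseen test data" refers to in the Results.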

Similar Papers
  • Book Chapter
  • 10.3233/atde240671
Core Loss Estimation for Three Phase Transformer Based on GPR and FEA
  • Oct 17, 2024
  • Seda Kul + 2 more

This study used the Gaussian Process Regression (GPR) method to predict the core losses of a dry-type three-phase transformer modeled with Finite Element Analysis (FEA). In the estimation and analysis processes, the core area Ac, the primary excitation voltage Vp, and the primary winding number of turns Np are used as the three input parameters. GPR is a powerful machine learning method for data with few features and provides a Bayesian regression capable of measuring uncertainty in predictions. The data generated in the ANSYS/MAXWELL environment for core loss estimation are chosen at random using the parametric FEA setup. The GPR model is trained on these data with the Matern 5/2 kernel function. The results are satisfactory: the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Squared Error (MSE) are calculated as 0.0102, 0.0029, and 0.0534, respectively. The estimated results are also very close to the simulation values. As a result, the GPR method can be used as a reliable tool for estimating core losses with high accuracy during the transformer design stage.
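For reference, the way GPR yields both a prediction and an uncertainty estimate can be written out in its standard textbook form. This is a sketch under assumptions: a zero-mean prior and Gaussian noise are assumed, and the symbols below (signal variance \sigma_f^2, length scale \ell, noise variance \sigma_n^2, training kernel matrix K, and the vector \mathbf{k}_* of kernel values between a new input \mathbf{x}_* and the training inputs) are not given in the abstract.

```latex
\begin{aligned}
\mu_* &= \mathbf{k}_*^{\top}\left(K + \sigma_n^2 I\right)^{-1}\mathbf{y},\\
\sigma_*^2 &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top}\left(K + \sigma_n^2 I\right)^{-1}\mathbf{k}_*,\\
k(\mathbf{x}, \mathbf{x}') &= \sigma_f^2\left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right),
\qquad r = \lVert \mathbf{x} - \mathbf{x}' \rVert .
\end{aligned}
```

Here \mu_* is the predicted core loss for an input \mathbf{x}_* = (Ac, Vp, Np), \sigma_*^2 is the predictive variance that quantifies uncertainty, and the last line is the Matern 5/2 kernel.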

  • Research Article
  • 10.1108/mmms-08-2025-0323
Multifactor machine learning prediction of key properties of foam concrete: model selection and parameter sensitivity analysis
  • Dec 18, 2025
  • Multidiscipline Modeling in Materials and Structures
  • Sen Yang + 7 more

Purpose: The purpose of this study is to enhance the accuracy and interpretability of predicting the thermal conductivity of foam concrete under multiple influencing factors. Unlike prior research that often relied on limited input variables and focused primarily on mechanical strength, this work integrates six machine learning models to evaluate performance across a broader parameter set. By identifying the most effective model and conducting sensitivity analysis through Shapley Additive Explanations (SHAP) values, the study aims to provide reliable predictive tools and theoretical guidance for optimizing foam concrete's thermal insulation properties in engineering applications.

Design/methodology/approach: This study collected a large amount of data on foam concrete thermal conductivity, incorporating density, water-to-cement ratio, supplementary cementitious materials (SCM), fine aggregate-to-binder ratio, curing time, and superplasticizer as inputs. Data were preprocessed through standardization, outlier removal and 5-fold cross-validation to ensure reliability. Six machine learning models – Gaussian Process Regression (GPR), Ensemble Tree, Linear Regression (LR), Neural Network (NN), Regression Tree (RT) and Support Vector Machine (SVM) – were trained and tested. Model performance was assessed using R², root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). SHAP was applied to the optimal model to quantify feature contributions and enhance interpretability.

Findings: GPR outperformed other models in predicting foam concrete thermal conductivity, achieving R² values of 0.97 (training) and 0.88 (testing), with low error metrics, demonstrating strong accuracy and generalization. Neural Networks and SVM also showed reasonable performance, while LR and RT performed poorly. Sensitivity analysis using SHAP revealed density (46.3%) and water-to-cement ratio (20.3%) as the dominant factors influencing conductivity, whereas SCM and superplasticizer had minimal effects. These results highlight GPR's robustness and confirm density and W/C as the key parameters governing foam concrete's thermal insulation performance.

Originality/value: This study extends existing foam concrete research by shifting focus from compressive strength prediction to thermal conductivity, a critical property for insulation performance. Unlike prior work limited to three or four variables, it integrates six input parameters and evaluates six machine learning models, offering a more comprehensive and multidimensional analysis. The inclusion of SHAP provides novel interpretability, clarifying feature contributions and enhancing model transparency. By combining broad parameter input with explainable artificial intelligence, this work delivers both methodological innovation and practical guidance for optimizing foam concrete's thermal performance.
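Percentage contributions such as those reported above are often obtained by averaging absolute SHAP values per feature and normalizing them to 100%. Whether the authors used exactly this normalization is an assumption. The sketch below computes exact Shapley values by enumerating feature orderings against a single baseline input, which is a simplification of what SHAP libraries do with background data; `toy_model`, the samples, and the baseline are hypothetical and unrelated to the foam concrete dataset.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Averages each feature's marginal contribution over all feature orderings,
    so it is only practical for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]            # switch feature i to its actual value
            value = model(current)
            phi[i] += value - previous   # marginal contribution in this ordering
            previous = value
    return [total / len(orders) for total in phi]


def toy_model(z):
    """Hypothetical model: two active features with an interaction, one inert."""
    return 4.0 * z[0] + 2.0 * z[1] + z[0] * z[1] + 0.0 * z[2]


def percent_contributions(model, samples, baseline):
    """Mean absolute Shapley value per feature, normalized to sum to 100."""
    mean_abs = [0.0] * len(baseline)
    for x in samples:
        for i, value in enumerate(shapley_values(model, x, baseline)):
            mean_abs[i] += abs(value) / len(samples)
    total = sum(mean_abs)
    return [100.0 * m / total for m in mean_abs]


baseline = [0.0, 0.0, 0.0]
samples = [[1.0, 1.0, 1.0], [2.0, 0.0, 3.0], [0.0, 2.0, 1.0]]
phi = shapley_values(toy_model, samples[0], baseline)
shares = percent_contributions(toy_model, samples, baseline)
print("Shapley values for first sample:", phi)
print("Percent contributions:", [round(s, 1) for s in shares])
```

The per-sample values always sum to the difference between the model output at the sample and at the baseline, which is the additivity property that makes the normalized shares interpretable.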

  • Research Article
  • Cite count: 1
  • 10.3390/w18010026
Machine Learning and SHAP-Based Prediction of Tip Velocity Around Spur Dikes Using a Small-Scale Experimental Dataset
  • Dec 21, 2025
  • Water
  • Nadir Murtaza + 6 more

River-training structures such as spur dikes are frequently used in river engineering and play a critical role in flow regulation and riverbank stabilization. However, previous studies lack a precise prediction of factors inducing scour and turbulence phenomena, such as tip velocity, for the optimal design of spur dikes. This study addresses this gap by predicting tip velocity around spur dikes using advanced and interpretable machine learning models while simultaneously evaluating the influence of key geometric and hydraulic parameters. For this purpose, the study used Gaussian Process Regression (GPR), Categorical Boosting (CatBoost), Random Forest (RF), and Extreme Gradient Boosting (XGBoost), optimized with Particle Swarm Optimization (PSO), to predict tip velocity in the vicinity of the spur dike. A small dataset of 69 laboratory-scale experimental trials was collected; the AI models were therefore selected for their ability to handle such limited data. The input parameters were the Froude number (Fr), the ratio of separation length to spur dike length (L/l), and the incidence angle (β), while the output parameter was tip velocity. The data were split 70%, 15%, and 15% into training, testing, and validation sets, respectively, for the four selected models. SHapley Additive exPlanations (SHAP) analysis was used to observe the influence of the critical parameters on the tip velocity. The results demonstrated the superior performance of GPR, followed by the CatBoost model, compared to the other models. GPR and CatBoost show greater values of the coefficient of determination (R²) (GPR R² = 0.972 and CatBoost R² = 0.970) and lower values of root mean square error (RMSE) (GPR RMSE = 0.0107 and CatBoost RMSE = 0.0236). 
The result of the heatmap and SHAP analysis indicated a greater influence of Fr and L/l and a lower impact of β on the tip velocity. The results of this study recommend the utilization of GPR and CatBoost for precise and robust performance of the hydrodynamic phenomenon around the spur dikes, supporting scour mitigation strategies in river engineering.

  • Research Article
  • Cite count: 1
  • 10.1038/s41598-025-17588-9
Enhancing wellbore stability through machine learning for sustainable hydrocarbon exploitation
  • Oct 9, 2025
  • Scientific Reports
  • Mohatsim Mahetaji + 1 more

Wellbore instability manifested through formation breakouts and drilling-induced fractures poses serious technical and economic risks in drilling operations. It can lead to non-productive time, stuck pipe incidents, wellbore collapse, and increased mud costs, ultimately compromising operational safety and project profitability. Accurately predicting such instabilities is therefore critical for optimizing drilling strategies and minimizing costly interventions. This study explores the application of machine learning (ML) regression models to predict wellbore instability more accurately, using open-source well data from the Netherlands well Q10-06. The dataset spans a depth range of 2177.80 to 2350.92 m, comprising 1137 data points at 0.1524 m intervals, and integrates composite well logs, real-time drilling parameters, and wellbore trajectory information. Borehole enlargement, defined as the difference between Caliper (CAL) and Bit Size (BS), was used as the target output to represent instability. Twelve regression models were evaluated, including Linear and Polynomial Regression, Decision Tree, Random Forest, Gradient Boosting, Histogram Gradient Boosting, Support Vector Regression, Multi-layer Perceptron, k-Nearest Neighbors, Gaussian and Bernoulli Naive Bayes, and Gaussian Process Regression. Model performance was assessed using the Root Mean Squared Error (RMSE) and Coefficient of Determination (DC). Among them, Histogram Gradient Boosting yielded the highest prediction accuracy (RMSE = 8.5138 × 10⁻² in, DC = 0.99), followed closely by Gradient Boosting, Random Forest, and Decision Tree models. Conversely, Bernoulli Naive Bayes and Support Vector Regression demonstrated poor generalization. To interpret model predictions, SHAP (SHapley Additive exPlanations) analysis was employed, highlighting the most influential features and their directional impacts. 
The SHAP results aligned closely with heatmap-based feature correlations, confirming that high-performing models considered a diverse set of features, while underperforming models were overly reliant on limited inputs. This study demonstrates that bypassing traditional empirical correlations in data-driven machine learning techniques can enhance prediction accuracy while preserving model interpretability through SHAP analysis.

  • Research Article
  • Cite count: 94
  • 10.1016/j.jhydrol.2021.126538
Improving terrestrial evapotranspiration estimation across China during 2000–2018 with machine learning methods
  • Jun 6, 2021
  • Journal of Hydrology
  • Lichang Yin + 4 more


  • Research Article
  • 10.47836/pjst.31.4.16
Comparison of Count Data Generalised Linear Models: Application to Air-Pollution Related Disease in Johor Bahru, Malaysia
  • May 25, 2023
  • Pertanika Journal of Science and Technology
  • Zetty Izzati Zulki Alwani + 3 more

Poisson regression is a common approach for modelling discrete data. However, due to characteristics of Poisson distribution, Poisson regression might not be suitable since most data are over-dispersed or under-dispersed. This study compared four generalised linear models (GLMs): negative binomial, generalised Poisson, zero-truncated Poisson and zero-truncated negative binomial. An air-pollution-related disease, upper respiratory tract infection (URTI), and its relationship with various air pollution and climate factors were investigated. The data were obtained from Johor Bahru, Malaysia, from January 1, 2012, to December 31, 2013. Multicollinearity between the covariates and the independent variables was examined, and model selection was performed to find the significant variables for each model. This study showed that the negative binomial is the best model to determine the association between the number of URTI cases and air pollution and climate factors. Particulate Matter (PM10), Sulphur Dioxide (SO2) and Ground Level Ozone (GLO) are the air pollution factors that affect this disease significantly. However, climate factors do not significantly influence the number of URTI cases. The model constructed in this study can be utilised as an early warning system to prevent and mitigate URTI cases. The involved parties, such as the local authorities and hospitals, can also employ the model when facing the risk of URTI cases that may occur due to air pollution factors.
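The over-dispersion issue that motivates the negative binomial model can be checked with a quick variance-to-mean comparison, sketched below. The daily counts are made-up illustrative numbers, not the Johor Bahru data, and the function names are hypothetical. The moment estimate uses the common NB2 parameterization, Var = mu + alpha * mu², which is an assumption about the variant the study fitted.

```python
import statistics


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson-like counts, above 1 if over-dispersed."""
    return statistics.variance(counts) / statistics.mean(counts)


def nb2_alpha(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2."""
    mean = statistics.mean(counts)
    return (statistics.variance(counts) - mean) / mean ** 2


# Made-up daily case counts, for illustration only.
daily_cases = [0, 1, 3, 8, 15, 2, 0, 21, 4, 6]

index = dispersion_index(daily_cases)
alpha = nb2_alpha(daily_cases)
print(f"dispersion index = {index:.2f}, moment estimate of alpha = {alpha:.2f}")
if index > 1.0:
    print("Counts look over-dispersed; a plain Poisson model would understate the variance.")
```

A Poisson model forces the variance to equal the mean, so an index well above 1 is the usual signal to consider a negative binomial or generalised Poisson model instead.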

  • Research Article
  • Cite count: 180
  • 10.1109/jstars.2016.2575360
An Investigation Into Machine Learning Regression Techniques for the Leaf Rust Disease Detection Using Hyperspectral Measurement
  • Sep 1, 2016
  • IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
  • Davoud Ashourloo + 4 more

The complex impacts of disease stages and disease symptoms on the spectral characteristics of plants limit disease severity detection using spectral vegetation indices (SVIs). Although machine learning techniques have been utilized for vegetation parameter estimation and disease detection, the effects of disease symptoms on their performance have received less attention. Hence, this paper investigated 1) the use of partial least square regression (PLSR), $\nu$-support vector regression ($\nu$-SVR), and Gaussian process regression (GPR) methods for wheat leaf rust disease detection, 2) the impact of training sample size on the results, 3) the influence of disease symptoms on the prediction performance of the above-mentioned methods, and 4) the relative performance of SVIs and machine learning techniques. In this study, the spectra of infected and non-infected leaves with different disease symptoms were measured using a non-imaging spectroradiometer in the electromagnetic region of 350 to 2500 nm. To produce a ground truth dataset, we used digital camera photos to compute the disease severity and disease symptom fractions. Then, different sample sizes of the collected datasets were used to train each method. PLSR showed coefficient of determination ($R^2$) values of 0.98 (root mean square error (RMSE) = 0.6) and 0.92 (RMSE = 0.11) at leaf and canopy scales, respectively. SVR showed $R^2$ and RMSE close to PLSR at leaf ($R^2$ = 0.98, RMSE = 0.05) and canopy ($R^2$ = 0.95, RMSE = 0.12) scales. GPR showed $R^2$ values of 0.98 (RMSE = 0.03) and 0.97 (RMSE = 0.11) at leaf and canopy scales, respectively. Moreover, GPR performed better than the other methods with small training sample sizes. The results indicate that the machine learning techniques, in contrast to SVIs, are not sensitive to different disease symptoms and that their results are reliable.

  • Research Article
  • Cite count: 96
  • 10.1016/j.envres.2007.10.003
A time-series analysis of any short-term effects of meteorological and air pollution factors on preterm births in London, UK
  • Nov 19, 2007
  • Environmental Research
  • Sue J Lee + 3 more


  • Research Article
  • Cite count: 2
  • 10.20517/wecn.2024.70
Prediction of arsenic (III) adsorption from aqueous solution using non-neural network algorithms
  • Dec 10, 2024
  • Water Emerging Contaminants & Nanoplastics
  • Nazmul Hassan Mirza + 1 more

Heavy metals such as arsenic can be effectively removed through adsorption. Through material property evaluation and adsorption parameter optimization, machine learning (ML) modeling provides an alternative to lengthy laboratory experimentation. In this work, adsorption data from an earlier study employing a waste-material composite were used. To create prediction models, four non-neural network algorithms - support vector machines (SVM), Gaussian process regression (GPR), linear regression, and ensemble approaches - were used and contrasted with neural network algorithms. Nine predictors were utilized, ranging from adsorbent composition alterations to experimental circumstances. Using principal component analysis (PCA) and feature selection, together with the F-test and minimum redundancy maximum relevance (MRMR) algorithms for feature reduction, optimization was accomplished. With an R-squared of 0.939, mean absolute error (MAE) of 5.778, and root mean squared error (RMSE) of 7.119 for training and an R-squared of 0.942, MAE of 5.450, and RMSE of 6.870 for testing, the optimized GPR method offered the best predictive performance. The best R-squared values found for other algorithms were: SVM (0.922), linear regression (0.925), and ensemble (0.927). The most important variables influencing adsorption efficiency were initial arsenic concentration, time, and the iron salt content. Local interpretable model-agnostic explanations (LIME), partial dependence plot (PDP), and Shapley additive explanations (SHAP) plots were used to explain these results. This work shows that, based on model-derived parameters, non-neural network algorithms may efficiently simulate and optimize arsenic adsorption tests, providing a trustworthy substitute for neural network techniques and markedly increasing adsorption efficiency.

  • Research Article
  • Cite count: 101
  • 10.1155/2019/2859429
Gaussian Process Regression Tuned by Bayesian Optimization for Seawater Intrusion Prediction
  • Jan 17, 2019
  • Computational Intelligence and Neuroscience
  • George Kopsiaftis + 4 more

Accurate prediction of the seawater intrusion extent is necessary for many applications, such as groundwater management or protection of coastal aquifers from water quality deterioration. However, most applications require a large number of simulations usually at the expense of prediction accuracy. In this study, the Gaussian process regression method is investigated as a potential surrogate model for the computationally expensive variable density model. Gaussian process regression is a nonparametric kernel-based probabilistic model able to handle complex relations between input and output. In this study, the extent of seawater intrusion is represented by the location of the 0.5 kg/m3 iso-chlore at the bottom of the aquifer (seawater intrusion toe). The initial position of the toe, expressed as the distance of the specific line from a number of observation points across the coastline, along with the pumping rates are the surrogate model inputs, whereas the final position of the toe constitutes the output variable set. The training sample of the surrogate model consists of 4000 variable density simulations, which differ not only in the pumping rate pattern but also in the initial concentration distribution. The Latin hypercube sampling method is used to obtain the pumping rate patterns. For comparison purposes, a number of widely used regression methods are employed, specifically regression trees and Support Vector Machine regression (linear and nonlinear). A Bayesian optimization method is applied to all the regressors, to maximize their efficiency in the prediction of seawater intrusion. The final results indicate that the Gaussian process regression method, albeit more time consuming, proved to be more efficient in terms of the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination (R2).
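The Latin hypercube sampling step mentioned above can be illustrated with a short sketch. It is a generic textbook version, not the authors' setup: the number of wells, the number of samples, and the pumping-rate bounds are hypothetical placeholders.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=0):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum per dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One random point inside each of the n_samples equal-width strata...
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # ...then shuffle so strata are paired randomly across dimensions.
        rng.shuffle(points)
        columns.append(points)
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


def scale(sample, bounds):
    """Map a unit-cube sample onto per-dimension (low, high) bounds."""
    return tuple(low + u * (high - low) for u, (low, high) in zip(sample, bounds))


unit_samples = latin_hypercube(n_samples=8, n_dims=3)
bounds = [(0.0, 500.0)] * 3  # hypothetical pumping-rate limits for three wells
pumping_patterns = [scale(s, bounds) for s in unit_samples]
for pattern in pumping_patterns:
    print([round(rate, 1) for rate in pattern])
```

Compared with plain random sampling, every stratum of every input is covered exactly once, which spreads a limited simulation budget more evenly over the pumping-rate range.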

  • Research Article
  • 10.35940/ijeat.c4745.15030226
Comparative Analysis of GPR and RBF Models for Predicting the Breakdown Voltage of Insulating Oils
  • Feb 28, 2026
  • International Journal of Engineering and Advanced Technology
  • Hadj Mahmoud Mahmoudi + 2 more

Accurately forecasting the breakdown voltage of insulating oils is a prerequisite for the reliable design and operation of high-voltage equipment. The present work focuses on developing data-driven artificial intelligence (AI) models to predict the breakdown voltage of transformer oil as a function of temperature and electrode spacing. Two different machine learning algorithms are applied and compared: Gaussian Process Regression (GPR) and Radial Basis Function (RBF) neural network. The experimental data for electrode distances of 5 mm and 20 mm are used to train, test, and validate the models using a 60/20/20 data-splitting scheme. The predictive capacity of the models is evaluated using the three metrics: mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R). Experimental results confirm that the model predictions are in excellent agreement with the measurements at short electrode distances for both models. Nevertheless, at longer distances, the differences between the two performances become quite substantial. The GPR method is more reliable and generalises better, particularly at 20 mm, where it yields lower validation errors than the RBF approach. In addition, as a probabilistic method, GPR enables the estimation of predictive uncertainty, which is essential for applications oriented toward safety and dependability. Overall, the present work has demonstrated GPR's capability to determine the breakdown voltage of insulating oils and its potential for high-voltage insulation diagnostics and design.

  • Research Article
  • Cite count: 16
  • 10.1177/09544062211050542
A comparative study of Gaussian process regression with other three machine learning approaches in the performance prediction of centrifugal pump
  • Dec 30, 2021
  • Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science
  • Xutao Zhao + 3 more

Accurate prediction of performance indices from impeller parameters is of great importance for the initial and optimal design of centrifugal pumps. In this study, a kernel-based non-parametric machine learning method, Gaussian process regression (GPR), was proposed to predict the performance of a centrifugal pump with less effort based on the available impeller parameters. Nine impeller parameters were defined as model inputs, and the pump performance indices, that is, the head and efficiency, were taken as model outputs. The applicability of three widely used nonlinear GPR kernel functions, squared exponential (SE), rational quadratic (RQ) and Matern5/2, was investigated. Comparison with the experimental data showed that the SE kernel function is the most suitable for capturing the relationship between impeller parameters and performance indices, because it gave the highest R square and the lowest values of max absolute relative error (MARE), mean absolute proportional error (MAPE), and root mean square error (RMSE). In addition, the results predicted by GPR with the SE kernel function were compared with the results given by three other machine learning models. The comparison shows that GPR with the SE kernel function is more accurate and robust than the other models in centrifugal pump performance prediction, and its prediction errors and uncertainties are both acceptable for engineering applications. The GPR method predicts centrifugal pump performance at lower cost with sufficient accuracy; it can be further used to assist the design and manufacture of centrifugal pumps and, coupled with stochastic optimization methods, to speed up the optimization design process of the impeller.
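The three kernel functions compared above have standard closed forms, sketched below as functions of the distance r between two inputs. The hyperparameter names and default values (`length`, `sigma`, `alpha`) are illustrative assumptions; the abstract does not report the fitted values.

```python
import math


def squared_exponential(r, length=1.0, sigma=1.0):
    """SE kernel: sigma^2 * exp(-r^2 / (2 * length^2))."""
    return sigma ** 2 * math.exp(-r ** 2 / (2.0 * length ** 2))


def rational_quadratic(r, length=1.0, sigma=1.0, alpha=1.0):
    """RQ kernel: sigma^2 * (1 + r^2 / (2 * alpha * length^2)) ** (-alpha)."""
    return sigma ** 2 * (1.0 + r ** 2 / (2.0 * alpha * length ** 2)) ** (-alpha)


def matern52(r, length=1.0, sigma=1.0):
    """Matern 5/2 kernel: sigma^2 * (1 + s + s^2 / 3) * exp(-s), with s = sqrt(5) * r / length."""
    s = math.sqrt(5.0) * r / length
    return sigma ** 2 * (1.0 + s + s ** 2 / 3.0) * math.exp(-s)


# Compare how quickly each kernel's similarity decays with distance.
for r in (0.0, 0.5, 1.0, 2.0):
    print(
        f"r = {r:.1f}  SE = {squared_exponential(r):.4f}  "
        f"RQ = {rational_quadratic(r):.4f}  Matern5/2 = {matern52(r):.4f}"
    )
```

All three equal `sigma ** 2` at zero distance and decay as inputs move apart; they differ in how smooth the resulting functions are and how heavy the tails are, which is why the best choice is an empirical question for each dataset.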

  • Research Article
  • 10.3390/en19010124
A Day-Ahead Wind Power Dynamic Explainable Prediction Method Based on SHAP Analysis and Mixture of Experts
  • Dec 25, 2025
  • Energies
  • Hao Zhang + 5 more

Traditional single-prediction models often exhibit limitations in meeting wind power prediction requirements in complex operational scenarios. Furthermore, the inherent “black-box” nature of deep learning models leads to limited interpretability of predictions, hindering effective support for grid dispatch planning. To address these issues, this study proposes a novel day-ahead wind power prediction method, referred to as SHapley Additive exPlanations (SHAP)–Mixture of Experts (MoE), which integrates SHAP into an MoE framework. Here, SHAP is employed for interpretability purposes. This study innovatively transforms SHAP analysis into prior knowledge to guide the decision-making of the MoE gating network and proposes a two-layer dynamic interpretation mechanism based on the collaborative analysis of gating weights and SHAP values. This approach clarifies key meteorological factors and the model’s advantageous scenarios, while quantifying the uncertainty among multiple expert decisions. Firstly, each expert model was pre-trained, and its parameters were frozen to construct a candidate expert pool. Secondly, the SHAP vectors for each pre-trained expert were computed over all sample features to characterize their decision-making logic under varying scenarios. Thirdly, an augmented feature set was constructed by fusing the original meteorological features with SHAP attribution matrices from all experts; this set was used to train the gating network within the MoE framework. Finally, for new input samples, each frozen expert model generates a prediction along with its corresponding SHAP vector, and the gating network aggregates these predictions to produce the final forecast. The proposed method was validated using operational data from an offshore wind farm located in southeastern China. Compared with the best individual expert model and traditional ensemble forecasting models, the proposed method reduces the Root Mean Square Error (RMSE) by 0.23% to 4.92%. 
Furthermore, the method elucidates the influence of key features on each expert’s decisions, offering insights into how the gating network adaptively selects experts based on the input features and expert-specific characteristics across different scenarios.

  • Research Article
  • Cite count: 1
  • 10.1007/s10661-025-13999-3
Water quality parameters retrieval and nutrient status evaluation based on machine learning methods and Sentinel-2 imagery: a case study of the Hongjiannao Lake
  • Apr 15, 2025
  • Environmental Monitoring and Assessment
  • Ying Liu + 2 more

A timely and accurate understanding of lake water quality is significant for maintaining ecological balance, ensuring water resource security, and promoting regional sustainable development. However, because different water quality parameters (WQPs) have different numerical ranges and characteristics, the optimal retrieval algorithm also differs among them, which increases the complexity of WQP retrieval. To solve this problem, this study took the Hongjiannao Lake in China as the research object. Based on Sentinel-2 images and measured data of chlorophyll-a (Chl-a), turbidity (TU), chemical oxygen demand (COD), total nitrogen (TN), total phosphorus (TP), ammonia nitrogen (NH3-N), electrical conductivity (EC) and potential of hydrogen (pH), it compared the ability of the Boruta, recursive feature elimination (RFE) and Shapley additive explanations (SHAP) methods to obtain the optimal feature subset. The random forest (RF), back propagation neural network (BP), and support vector machine (SVM) algorithms were used to retrieve lake water quality, and the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and the ratio of performance to deviation (RPD) were used to evaluate the prediction accuracy of the combined models from different aspects. The SHAP method was employed to quantify the contribution of input characteristics to WQPs. Subsequently, an integrated nutrient state index was established by utilizing the inversion results of Chl-a, COD, TN, TP, and NH3-N, along with the entropy weight method, to assess the nutrient status level. The results showed that the optimal model, SHAP-RF, has better retrieval accuracy for WQPs (Chl-a, R² = 0.66, RMSE = 0.28 µg/L; COD, R² = 0.73, RMSE = 7.30 mg/L; EC, R² = 0.69, RMSE = 160.58 µS/cm; NH3-N, R² = 0.59, RMSE = 0.11 mg/L; pH, R² = 0.73, RMSE = 0.007; TN, R² = 0.84, RMSE = 1.09 mg/L; TP, R² = 0.65, RMSE = 0.015 mg/L; TU, R² = 0.63, RMSE = 3.17 NTU). The most sensitive spectral bands for Chl-a and NH3-N were the combination of green and red-edge bands. The sum of blue and near-infrared (NIR) bands was the most important in the inversion of COD. The product of the red and NIR bands played a crucial role in pH inversion. The difference between the green and red bands was the first choice for EC inversion. The red-edge bands and their combination contributed significantly to TN inversion. TP was most sensitive to the red-edge bands and shortwave infrared bands. The red band exhibited the highest sensitivity to TU inversion. The primary pollutants in Hongjiannao Lake were TN, TP, and COD. The water quality had deteriorated, with 29% of the water exhibiting light nutrient status, 53% displaying middle nutrient status, and 18% enduring hyper nutrient status. The results were highly significant for precisely assessing the water quality and nutrient levels in lakes.

  • Research Article
  • 10.13227/j.hjkx.202412303
Ground-based Hyperspectral Coupled Interpretable Integrated Machine Learning for Salinity and pH Inversion in Agricultural Soils
  • Feb 8, 2026
  • Huan jing ke xue= Huanjing kexue
  • Hua-Yu Huang + 5 more

Soil salinity and alkalinity are key factors limiting sustainable agricultural development. Timely acquisition of salinity and alkalinity information is crucial for soil improvement and long-term fertility enhancement. Using the ground hyperspectral data and the measured soil salinity (SSC) and pH values of the Hetao Plain as data sources, the hyperspectral reflectance was transformed with orthogonal signal correction (OSC), and competitive adaptive reweighted sampling (CARS) was then used to screen the characteristic bands of salinity and alkalinity information. Then, environmental variables and microwave remote sensing data were introduced to build inversion models of SSC and pH based on six integrated machine learning algorithms, including extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), and random forest (RF). The models were visualized and analyzed using Shapley additive explanations (SHAP). The results showed that: ① The salinity and alkalinity grades of farmland soils in the Hetao Plain were generally mild to moderate, with strong spatial heterogeneity in salinity and alkalinity. ② The OSC transform optimized the structure of the spectral data, which greatly improved the resolution ability under the complex background. CARS effectively screened out the characteristic bands related to salinity and alkalinity information; the SSC characteristic bands included 13 bands such as 450, 470, and 600 nm, and the pH characteristic bands included 15 bands such as 680, 730, and 740 nm. ③ The AdaBoost algorithm performed optimally for SSC inversion, with validation set Rp2, root mean square error (RMSE), and relative analysis error (RPD) of 0.852, 1.352, and 2.88, respectively, whereas pH was best predicted by the XGBoost model, which had an Rp2, RMSE, and RPD of 0.908, 0.151, and 3.31, respectively. 
④ SHAP analysis showed that the prediction models for SSC and pH reflected multifactorial synergies. Waveband and climate factors were the dominant factors in SSC modeling with a cumulative contribution of 80.8%. Soil attributes (24.88%) had the highest contribution to pH modeling, waveband data had the smallest contribution of 15.13%, microwave remote sensing data had limited contribution to salinity and alkalinity modeling, and the combination of multi-source data provided a strong support for the accurate monitoring of soil salinization and alkalization. The study conclusions help to promote sustainable land management and efficient agricultural production.
