Intelligent Porosity Prediction for Sandstone Reservoirs Using Machine Learning Techniques

Abstract

Accurate porosity prediction is essential for reliable reservoir characterization in data-limited and heterogeneous formations. Traditional approaches generally have difficulty handling the inherent complexities and uncertainties of well log data. This study applies and compares three machine learning (ML) approaches, namely an Artificial Neural Network optimized with Levenberg-Marquardt (ANN-LM), Random Forest (RF), and Fuzzy Logic (FL), along with a baseline Multiple Linear Regression (MLR) model, to estimate total porosity from standard geophysical well logs in three wells from the Mazalai Gas Field (MGF), Kohat Basin, Pakistan. The models use sonic, neutron porosity, bulk density, and gamma ray logs as input parameters. The ANN-LM model was trained using backpropagation and K-fold cross-validation. RF was implemented as an ensemble of decision trees with feature ranking, FL employed Gaussian membership functions in ten bins, and MLR served as a baseline linear method. Model performance was evaluated using the coefficient of determination (R²) and root mean square error (RMSE). ANN-LM showed the strongest generalizability and robustness, achieving R² = 0.99 and RMSE = 3.5 pu by effectively minimizing errors in complex, nonlinear, and heterogeneous data. RF and FL performed reasonably well, achieving R² of 0.89 and 0.85, respectively, but showed reduced generalization to unseen data. MLR demonstrated the lowest performance, with R² = 0.82. Additionally, a Taylor diagram analysis revealed that ANN-LM provided the most accurate and statistically consistent predictions, closely matching the reference data. These results show that machine learning, especially well-optimised neural networks, greatly improves porosity prediction from logs, strengthening reservoir evaluation and development planning in MGF-like settings.
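The two evaluation metrics named in the abstract can be sketched in a few lines. This is a minimal illustration using NumPy, not the authors' code; the function names and the four porosity values (in pu) are made up for the example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical measured and predicted porosity values (pu), for illustration only.
measured = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 12.0, 13.0, 16.0])

print(r_squared(measured, predicted))  # 0.9
print(rmse(measured, predicted))       # about 0.707
```

R² is unitless while RMSE carries the units of the target, which is why the abstract reports RMSE in porosity units.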

Similar Papers
  • Research Article
  • Cite Count Icon 36
  • 10.1016/j.jappgeo.2023.105067
Machine learning - a novel approach to predict the porosity curve using geophysical logs data: An example from the Lower Goru sand reservoir in the Southern Indus Basin, Pakistan
  • May 17, 2023
  • Journal of Applied Geophysics
  • Wakeel Hussain + 7 more


  • Addendum
  • Cite Count Icon 1
  • 10.1007/s12145-021-00706-2
Correction to: Application of fuzzy logic and neural networks for porosity analysis using well log data: an example from the Chanda Oil Field, Northwest Pakistan
  • Sep 22, 2021
  • Earth Science Informatics
  • Natasha Khan + 1 more

Fuzzy logic (FL) and neural network (NN) methods are commonly applied in a variety of areas in the petroleum industry. Hydrocarbon exploration has seen the greatest advancement of soft-computing technologies, including FL and NNs. In this study, FL and NN methods were applied to log data from the Chanda Oil Field, northwest Pakistan, for porosity (PHIT) prediction. The input dataset included four known logs, gamma ray (GR), neutron porosity (NPHI), density (RHOB), and sonic (DT), from two wells drilled in the Chanda Oil Field. For the FL model, ten bins were selected. The closeness of fit (Cfit) curves were calculated considering the most likely and second most likely curves. The weighted average final probability Pi, or the most likely solution, was also computed. The curve histogram distribution and the set of curve bin distribution cross plots were generated using the fuzzy model. In the FL model, the Gaussian membership function was the best fit for the well log data analyzed. The FL models show Cfit in the range of 92–100% for Chanda-1 and Chanda Deep-1, with standard deviations of 1.268 and 1.396, respectively. NN models were generated for Chanda-1 and Chanda Deep-1 in the Datta Formation reservoir interval. The NN model was trained using the back propagation (BP) algorithm and yields Cfit_nn in the range of 85–100% for the two wells, with standard deviations of 0.012 and 0.025. The results reveal a very good match between the log data and the predictions from the FL and NN methods. These techniques can be applied to reduce uncertainty in determining PHIT in wells. For comparison, multiple linear regression (MLR) analysis was performed using the same log dataset from the two studied wells. The coefficients of determination (R2) for the two wells were 0.5727 and 0.7988 for the FL model (PHIT_ml) and 0.6256 and 0.8527 for the NN model (PHIT_nn). In comparison, the corresponding values for MLR (PHIT_mlr) were 0.512 and 0.338. The higher R2 values indicate that FL and NNs are more reliable techniques for PHIT prediction than the MLR method. The application of FL and NN methods to well data indicates that these two methods can determine PHIT with an accuracy that rivals or exceeds that of statistical methods such as MLR. The corresponding correlation was obtained by comparing synthesized log values with real log values. The comparisons between the measured and predicted parameters indicated that both FL and NNs were successful in synthesizing PHIT logs. This paper indicates that, for the studied wells in the Chanda Oil Field, both FL and NNs are reliable, giving a realistic match between real and synthesized PHIT curves using a combination of GR, RHOB, NPHI, and DT logs.
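The ten-bin, Gaussian-membership idea described above can be sketched as follows. This is a generic illustration under stated assumptions, not the workflow used in the paper: the equal-count binning, the product rule for combining the per-log memberships, the weighted-average output, and all function names and synthetic data are assumptions made for this example.

```python
import numpy as np

def gaussian_membership(x, mu, sigma):
    """Degree (0 to 1) to which x belongs to a bin with mean mu and spread sigma."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def fit_bins(logs, porosity, n_bins=10):
    """Sort samples by porosity, split into equal-count bins, store per-bin statistics."""
    order = np.argsort(porosity)
    bins = []
    for idx in np.array_split(order, n_bins):
        bins.append({
            "phi": float(porosity[idx].mean()),       # representative porosity of the bin
            "mu": logs[idx].mean(axis=0),             # mean of each input log in the bin
            "sigma": logs[idx].std(axis=0) + 1e-9,    # spread of each log (small floor avoids /0)
        })
    return bins

def predict_porosity(bins, sample):
    """Membership-weighted average of bin porosities (assumed combination rule)."""
    weights = np.array([
        np.prod(gaussian_membership(sample, b["mu"], b["sigma"])) for b in bins
    ])
    phis = np.array([b["phi"] for b in bins])
    if weights.sum() == 0.0:
        # All memberships underflowed: fall back to the bin with the nearest log means.
        dists = [np.linalg.norm(sample - b["mu"]) for b in bins]
        return float(phis[int(np.argmin(dists))])
    return float(np.sum(weights * phis) / weights.sum())

# Synthetic two-log data set, invented purely for illustration.
rng = np.random.default_rng(0)
porosity = rng.uniform(5.0, 25.0, 200)
logs = np.column_stack([
    2.0 * porosity + rng.normal(0.0, 0.5, 200),
    100.0 - porosity + rng.normal(0.0, 0.5, 200),
])
bins = fit_bins(logs, porosity)
estimate = predict_porosity(bins, logs[0])
```

Because the output is a weighted average of bin means, an estimate from this sketch can never fall outside the range of the ten bin porosities, which is one practical limitation of coarse binning.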

  • Research Article
  • Cite Count Icon 3
  • 10.3390/app15116319
Modeling Soil Temperature with Fuzzy Logic and Supervised Learning Methods
  • Jun 4, 2025
  • Applied Sciences
  • Bilal Cemek + 4 more

Soil temperature is a critical environmental factor that affects plant development, physiological processes, and overall productivity. This study compares two modeling approaches for predicting soil temperature at various depths: (i) fuzzy logic-based systems, including the Mamdani fuzzy inference system (MFIS) and the adaptive neuro-fuzzy inference system (ANFIS); and (ii) supervised machine learning algorithms, such as multilayer perceptron (MLP), support vector regression (SVR), random forest (RF), extreme gradient boosting (XGB), and k-nearest neighbors (KNN), along with multiple linear regression (MLR) as a statistical benchmark. Soil temperature data were collected from Tokat, Türkiye, between 2016 and 2024 at depths of 5, 10, 20, 50, and 100 cm. The dataset was split into training (2016–2021) and testing (2022–2024) periods. Performance was evaluated using the root mean square error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R2). The ANFIS achieved the best prediction accuracy (MAE = 1.46 °C, RMSE = 1.89 °C, R2 = 0.95), followed by RF, XGB, MLP, KNN, SVR, MLR, and MFIS. This study underscores the potential of integrating machine learning and fuzzy logic techniques for more accurate soil temperature modeling, contributing to precision agriculture and better resource management.
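The chronological train/test split described above (2016–2021 for training, 2022–2024 for testing) and the MAE metric can be sketched as below. The function names and the toy arrays are hypothetical; only the year boundaries come from the abstract.

```python
import numpy as np

def chronological_split(years, train_end=2021):
    """Boolean masks: training rows up to and including train_end, test rows after."""
    years = np.asarray(years)
    train_mask = years <= train_end
    return train_mask, ~train_mask

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# One hypothetical record per year, for illustration only.
years = np.arange(2016, 2025)
train_mask, test_mask = chronological_split(years)

print(train_mask.sum(), test_mask.sum())        # 6 3
print(mae([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))    # 1.0
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for strongly seasonal series such as soil temperature.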

  • Research Article
  • Cite Count Icon 20
  • 10.3390/en17153768
Advancing Reservoir Evaluation: Machine Learning Approaches for Predicting Porosity Curves
  • Jul 31, 2024
  • Energies
  • Nafees Ali + 7 more

Porosity assessment is a vital component of reservoir evaluation in the oil and gas sector, and with technological advancement, reliance on conventional methods has decreased. This research aims to reduce reliance on well logging by proposing successive machine learning (ML) techniques for precise porosity measurement. Accordingly, it examines the prediction of porosity curves in the Sui main and Sui upper limestone reservoirs using ML approaches such as artificial neural networks (ANN) and fuzzy logic (FL). The input dataset includes gamma ray (GR), neutron porosity (NPHI), density (RHOB), and sonic (DT) logs from five wells drilled in the Qadirpur gas field. The ANN model was trained using the backpropagation algorithm. For the FL model, ten bins were used, and Gaussian-shaped membership functions were chosen for the best correspondence with the geophysical log dataset. The closeness of fit (C-fit) values for the ANN ranged from 91% to 98%, while those for the FL model ranged from 90% to 95% across the wells. In addition, a similar dataset was used to evaluate multiple linear regression (MLR) for comparative analysis. The ANN and FL models outperformed MLR, with R2 values of 0.955 (FL) and 0.988 (ANN) compared to 0.94 (MLR). The outcomes indicate that FL and ANN exceed MLR in predicting the porosity curve. Moreover, the high R2 values and lowest root mean square error (RMSE) values support the effectiveness of these advanced approaches. This research emphasizes the reliability of FL and ANN in predicting the porosity curve. Thus, these techniques not only enhance natural resource exploitation within the region but also hold broader potential for worldwide applications in reservoir assessment.
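A multiple linear regression baseline of the kind used for comparison above can be fitted with ordinary least squares. The sketch below uses NumPy's `np.linalg.lstsq`; the four input columns stand in for GR, NPHI, RHOB, and DT, and the coefficients and synthetic data are invented for illustration rather than taken from the paper.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply fitted coefficients to new rows of X."""
    return coef[0] + X @ coef[1:]

# Synthetic stand-ins for four well logs and a noise-free linear target.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
true_coef = np.array([3.0, 1.0, -2.0, 0.5, 4.0])   # hypothetical intercept and slopes
y = true_coef[0] + X @ true_coef[1:]

coef = fit_mlr(X, y)
y_hat = predict_mlr(coef, X)
```

On noise-free linear data the fit recovers the generating coefficients; on real logs, the gap between this baseline and a nonlinear model is what the R2 comparison in the abstract quantifies.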

  • Peer Review Report
  • 10.7554/elife.86291.sa1
Decision letter: VO2max prediction based on submaximal cardiorespiratory relationships and body composition in male runners and cyclists: a population study
  • Apr 4, 2023
  • Beat Knechtle + 1 more

Abstract
Background: Oxygen uptake (VO2) is one of the most important measures of fitness and a critical vital sign. Cardiopulmonary exercise testing (CPET) is a valuable method of assessing fitness in sport and clinical settings. There is a lack of large studies on athletic populations to predict VO2max using somatic or submaximal CPET variables. Thus, this study aimed to: (1) derive prediction models for maximal VO2 (VO2max) based on submaximal exercise variables at anaerobic threshold (AT) or respiratory compensation point (RCP) or only somatic variables and (2) internally validate the provided equations.
Methods: Four thousand four hundred twenty-four male endurance athletes (EA) underwent maximal symptom-limited CPET on a treadmill (n=3330) or cycle ergometer (n=1094). The cohort was randomly divided between: variables selection (nrunners = 1998; ncyclist = 656), model building (nrunners = 666; ncyclist = 219), and validation (nrunners = 666; ncyclist = 219). Random forest was used to select the most significant variables. Models were derived and internally validated with multiple linear regression.
Results: Runners were 36.24±8.45 years; BMI = 23.94 ± 2.43 kg·m−2; VO2max=53.81±6.67 mL·min−1·kg−1. Cyclists were 37.33±9.13 years; BMI = 24.34 ± 2.63 kg·m−2; VO2max=51.74±7.99 mL·min−1·kg−1. VO2 at AT and RCP were the most contributing variables to exercise equations. Body mass and body fat had the highest impact on the somatic equation. Model performance for VO2max based on variables at AT was R2=0.81, at RCP was R2=0.91, at AT and RCP was R2=0.91 and for somatic-only was R2=0.43.
Conclusions: Derived prediction models were highly accurate and fairly replicable. Formulae allow for precise estimation of VO2max based on submaximal exercise performance or somatic variables. Presented models are applicable for sport and clinical settings. They are a valuable supplementary method for fitness practitioners to adjust individualised training recommendations.
Funding: No external funding was received for this work.
Editor's evaluation
The authors have established new formulas to predict maximum oxygen uptake for cyclists and runners based on submaximal exercise testing and anthropometric characteristics. This is an important study with a large and comprehensive dataset, which may be helpful for many exercise labs. The work is convincing, using appropriate and validated methodology in line with the current state-of-the-art, as shown by references to common exercise books. https://doi.org/10.7554/eLife.86291.sa0
Introduction
Oxygen uptake (VO2) is considered an important metric in assessing cardiorespiratory fitness, health status, or endurance performance potential (Guazzi et al., 2012). With the application of standardised procedures and interpretation protocols during graded exercise tests (GXT), maximal oxygen uptake (VO2max) can be established (Bentley et al., 2007). GXT is the most widely used assessment to examine the dynamic relationship between exercise and integrated physiological systems (Albouaini et al., 2007; Bentley et al., 2007). The information from GXT during cardiopulmonary exercise testing (CPET) can be applied across the spectrum of sport performance, occupational safety screening, research, and clinical diagnostics (Guazzi et al., 2017). VO2max is often used as a boundary between severe and extreme intensity domains and by definition requires maximal effort from the tested subject (Gaesser and Poole, 1996). However, it is not always recommended or possible to undertake a test to exhaustion (Guazzi et al., 2012).
For the athletes, the proximity of competition or injury history can allow submaximal testing, but not testing to exhaustion (Sassi et al., 2006). Testing that requires maximal effort may be disruptive to the training process or interfere with race performance (Coutts et al., 2007; Lamberts et al., 2011). Due to practical constraints, tests to exhaustion or peak-power-output tests are often performed only two or three times a year (Coutts et al., 2007). However, VO2 values are widely used in sport science and the decision-making process (Mann et al., 2013). VO2 is widely considered one of the major endurance performance determinants (Joyner and Coyle, 2008). Using VO2max to guide the selection process, prescribing training intensity, assessing training adaptations, or predicting race times is a common practice in high-performance sports (Bassett and Howley, 2000; Bentley et al., 2007; Hawley and Noakes, 1992; Noakes et al., 1990). VO2max is also one of the critical vital signs coordinating the function of the cardiovascular, respiratory, and muscular systems, it is an indicator of overall body health status (Kaminsky et al., 2017). Quantifying VO2max provides additional input regarding clinical decision-making, risk stratification, evaluation of therapy, and physical activity guidelines (Guazzi et al., 2012). For patients undertaking a test to exhaustion is rarely needed or possible due to health restraints or cardiac risk (Guazzi et al., 2016). For many years researchers have studied indirect methods of estimating VO2max(Sartor et al., 2013). Protocols such as the Astrand-Ryhming Test, Six-Minute Walk Test, or YMCA Step Test have been established and validated (Astrand and Ryhming, 1954; Beutner et al., 2015; Carey, 2022; Jalili et al., 2018). 
Moreover, estimation of the VO2 and heart rate (HR) values below the ventilatory threshold can be based on cardiorespiratory kinetics assessment using randomised changes in the work rate, known as pseudo-random binary sequence testing (Hoffmann et al., 2022). However, with the development of technology, the accessibility of laboratory testing and mobile testing has improved (Montoye et al., 2020; Pritchard et al., 2021). Therefore, new opportunities arise to develop more precise yet simple and accessible methods and models to assess VO2max (Jurov et al., 2023). This appears to be especially important considering the low prediction accuracy of most of the VO2max formulae that were validated in our previous study (Wiecha et al., 2023). Recently, we have been observing the development of prediction methods that use machine learning (ML) and artificial intelligence (AI) (Ashfaq et al., 2022). Both ML and AI are used in sport science as forecasting and decision-making support tools (Abut and Akay, 2015; Bobowik and Wiszomirska, 2022; Chmait and Westerbeek, 2021; Hammes et al., 2022; Rossi et al., 2021). There is growing evidence that VO2max prediction based on ML models, especially support vector ML and artificial neural network models, exhibits more robust and accurate results compared to MLR only (Abut and Akay, 2015; Ashfaq et al., 2022). Therefore, in this research, with the support of ML, we look for algorithms and prediction patterns that allow us to use values obtained during submaximal CPET and somatic measurements to estimate VO2max values in male runners and cyclists. We hypothesise that prediction models allow for accurate calculation of VO2max based on somatic or submaximal CPET variables.
Materials and methods
We followed the TRIPOD guidelines for the development and validation of prediction models to conduct the study (see Supplementary Material 1, TRIPOD Checklist for Prediction Model Development and Validation) (Collins et al., 2015).
The study is based on retrospective data analysis from the CPET registry collected from 2013 to 2021 at the medical clinic (Sportslab, Warsaw, Poland). All CPET were performed at the individual request of participants, as a part of regular training monitoring or performance assessment.
Ethical approval
The Institutional Review Board of the Bioethical Committee at the Medical University of Warsaw (AKBE/32/2021) approved the study protocol. The regulations of the Declaration of Helsinki were met during all parts of the study. Each study participant gave written consent to undergo CPET and participate in the study.
Derivation cohort
We selected the cohort using rigorous inclusion/exclusion criteria. Due to the insufficient number of women in our database and the number of potential variables in the regression models for adequate power, we had to limit the analysis to the male population only (Martens and Logan, 2021). Out of 6439 healthy, adult male cyclists and long-distance runners who underwent CPET, 4423 met the following criteria: (1) age ≥18 years, (2) declared regular cycling or running training for ≥3 months, (3) no extreme outliers more than ±3 standard deviations (SD) from the mean in any of the testing variables (beyond ±3 SD in VO2max), (4) lack of any injury, medical condition, or addiction in medical history that may affect exercise capacity, (5) not taking any medications with a modifying effect on exercise capacity, (6) maximum exertion achieved during CPET.
We defined maximum exertion in CPET as the fulfilment of at least six of the following criteria: (1) respiratory exchange ratio (RER) ≥1.10, (2) present VO2 plateau (growth <100 mL·min–1 in VO2 despite increased running speed or cycling power), (3) respiratory frequency (fR) ≥45 breaths·min–1, (4) declared subjective exertion intensity during CPET ≥18 in the Borg scale (Borg, 1970), (5) blood lactate concentration [La-]b ≥8 mmol·L–1, (6) growth in speed/power ≥10% of respiratory compensation point (RCP) values after exceeding the RCP, (7) peak heart rate (HRpeak) ≥15 beats·min–1 below predicted maximal heart rate (HRmax) (Lach et al., 2021). The participant selection procedure is shown in Figure 1.
Figure 1. Flowchart of the preliminary inclusion and exclusion process. Abbreviations: EA, endurance athlete; CPET, cardiopulmonary exercise testing; SD, standard deviation; TE, treadmill; RER, respiratory exchange ratio; VO2, oxygen uptake (mL·min−1·kg−1); [La−]b, lactate concentration (mmol·L−1); fR, breathing frequency (breaths·min−1); RCP, respiratory compensation point; HRpeak, peak heart rate (beats·min−1); HRmax, maximal heart rate (bpm). At both stages of the selection, some participants met several (>1) exclusion criteria.
Somatic measurements and CPET protocols
Body mass was measured with a multifrequency (5 kHz/50 kHz/250 kHz) bioimpedance body composition (BC) analyser (Tanita, MC 718, Japan) in normal testing mode. The participants’ skin was cleaned with alcohol before placing the electrodes on the skin. Prior to the test, the participants received instructions to refrain from exercising for 2 hr, consume a light meal rich in carbohydrates 2–3 hr beforehand, and maintain hydration by drinking isotonic beverages. Additionally, they were advised to abstain from medications, caffeine, and cigarettes on the day of the test.
Running CPET (TE) was performed on a mechanical treadmill (h/p/Cosmos Quasar, Germany). Cycling CPET (CE) was performed on Cyclus-2 (RBM elektronik-automation GmbH, Leipzig, Germany). Hans Rudolph V2 mask (Hans Rudolph, Inc, Shawnee, KS, USA), breath-by-breath method with Cosmed Quark CPET gas exchange analysing device (Cosmed Srl, Rome, Italy), and Quark PFT Suite to Omnia 1.6 software were utilised. The gas analyser device was regularly calibrated with the reference gas (16% O2; 5% CO2) in accordance with the manufacturer’s instructions (Airgas USA, LLC, Plumsteadville, PA, USA). From 2013 to 2021, three Cosmed Quark CPET units were used. HR was measured with the Cosmed torso belt (Cosmed srl, Rome, Italy). [La-]b was measured via enzymatic-amperometric electrochemical technique with Super GL2 analyser (Müller Gerätebau GmbH, Freital, Germany). The [La-]b analyser was regularly calibrated before each measurement series. The 40 m2 indoor, air-conditioned laboratory with 20–22°C temperature and 40–60% humidity, and 100 m ASL provided the same conditions for all BC and CPET. Each CPET began with a 5 min personalised warm-up (walk or easy jog with ‘conversational’ intensity for running, easy pedalling with ‘conversational’ intensity for cycling). Then after the preparation (about 5 min), the continuous progressive step test was conducted. Due to the population diversity (training status), the running test speed started from 7 to 12 km·hr–1 with a 1% treadmill incline. The choice of initial starting speed was determined by the interview and sports results achieved. For example, those running less than 60 min at a distance of 10 km started the test at 7 km/hr, while those running 10 km for less than 35 min started the test at an initial speed of 12 km/hr. The pace increased by 1 km·hr–1 every 2 min with no change in incline. The cycling test began at 60–150 W, depending on the athletes training status. The power increased by 20–30 W every 2 min. 
It was recommended to maintain a constant cadence of 80–90 (repetition·min–1) during the test. The tests were terminated due to exhaustion: volitional inability to continue the activity and/or VO2 and HR plateau with increasing load and/or observed disturbance of coordination in running and/or inability to maintain the set cadence. Due to the graded protocol used, the cycling power and running speed values were calculated as a function of time to better reflect the actual level for the test moment being determined (Kuipers et al., 1985). Before the test, after every step, and 3 min after the termination of the effort, a technician took a 20 µL blood sample from a fingertip. Samples were collected during the test without interrupting the effort. The samples were taken from the initial puncture. The first blood drop was collected into the swab and the second blood drop was drawn into the capillary for further analysis. VO2max was recorded as the highest value (15 s intervals) before the termination of the test. HRmax was recorded as the highest value obtained at the end of the test, without averaging. The anaerobic threshold (AT) was established with the following criteria: (1) common start of VE/VO2 and VE/VCO2 curves, (2) end-tidal partial pressure of oxygen raised constantly with the end-tidal partial pressure of carbon dioxide (Beaver et al., 1986). The RCP was established with the following criteria: (1) PetCO2 must decrease after reaching maximal amount, (2) the presence of fast nonlinear growth in VE (second deflection), (3) the VE/VCO2 ratio achieved minimum and started to rise, and (4) a nonlinear increase in VCO2 versus VO2 (lack of linearity) (Beaver et al., 1986). The [La-]b was estimated for AT and RCP in relation to power or speed (Wiecha et al., 2022).
Data analysis
Our comprehensive ML approach enables the evaluation of each formula by preliminary variable precision (at the stage of selection), then accuracy (during the model’s building) and recall (in internal validation). Individual CPET results were saved to an Excel file (Microsoft Corporation, Redmond, WA, USA), and a custom-made Python script was used to generate the database in Excel. Further, mean, SD, and 95% confidence intervals (CI) were calculated. The normality of the distribution of the data was examined using the Shapiro-Wilk test and intergroup differences were calculated using the Student’s t-test for independent variables. Three-step variable selection procedures based on random forests were applied using the R package VSURF in RStudio software (R Core Team, Vienna, Austria; version 3.6.4) (Genuer et al., 2016). For each level of measurement (AT, RCP) and their combination (AT+RCP), significant variables were identified separately. The first step was dedicated to eliminating irrelevant variables from the dataset. The second step aimed to select all variables related to the response for interpretation purposes. The third step refined the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes (Genuer et al., 2017). For each variable selection, the anthropometric variables in Tables 1–2 and the CPET parameters in Tables 3–4 from a specific level of measurement (AT; RCP) and their combinations were considered.
Table 1. Basic anthropometric characteristics for runners.
Derivation group n=1998; testing group n=666; validation group n=666.
| Variable (unit) | Derivation mean | Derivation CI | Derivation SD | Testing mean | Testing CI | Testing SD | Validation mean | Validation CI | Validation SD |
|---|---|---|---|---|---|---|---|---|---|
| Age (years) | 36.2 | 35.6–36.9 | 8.45 | 35.9 | 35.5–36.3 | 8.05 | 35.5 | 34.9–36.2 | 8.14 |
| Height (cm) | 180.0 | 179.6–180.5 | 6.04 | 179.4 | 179.1–179.7 | 6.13 | 179.7 | 179.2–180.2 | 6.61 |
| BM (kg) | 77.7 | 77.0–78.4 | 9.35 | 77.7 | 77.3–78.1 | 9.29 | 77.9 | 77.1–78.6 | 10.1 |
| BMI (kg·m–2) | 23.9 | 23.8–24.1 | 2.43 | 24.1 | 24.0–24.2 | 2.41 | 24.1 | 23.9–24.3 | 2.56 |
| BF (%) | 15.4 | 15.1–15.7 | 4.55 | 15.5 | 15.3–15.7 | 4.52 | 15.4 | 15.1–15.8 | 4.55 |
| FM (kg) | 12.2 | 11.9–12.6 | 4.68 | 12.3 | 12.1–12.5 | 4.65 | 12.3 | 11.9–12.7 | 4.92 |
| FFM (kg) | 65.5 | 65.0–66.0 | 6.43 | 65.4 | 65.1–65.7 | 6.31 | 65.6 | 65.1–66.1 | 6.86 |
BM, body mass; BMI, body mass index; BF, body fat; FM, fat mass; FFM, fat-free mass; CI, 95% confidence interval; SD, standard deviation.
Table 2. Basic anthropometric characteristics for cyclists.
Derivation group n=656; testing group n=219; validation group n=219.
| Variable (unit) | Derivation mean | Derivation CI | Derivation SD | Testing mean | Testing CI | Testing SD | Validation mean | Validation CI | Validation SD |
|---|---|---|---|---|---|---|---|---|---|
| Age (years) | 37.3 | 36.6–38.0 | 9.13 | 37.1 | 35.9–38.4 | 9.50 | 37.6 | 36.5–38.8 | 8.46 |
| Height (cm) | 179.9 | 179.4–180.4 | 6.27 | 180.1 | 179.2–181.0 | 6.96 | 180.2 | 179.4–181.0 | 6.13 |
| BM (kg) | 78.8 | 78.1–79.6 | 9.80 | 79.1 | 77.7–80.5 | 10.4 | 79.8 | 78.4–81.3 | 10.9 |
| BMI (kg·m–2) | 24.3 | 24.1–24.6 | 2.63 | 24.4 | 24.0–24.7 | 2.80 | 24.6 | 24.2–25.0 | 2.96 |
| BF (%) | 16.4 | 15.7–17.1 | 4.99 | 16.1 | 15.7–16.5 | 4.81 | 16.2 | 15.5–16.8 | 4.87 |
| FM (kg) | 13.3 | 12.6–14.1 | 5.66 | 13.0 | 12.6–13.4 | 5.27 | 13.3 | 12.5–14.0 | 5.85 |
| FFM (kg) | 65.8 | 64.9–66.6 | 6.25 | 65.8 | 65.4–66.3 | 6.06 | 66.6 | 65.7–67.4 | 6.58 |
BM, body mass; BMI, body mass index; BF, body fat; FM, fat mass; FFM, fat-free mass; CI, 95% confidence interval; SD, standard deviation.
Table 3. Cardiopulmonary exercise testing (CPET) characteristics for runners.
Derivation group n=1998; testing group n=666; validation group n=666.
| Variable (unit) | Derivation mean | Derivation CI | Derivation SD | Testing mean | Testing CI | Testing SD | Validation mean | Validation CI | Validation SD |
|---|---|---|---|---|---|---|---|---|---|
| rVO2AT (mL·min–1·kg–1) | 38.4 | 38.1–38.8 | 5.01 | 38.5 | 38.3–38.7 | 4.88 | 38.1 | 37.7–38.5 | 5.16 |
| RERAT | 0.87 | 0.86–0.87 | 0.04 | 0.87 | 0.86–0.87 | 0.04 | 0.87 | 0.86–0.87 | 0.04 |
| HRAT (beats·min–1) | 151.5 | 150.8–152.3 | 10.3 | 151.0 | 150.6–151.5 | 10.8 | 152.0 | 151.2–152.8 | 10.8 |
| VEAT (L·min–1) | 79.1 | 78.1–80.0 | 12.2 | 78.3 | 77.8–78.9 | 12.0 | 77.2 | 76.3–78.2 | 12.0 |
| SPEEDAT (km·h–1) | 11.0 | 10.9–11.1 | 1.45 | 11.0 | 11.0–11.1 | 1.36 | 10.9 | 10.8–11.0 | 1.42 |
| LAAT (mmol·L–1) | 2.08 | 2.02–2.14 | 0.63 | 1.80 | 1.76–1.83 | 0.62 | 2.35 | 2.27–2.42 | 0.72 |
| rVO2RCP (mL·min–1·kg–1) | 47.5 | 47.0–48.0 | 5.88 | 47.7 | 47.4–48.0 | 6.15 | 47.3 | 46.8–47.8 | 6.16 |
| RERRCP | 1.00 | 1.00–1.00 | 0.04 | 1.00 | 1.00–1.00 | 0.04 | 1.00 | 1.00–1.00 | 0.03 |
| HRRCP (beats·min–1) | 173.4 | 172.7–174.1 | 9.21 | 173.2 | 172.8–173.6 | 9.30 | 174.3 | 173.5–175.0 | 9.50 |
| VERCP (L·min–1) | 114.7 | 113.5–116.0 | 15.9 | 113.9 | 113.1–114.6 | 16.7 | 112.7 | 111.4–114.0 | 16.2 |
| SPEEDRCP (km·h–1) | 14.0 | 13.9–14.1 | 1.77 | 14.1 | 14.0–14.1 | 1.70 | 13.9 | 13.8–14.1 | 1.75 |
| LARCP (mmol·L–1) | 4.72 | 4.63–4.82 | 1.04 | 4.40 | 4.34–4.45 | 1.04 | 4.81 | 4.69–4.93 | 1.14 |
| rVO2max (mL·min–1·kg–1) | 53.8 | 53.3–54.3 | 6.67 | 54.3 | 54.0–54.6 | 6.95 | 53.8 | 53.3–54.3 | 7.09 |
CI, 95% confidence interval; SD, standard deviation; rVO2AT, oxygen uptake at anaerobic threshold relative to body mass; RERAT, respiratory exchange ratio at anaerobic threshold; HRAT, heart rate at anaerobic threshold; VEAT, pulmonary ventilation at anaerobic threshold; SPEEDAT, velocity at anaerobic threshold; LAAT, blood lactate concentration at anaerobic threshold; rVO2RCP, oxygen uptake at respiratory compensation point relative to body mass; RERRCP, respiratory exchange ratio at respiratory compensation point; HRRCP, heart rate at respiratory compensation point; VERCP, pulmonary ventilation at respiratory compensation point; SPEEDRCP, velocity at respiratory compensation point; LARCP, blood lactate concentration at respiratory compensation point; rVO2max, maximal oxygen uptake relative to body mass.
Table 4. Cardiopulmonary exercise testing (CPET) characteristics for cyclists.
Derivation group n=656; testing group n=219; validation group n=219.
| Variable (unit) | Derivation mean | Derivation CI | Derivation SD | Testing mean | Testing CI | Testing SD | Validation mean | Validation CI | Validation SD |
|---|---|---|---|---|---|---|---|---|---|
| rVO2AT (mL·min–1·kg–1) | 33.0 | 32.5–33.4 | 5.84 | 33.2 | 32.4–33.9 | 5.68 | 33.7 | 32.9–34.5 | 5.89 |
| RERAT | 0.87 | 0.87–0.87 | 0.04 | 0.87 | 0.87–0.88 | 0.04 | 0.87 | 0.87–0.88 | 0.04 |
| HRAT | | | | | | | | | |
CI, 95% confidence interval; SD, standard deviation; rVO2AT, oxygen uptake at anaerobic threshold relative to body mass; RERAT, respiratory exchange ratio at anaerobic threshold; HRAT, heart rate at anaerobic threshold; VEAT, pulmonary ventilation at anaerobic threshold; power at anaerobic threshold relative to body mass; LAAT, blood lactate concentration at anaerobic threshold; rVO2RCP, oxygen uptake at respiratory compensation point relative to body mass; RERRCP, respiratory exchange ratio at respiratory compensation point; HRRCP, heart rate at respiratory compensation point; VERCP, pulmonary ventilation at respiratory compensation point; LARCP, blood lactate concentration at respiratory compensation point; power at respiratory compensation point relative to body mass; rVO2max, maximal oxygen uptake relative to body mass.
selection variables were in the further only selected parameters were into multiple linear regression The data for MLR model building were randomly into that is testing, validation and of the a only significant were in the Derived are by the of mean and mean analysis was used to the model’s precision and accuracy during validation and tests to the fulfilment of MLR test the of in MLR test assessment between and test of Each model was examined the and any have not been 2 package in RStudio (R Core Team, Vienna, Austria; version version for and software version were used in was considered as the Results Somatic measurements and CPET results data of the runners models for testing, and validation are in Table while cyclists are in Table The runners of and for testing, and validation the cyclists and differences between of runners and cyclists were in BMI and between testing in all between validation only in CPET results for runners models are in Table 3 and for cyclists in Table Runners in the cohort achieved relative to body mass VO2max of in testing group and in validation group cyclists mean was and for testing, and validation to body mass oxygen uptake at anaerobic threshold in runners for ± ± and ± of in testing, and validation it was ± ± and ± of rVO2max, relative to body mass oxygen uptake at respiratory compensation point in runners for ± ± and ± of for testing, and validation while in cyclists for ± ± and ± of rVO2max, There were no significant differences in values between testing, and validation the runners and cyclists between runners and cyclists results were all significant Prediction models based on AT and RCP Full of MLR prediction models for cyclists are in Table for runners in Table The models prediction performance is as with and for cyclists from for somatic parameters to for RCP equations. For runners from for to for AT and equations. for cyclists models was the for RCP and the highest for For from for AT and to for equation. 
observed for cyclists was the for RCP in the validation group and the highest for while in runners the for AT and and the highest for The performance of prediction is in Figure Figure 2 Download asset Open asset of prediction for Abbreviations: maximal oxygen anaerobic threshold; RCP, respiratory compensation point; All values are in performance for running while the performance for cycling equations. performance of the prediction model for for for AT and for somatic-only equation. Table 5 VO2max prediction for cyclists. linear regression group group = = = = based on anaerobic threshold; RCP, based on respiratory compensation point; based on somatic variables mean mean maximal oxygen uptake relative to body mass rVO2AT, oxygen uptake at anaerobic threshold relative to body mass power at anaerobic threshold relative to body mass rVO2RCP, oxygen uptake at respiratory compensation point relative to body mass VERCP, pulmonary ventilation at respiratory compensation point BF, body fat BM, body mass Table VO2max prediction for runners. linear regression group group = = = = based on anaerobic threshold; RCP, based on respiratory compensation point; based on somatic variables mean mean maximal oxygen uptake relative to body mass rVO2AT, oxygen uptake at anaerobic threshold relative to body mass SPEEDAT, velocity at anaerobic threshold FFM, fat mass VEAT, pulmonary ventilation at anaerobic threshold HRAT, heart rate at anaerobic threshold BF, body fat rVO2RCP, oxygen uptake at respiratory compensation point relative to body mass SPEEDRCP, velocity at respiratory compensation point Models validation of each model for cyclists is in Table while for runners in Table the performance of our prediction was to that observed in the

  • Research Article
  • Citations: 56
  • 10.3390/rs13091658
Machine Learning Techniques for Fine Dead Fuel Load Estimation Using Multi-Source Remote Sensing Data
  • Apr 23, 2021
  • Remote Sensing
  • Marina D’Este + 5 more

Fine dead fuel load is one of the most significant components of wildfires, without which ignition would fail. Several studies have previously investigated 1-h fuel load using standard fuel parameters or site-specific fuel parameters estimated ad hoc for the landscape. On the one hand, these methods have a large margin of error, while on the other their production times and costs are high. In response to this gap, a set of models was developed combining multi-source remote sensing data, field data and machine learning techniques to quantitatively estimate fine dead fuel load and understand its determining factors. The objectives of the study were therefore to: (1) estimate 1-h fuel loads using remote sensing predictors and machine learning techniques; (2) evaluate the performance of each machine learning technique compared to traditional linear regression models; (3) assess the importance of each remote sensing predictor; and (4) map the 1-h fuel load in a pilot area of the Apulia region (southern Italy). To this end, fine dead fuel load was estimated by integrating field inventory data (251 plots), Synthetic Aperture Radar (SAR, Sentinel-1), optical (Sentinel-2), and Light Detection and Ranging (LIDAR) data and applying three different algorithms: Multiple Linear Regression (MLR), Random Forest (RF), and Support Vector Machine (SVM). Model performances were evaluated using Root Mean Squared Error (RMSE), Mean Squared Error (MSE), the coefficient of determination (R2) and Pearson’s correlation coefficient (r). The results showed that RF (RMSE: 0.09; MSE: 0.01; r: 0.71; R2: 0.50) had more predictive power than the other models, while SVM (RMSE: 0.10; MSE: 0.01; r: 0.63; R2: 0.39) and MLR (RMSE: 0.11; MSE: 0.01; r: 0.63; R2: 0.40) showed similar performances. LIDAR variables (Canopy Height Model and Canopy Cover) were more important in fuel estimation than optical and radar variables. 
In fact, the results highlighted a positive relationship between 1-h fuel load and the presence of the tree component. Conversely, the geomorphological variables appeared to have lower predictive power. Overall, the 1-h fuel load map developed by the RF model can be a valuable tool to support decision making and can be used in regional wildfire risk management.
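
The four evaluation metrics named in this abstract can be sketched in a few lines. The snippet below is only an illustration: the sample values are hypothetical, the function name `regression_metrics` is not from the paper, and R2 is computed as 1 − SS_res/SS_tot, which may differ from the definition the authors used (the reported R2 values are close to the square of the reported r, e.g. 0.71² ≈ 0.50).

```python
import math


def regression_metrics(observed, predicted):
    """Return MSE, RMSE, R^2 and Pearson's r for paired values."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    # Residual and total sums of squares.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    # Terms needed for Pearson's correlation coefficient.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
        "r": cov / math.sqrt(ss_tot * var_pred),
    }


# Hypothetical fuel-load values, for illustration only.
observed = [0.20, 0.35, 0.50, 0.65, 0.80]
predicted = [0.25, 0.30, 0.55, 0.60, 0.75]
print(regression_metrics(observed, predicted))
```

Because RMSE is the square root of MSE, an RMSE near 0.09 to 0.11 corresponds to an MSE of roughly 0.01, which matches the rounded figures quoted above.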

  • Research Article
  • 10.1016/j.mlwa.2026.100880
Comparing allometric models to machine learning models for aboveground biomass estimation in agroforestry systems in Kenya
  • Jun 1, 2026
  • Machine Learning with Applications
  • Samuel Irungu Kigotho + 5 more


  • Research Article
  • Citations: 8
  • 10.3390/toxics13030170
Long-Term Retrospective Predicted Concentration of PM2.5 in Upper Northern Thailand Using Machine Learning Models.
  • Feb 27, 2025
  • Toxics
  • Sawaeng Kawichai + 4 more

This study aims to build, for the first time, a model that uses a machine learning (ML) approach to predict long-term retrospective PM2.5 concentrations in upper northern Thailand, a region impacted by biomass burning and transboundary pollution. The dataset includes PM10 levels, fire hotspots, and critical meteorological data from 1 January 2011 to 31 December 2020. ML techniques, namely multi-layer perceptron neural network (MLP), support vector machine (SVM), multiple linear regression (MLR), decision tree (DT), and random forests (RF), were used to construct the prediction models. The best ML prediction model was selected considering root mean square error (RMSE), mean prediction error (MPE), relative prediction error (RPE) (the lower, the better), and coefficient of determination (R2) (the bigger, the better). Our study found that the ML model-based RF technique using PM10, CO2, O3, fire hotspots, air pressure, rainfall, relative humidity, temperature, wind direction, and wind speed performs the best when predicting the concentration of PM2.5 with an RMSE of 6.82 µg/m3, MPE of 4.33 µg/m3, RPE of 22.50%, and R2 of 0.93. The RF prediction model of PM2.5 used in this research could support further studies of the long-term effects of PM2.5 concentration on human health and related issues.

  • Research Article
  • Citations: 25
  • 10.1016/j.geoen.2023.211511
Integrating drilling parameters and machine learning tools to improve real-time porosity prediction of multi-zone reservoirs. Case study: Rhourd Chegga oilfield, Algeria
  • Feb 2, 2023
  • Geoenergy Science and Engineering
  • Abdelhamid Ouladmansour + 4 more


  • Research Article
  • Citations: 19
  • 10.3168/jds.2021-20158
Integrating heterogeneous across-country data for proxy-based random forest prediction of enteric methane in dairy cattle
  • Mar 26, 2022
  • Journal of Dairy Science
  • Enyew Negussie + 18 more

Direct measurements of methane (CH4) from individual animals are difficult and expensive. Predictions based on proxies for CH4 are a viable alternative. Most prediction models are based on multiple linear regressions (MLR) and predictor variables that are not routinely available in commercial farms, such as dry matter intake (DMI) and diet composition. The use of machine learning (ML) algorithms to predict CH4 emissions from across-country heterogeneous data sets has not been reported. The objectives were to compare performances of ML ensemble algorithm random forest (RF) and MLR models in predicting CH4 emissions from proxies in dairy cows, and assess effects of imputing missing data points on prediction accuracy. Data on CH4 emissions and proxies for CH4 from 20 herds were provided by 10 countries. The integrated data set contained 43,519 records from 3,483 cows, with 18.7% missing data points imputed using k-nearest neighbor imputation. Three data sets were created: 3k (no missing records), 21k (missing DMI imputed from milk, fat, protein, body weight), and 41k (missing DMI, milk fat, and protein records imputed). These data sets were used to test scenarios (with or without DMI, imputed vs. nonimputed DMI, milk fat, and protein), and prediction models (RF vs. MLR). Model predictive ability was evaluated within and between herds through 10-fold cross-validation. Prediction accuracy was measured as correlation between observed and predicted CH4, root mean squared error (RMSE) and mean normalized discounted cumulative gain (NDCG). Compared with models without DMI, including DMI improved within- and between-herd prediction accuracy to 0.77 (RMSE = 23.3%) and 0.58 (RMSE = 31.9%), respectively, in RF and to 0.50 (RMSE = 0.327) and 0.13 (RMSE = 42.71), respectively, in MLR. When missing DMI records were imputed, within- and between-herd accuracy increased to 0.84 (RMSE = 18.5%) and 0.63 (RMSE = 29.9%), respectively. 
In all scenarios, RF models out-performed MLR models. Results suggest routinely measured variables from dairy farms can be used in developing globally robust prediction models for CH4 if coupled with state-of-the-art techniques for imputation and advanced ML algorithms for predictive modeling.
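
The k-nearest neighbor imputation mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not describe the implementation, so everything here is an assumption: the function name `knn_impute`, the record keys (`milk`, `bw`, `dmi`), the Euclidean distance on unscaled features, and the choice of averaging the neighbours' values.

```python
import math


def knn_impute(records, target, features, k=3):
    """Fill missing `target` values with the mean of the k nearest complete records.

    Distance is Euclidean over `features`. Records that lack any feature are
    left unchanged. The input list is not modified; copies are returned.
    """
    complete = [
        r for r in records
        if r.get(target) is not None and all(r.get(f) is not None for f in features)
    ]
    imputed = []
    for rec in records:
        row = dict(rec)  # work on a copy so the caller's data stays intact
        has_features = all(row.get(f) is not None for f in features)
        if row.get(target) is None and has_features:
            point = [row[f] for f in features]
            neighbours = sorted(
                complete,
                key=lambda c: math.dist(point, [c[f] for f in features]),
            )[:k]
            if neighbours:
                row[target] = sum(n[target] for n in neighbours) / len(neighbours)
        imputed.append(row)
    return imputed


# Hypothetical cow records: milk yield (kg/d), body weight (kg), DMI (kg/d).
cows = [
    {"milk": 30.0, "bw": 600.0, "dmi": 20.0},
    {"milk": 31.0, "bw": 610.0, "dmi": 22.0},
    {"milk": 10.0, "bw": 400.0, "dmi": 10.0},
    {"milk": 30.5, "bw": 605.0, "dmi": None},
]
print(knn_impute(cows, target="dmi", features=["milk", "bw"], k=2)[-1])
```

In practice the features would normally be standardized first, since body weight in kilograms would otherwise dominate the distance.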

  • Research Article
  • Citations: 15
  • 10.2166/aqua.2023.016
Analysis of extreme annual rainfall in North-Eastern India using machine learning techniques
  • Nov 27, 2023
  • AQUA — Water Infrastructure, Ecosystems and Society
  • Shivam Agarwal + 2 more

The machine learning techniques of Multiple Linear Regression (MLR), Generalized Additive Models (GAMs), and the Random Forest (RF) Method have been used to analyze the extreme annual rainfall in the six states of Assam, Meghalaya, Tripura, Mizoram, Manipur, and Nagaland in North-Eastern (NE) India. Latitude, longitude, altitude, and temperature were the covariates that were used in this study. Ordinary Kriging was used to interpolate the predicted outcomes of each dataset. Statistical metrics like Mean Absolute Errors (MAE), Root Mean Square Error (RMSE), Coefficients of Determination (COD-R2), and Nash–Sutcliffe Efficiency (NSE) were also assessed. When compared to satellite rainfall data, all techniques performed significantly better for ground rainfall data. For prediction, GAM's predicted rainfall values triumph over MLR or RF. RF ranks a close second, while the linearity of MLR prohibits it from making precise predictions for a physical phenomenon like rainfall. The MAE and RMSE of GAM forecasts are significantly lower than those of MLR and RF in most circumstances. Additionally, the COD and NSE of GAM predictions are significantly better than both MLR and RF in most cases, showing that GAM, out of MLR, GAM, and RF, is the best model for predicting rain in our research area.

  • Research Article
  • Citations: 7
  • 10.3390/su162411077
Prediction of Potential Evapotranspiration via Machine Learning and Deep Learning for Sustainable Water Management in the Murat River Basin
  • Dec 17, 2024
  • Sustainability
  • Ibrahim A Hasan + 1 more

Potential evapotranspiration (PET) is a significant factor contributing to water loss in hydrological systems, making it a critical area of research. However, accurately calculating and measuring PET remains challenging due to the limited availability of comprehensive data. This study presents a detailed sustainable model for predicting PET using the Thornthwaite equation, which requires only mean monthly temperature (Tmean) and latitude, with calculations performed using R-Studio. A geographic information system (GIS) was employed to interpolate meteorological data, ensuring coverage of all sub-basins within the Murat River basin, the study area. Additionally, Python libraries were utilized to implement artificial intelligence-driven models, incorporating both machine learning and deep learning techniques. The study harnesses the power of artificial intelligence (AI), applying deep learning through a convolutional neural network (CNN) and machine learning techniques, including support vector machine (SVM) and random forest (RF). The results demonstrate promising performance across the models. For CNN, the coefficient of determination (R2) varied from 96.2 to 98.7%, the mean squared error (MSE) ranged from 0.287 to 0.408, and the root mean squared error (RMSE) was between 0.541 and 0.649. For SVM, the R2 varied from 94.5 to 95.6%, MSE ranged between 0.981 and 1.013, and RMSE ranged from 0.990 to 1.014. RF showed the best performance, achieving an R2 of 100%, MSE values of 0.326 and 0.640, and corresponding RMSE values of 0.571 and 0.800. The climate and topography data used for all algorithms were consistent, and the results indicate that the RF model outperforms the others. Consequently, the RF model’s superior accuracy highlights its potential as a reliable tool for sustainable PET prediction, supporting informed decision-making in water resource planning. 
By leveraging GIS, AI, and machine learning, this study enhances PET modeling methodologies, addressing critical water management challenges and promoting sustainable hydrological practices in the face of climate change and resource limitations.
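
For readers unfamiliar with the Thornthwaite equation cited in this abstract, the sketch below shows the standard textbook form, which needs only twelve mean monthly temperatures and a latitude. It is not the authors' R code: the function names, the FAO-56 solar declination approximation used for day length, the mid-month day numbers, and the sample temperatures and latitude are all assumptions made for illustration. The high-temperature variant of the equation (above about 26.5 °C) is not handled.

```python
import math

DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
MID_MONTH_DAY = [15, 46, 74, 105, 135, 166, 196, 227, 258, 288, 319, 349]


def day_length_hours(latitude_deg, day_of_year):
    """Day length in hours from latitude (FAO-56 declination approximation)."""
    phi = math.radians(latitude_deg)
    delta = 0.409 * math.sin(2 * math.pi * day_of_year / 365 - 1.39)
    # Clamp so that polar day and polar night do not break acos.
    x = max(-1.0, min(1.0, -math.tan(phi) * math.tan(delta)))
    return 24 / math.pi * math.acos(x)


def thornthwaite_pet(monthly_tmean_c, latitude_deg):
    """Monthly PET (mm) from 12 mean monthly temperatures (deg C) and latitude."""
    if len(monthly_tmean_c) != 12:
        raise ValueError("expected 12 monthly temperatures")
    # Annual heat index: only months above freezing contribute.
    heat_index = sum((t / 5) ** 1.514 for t in monthly_tmean_c if t > 0)
    if heat_index == 0:
        return [0.0] * 12
    a = (6.75e-7 * heat_index ** 3 - 7.71e-5 * heat_index ** 2
         + 1.792e-2 * heat_index + 0.49239)
    pet = []
    for t, days, mid in zip(monthly_tmean_c, DAYS_IN_MONTH, MID_MONTH_DAY):
        if t <= 0:
            pet.append(0.0)
            continue
        unadjusted = 16 * (10 * t / heat_index) ** a  # 30-day month, 12-h days
        correction = (day_length_hours(latitude_deg, mid) / 12) * (days / 30)
        pet.append(unadjusted * correction)
    return pet


# Hypothetical temperatures and latitude, for illustration only.
tmean = [-5, -2, 3, 9, 14, 19, 23, 23, 18, 11, 4, -2]
print([round(v, 1) for v in thornthwaite_pet(tmean, latitude_deg=39.0)])
```

Latitude enters only through the day-length correction, which is why the method can be applied wherever a monthly temperature record exists.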

  • Front Matter
  • Citations: 37
  • 10.1093/bioinformatics/btr585
The rise and fall of supervised machine learning techniques
  • Dec 5, 2011
  • Bioinformatics
  • Lars Juhl Jensen + 1 more

Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larranaga et al., 2006; Tarca et al., 2007). In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods. Through many years of editing and reviewing manuscripts, we noticed that some supervised machine learning techniques seem to be gaining in popularity while others seemed, at least to our eyes, to be looking ‘unfashionable’. We were motivated to create a league table of machine learning techniques to learn what is hot and what is not in the machine learning field. In this editorial, we only include those that we considered major league and leave analysis of the minor league methods as an exercise for the interested reader. To create our league table, we created a list of supervised machine learning techniques commonly used in bioinformatics and their common synonyms, plural forms and abbreviations. We then searched this list against the PubMed titles and abstracts to identify the number of papers published per year for each machine learning technique. To match as many papers as possible, searches were case insensitive and allowed for variation in hyphenation. To our surprise, the artificial neural network (ANN) is not only the dominant league leader in 2011 but has been in this position since at least the 1970s (see Fig. 1). However, in recent years the usage of support vector machines (SVMs) grew tremendously, and we predict that SVMs will challenge ANNs for the dominant position in the coming decade. Since 2007 the number of publications using ANNs has decreased by 21%, which we hypothesize may be directly attributed to researchers increasingly using SVMs in place of ANNs. SVMs caught up with and overtook Markov models in 2004 to gain second spot in our machine learning league. Fig. 1. The growth of supervised machine learning methods in PubMed. 
As for the question of ‘what is hot?’, one can see that Random forests are a rapidly growing method with not a single mention of them before 2003 and now a total of 407 papers published to date. We were hoping to find techniques that were not so hot and perhaps going out of fashion. The results show that none of the major league methods has gone out of fashion, but we do see moderate decreases in the use of both ANNs and Markov models in the literature. We were also curious to find out if certain machine learning techniques were used in combination with each other. To investigate this, we looked at what machine learning methods are co-mentioned in articles (See Fig. 2). For all pairs of methods from the Supervised Machine Learning Top-5, we counted the number of abstracts that mention both methods and normalized the counts with the number of co-occurrences that would be expected by chance (based on the frequencies with which the methods are mentioned over the years). The strongest correlation (185 times higher than random expectation) is seen between decision trees and random forests, which is to be expected as random forests are ensembles of decision trees. Apart from this, the next strongest correlation (88 times higher than random expectation) is found between the two newest methods on the list, namely SVMs and random forests. We hypothesize that this is due to many researchers using these algorithms through machine learning frameworks such as Weka (Frank et al., 2004), which allows many different algorithms to easily be applied to the same dataset. Fig. 2. Heatmap showing the co-occurrence of machine learning techniques within articles. Applications of supervised machine learning methodology continue to grow in the biomedical literature. Despite new methods growing in usage, for example support vector machines and random forests, we see little evidence that any widely adopted methods are falling out of use. Conflict of Interest: none declared.
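
The co-mention analysis described in this editorial (observed co-occurrences divided by the number expected by chance) can be sketched as follows. The editorial does not give the exact formula, so this is an assumption: the expected count is taken as the per-year product of the two methods' mention counts divided by the number of abstracts that year, summed over years. The function name and the toy corpus are hypothetical.

```python
def cooccurrence_enrichment(abstracts_by_year, method_a, method_b):
    """Observed co-mentions divided by the count expected under independence.

    `abstracts_by_year` maps a year to a list of sets; each set holds the
    methods mentioned in one abstract.
    """
    observed = 0
    expected = 0.0
    for abstracts in abstracts_by_year.values():
        total = len(abstracts)
        if total == 0:
            continue
        count_a = sum(method_a in mentioned for mentioned in abstracts)
        count_b = sum(method_b in mentioned for mentioned in abstracts)
        observed += sum(
            method_a in mentioned and method_b in mentioned
            for mentioned in abstracts
        )
        # Under independence, P(a and b) = P(a) * P(b) within a given year.
        expected += count_a * count_b / total
    return observed / expected if expected else float("nan")


# Toy corpus, for illustration only.
corpus = {
    2010: [{"RF", "DT"}, {"RF", "DT"}, {"SVM"}, {"ANN"}],
}
print(cooccurrence_enrichment(corpus, "RF", "DT"))
```

A ratio above 1 means the two methods are mentioned together more often than chance would predict, which is how figures such as "185 times higher than random expectation" should be read.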

  • Research Article
  • Citations: 3
  • 10.7494/geol.2023.49.3.245
An advanced ensemble modeling approach for predicting carbonate reservoir porosity from seismic attributes
  • Sep 6, 2023
  • Geology, Geophysics and Environment
  • Tomasz Topór + 1 more

This study uses a machine learning (ML) ensemble modeling approach to predict porosity from multiple seismic attributes in one of the most promising Main Dolomite hydrocarbon reservoirs in NW Poland. The presented workflow tests five different model types of varying complexity: K-nearest neighbors (KNN), random forests (RF), extreme gradient boosting (XGB), support vector machine (SVM), single layer neural network with multilayer perceptron (MLP). The selected models are additionally run with different configurations originating from the pre-processing stage, including Yeo–Johnson transformation (YJ) and principal component analysis (PCA). The race ANOVA method across resample data is used to tune the best hyperparameters for each model. The model candidates and the role of different pre-processors are evaluated based on standard ML metrics – coefficient of determination (R2), root mean squared error (RMSE), and mean absolute error (MAE). The model stacking is performed on five model candidates: two KNN, two XGB, and one SVM PCA with a marginal role. The results of the ensemble model showed superior accuracy over single learners, with all metrics (R2 0.890, RMSE 0.0252, MAE 0.168). It also turned out to be almost three times better than the neural net (NN) results obtained from commercial software on the same testing set (R2 0.318, RMSE 0.0628, MAE 0.0487). The spatial distribution of porosity from the ensemble model indicated areas of good reservoir properties that overlap with hydrocarbon production fields. This observation completes the evaluation of the ensemble technique results from model metrics. Overall, the proposed solution is a promising tool for better porosity prediction and understanding of heterogeneous carbonate reservoirs from multiple seismic attributes.
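
The Yeo–Johnson transformation used as a pre-processing step in this abstract has a simple closed form, sketched below for a single value and a given power parameter. This is the standard published definition, not the authors' code; in practice the parameter (here `lam`, an assumed name) is estimated from the data, typically by maximum likelihood, rather than fixed by hand. The sample values are hypothetical.

```python
import math


def yeo_johnson(value, lam):
    """Apply the Yeo-Johnson power transform to one value.

    Unlike Box-Cox, the transform is defined for zero and negative inputs.
    """
    if value >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(value)
        return ((value + 1) ** lam - 1) / lam
    if abs(lam - 2) < 1e-12:
        return -math.log1p(-value)
    return -(((1 - value) ** (2 - lam)) - 1) / (2 - lam)


# Hypothetical skewed attribute values, for illustration only.
attribute = [-1.5, 0.0, 0.4, 2.0, 9.0]
print([round(yeo_johnson(v, 0.5), 3) for v in attribute])
```

With `lam = 1` the transform is the identity, so values of the parameter away from 1 indicate how strongly the attribute had to be reshaped toward symmetry.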

  • Research Article
  • Citations: 14
  • 10.3390/polym15204057
Prediction of Tribological Properties of UHMWPE/SiC Polymer Composites Using Machine Learning Techniques.
  • Oct 11, 2023
  • Polymers
  • Abdul Jawad Mohammed + 2 more

Polymer composites are a class of materials gaining considerable attention in demanding tribological applications because their performance can be manipulated by changing various factors, such as processing parameters, types of fillers, and operational parameters. Hence, a number of samples under different conditions need to be repeatedly produced and tested in order to satisfy the requirements of an application. However, with the advent of the new field of triboinformatics, a scientific discipline that uses computer technology to collect, store, analyze, and evaluate tribological properties, we now have access to a variety of high-end tools, such as machine learning (ML) techniques, which can significantly aid in efficiently gauging a polymer's characteristics without the need to invest time and money in physical experimentation. The development of an accurate model specifically for predicting the properties of the composite would not only make product testing cheaper, but also bolster the production rates of a very strong polymer combination. Hence, in the current study, the performance of five different ML techniques is evaluated for accurately predicting the tribological properties of ultrahigh molecular-weight polyethylene (UHMWPE) polymer composites reinforced with silicon carbide (SiC) nanoparticles. Three input parameters, namely the applied pressure, the holding time, and the concentration of SiC, are considered, with the specific wear rate (SWR) and coefficient of friction (COF) as the two output parameters. The five techniques used are support vector machines (SVMs), decision trees (DTs), random forests (RFs), k-nearest neighbors (KNNs), and artificial neural networks (ANNs). 
Three statistical evaluation metrics, namely the coefficient of determination (R2-value), mean absolute error (MAE), and root mean square error (RMSE), are used to evaluate and compare the performances of the different ML techniques. Based upon the experimental dataset, the SVM technique was observed to yield the lowest error rates (for COF, an RMSE of 2.09 × 10⁻⁴ and an MAE of 2 × 10⁻⁴; for SWR, an RMSE of 2 × 10⁻⁴ and an MAE of 1.6 × 10⁻⁴) and the highest R2-values, 0.9999 for COF and 0.9998 for SWR. The observed performance metrics show the SVM to be the most reliable technique for predicting the tribological properties of the polymer composites, with an accuracy of 99.99% for COF and 99.98% for SWR.
