Unraveling the sensory metabolome of blueberries: An integrated metabolomics and machine learning approach across cultivars and geographical origins.

Similar Papers
  • Research Article
  • Cited by 12
  • 10.3390/en15093242
The Application of Machine Learning Methods to Predict the Power Output of Internal Combustion Engines
  • Apr 28, 2022
  • Energies
  • Ruomiao Yang + 2 more

The indicated mean effective pressure (IMEP) is a key parameter for measuring the power output of an internal combustion engine (ICE). This indicator can be used to locate the high efficiency regions of engines. Therefore, it makes sense to predict the IMEP based on the machine learning (ML) approaches. However, different ML models are applicable to different scenarios, so it is important to choose the right model for prediction. The objective of this paper was to compare three ML models’ (ANN, SVR, RF) predictive performance in forecasting IMEP indicator with the input parameters spark timing (ST), speed and load. A validated one-dimensional (1D) computational fluid dynamics (CFD) model was employed to provide 756 sets of data for the training, validation, and testing of the model. The results indicated that the random forest (RF) model had the worst prediction performance, and support vector regression (SVR) had a slightly better prediction performance than the artificial neural network (ANN), at least for the investigations in this study. Overall, the ANN and SVR models showed good predictive performance for IMEP, as the coefficient of determination (R2) was close to unity, and the root mean squared error (RMSE) was close to zero. Whereas the overall prediction results of the RF model are acceptable, the RF model does not learn well for some internal engine laws.
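The two metrics quoted in this abstract, R2 and RMSE, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the sample IMEP values are hypothetical and exist only to show the arithmetic.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical IMEP values (illustration only, not data from the paper).
observed = [4.0, 6.0, 8.0, 10.0]
predicted = [4.1, 5.9, 8.2, 9.8]
print(rmse(observed, predicted), r2(observed, predicted))
```

A model that fits well drives RMSE toward zero and R2 toward one, which is the pattern the abstract reports for the ANN and SVR models.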

  • Research Article
  • Cited by 10
  • 10.1177/03611981221128812
Traffic Conflict Prediction at Signal Cycle Level Using Bayesian Optimized Machine Learning Approaches
  • Oct 29, 2022
  • Transportation Research Record: Journal of the Transportation Research Board
  • Lai Zheng + 2 more

This study develops non-parametric models to predict traffic conflicts at signalized intersections at the signal cycle level using machine learning approaches. Three different datasets were collected, one from Surrey, Canada, and the other two from Los Angeles and Georgia, U.S.A. From the datasets, traffic conflicts measured by modified time to collision and traffic parameters such as traffic volume, shockwave area, platoon ratio, and shockwave speed were extracted. Multilayer perceptron (MLP), support vector regression (SVR), and random forest (RF) models were developed based on the Surrey dataset, and the Bayesian optimization approach was adopted to optimize the model hyperparameters. The optimized models were applied to the Los Angeles and Georgia datasets to test their transferability, and they were also compared to a traditional safety performance function (SPF) developed using negative binomial regression. The results show that all three Bayesian optimized machine learning models have high predictive accuracy and acceptable transferability, and that the MLP model is slightly better than the SVR and RF models. In addition, all three models outperform the traditional SPF with regard to predictive accuracy. The model sensitivity analysis also shows that the traffic volume and shockwave area have positive effects on traffic conflicts, while the platoon ratio has negative effects.

  • Research Article
  • Cited by 13
  • 10.3390/rs14122800
Comparison of Machine Learning-Based Snow Depth Estimates and Development of a New Operational Retrieval Algorithm over China
  • Jun 10, 2022
  • Remote Sensing
  • Jianwei Yang + 6 more

Snow depth estimation with passive microwave (PM) remote sensing is challenged by spatial variations in the Earth’s surface, e.g., snow metamorphism, land cover types, and topography. Thus, traditional static snow depth retrieval algorithms cannot capture snow thickness well. In this study, we present a new operational retrieval algorithm, hereafter referred to as the pixel-based method (0.25° × 0.25° grid-level), to provide more accurate and nearly real-time snow depth estimates. First, the reference snow depth was retrieved using a previously proposed model in which a microwave snow emission model was coupled with a machine learning (ML) approach. In this process, an effective grain size (effGS) value was optimized by utilizing the snow microwave emission model, and then the nonlinear relationship between snow depth and multiple predictive variables, e.g., effGS, longitude, elevation, and brightness temperature (Tb) gradients, was established with the ML technique to retrieve reference snow depth data. To select a robust and well-performing ML approach, we compared the performance of widely used support vector regression (SVR), artificial neural network (ANN) and random forest (RF) algorithms over China. The results show that the three ML models performed similarly in snow depth estimation, which was attributed to the inclusion of effGS in the training samples. In this study, the RF model was used to retrieve the snow depth reference dataset due to its slightly stronger robustness according to our comparison of results. Second, the pixel-based algorithm was built based on the retrieved reference snow depth dataset and satellite Tb observations (18.7 GHz and 36.5 GHz) from Advanced Microwave Scanning Radiometer 2 (AMSR2) during the 2012–2020 period. For the pixel-based algorithm, the fitting coefficients were achieved dynamically pixel by pixel, making it superior to the traditional static methods. 
Third, the built pixel-based algorithm was verified using ground-based observations and was compared to the AMSR2, GlobSnow-v3.0, and ERA5-land products during the 2012–2020 period. The pixel-based algorithm exhibited an overall unbiased root mean square error (unRMSE) and R2 of 5.8 cm and 0.65, respectively, outperforming GlobSnow-v3.0, with unRMSE and R2 values of 9.2 cm and 0.22, AMSR2, with unRMSE and R2 values of 18.5 cm and 0.13, and ERA5-land, with unRMSE and R2 values of 10.5 cm and 0.33, respectively. However, the pixel-based algorithm estimates were still challenged by the complex terrain, e.g., the unRMSE was up to 17.4 cm near the Tien Shan Mountains. The proposed pixel-based algorithm in this study is a simple and operational method that can retrieve accurate snow depths based solely on spaceborne PM data in comparatively flat areas.
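The abstract reports an unbiased RMSE (unRMSE) without defining it. A common convention, assumed here rather than taken from the paper, removes the mean bias from the errors before computing the RMSE. The function name and sample depths below are hypothetical.

```python
import math

def unbiased_rmse(observed, estimated):
    """RMSE of the errors after subtracting their mean (the bias).

    Assumed definition: unRMSE = sqrt(mean((e_i - o_i - bias)^2)),
    which equals sqrt(RMSE^2 - bias^2).
    """
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]
    bias = sum(errors) / n
    return math.sqrt(sum((err - bias) ** 2 for err in errors) / n)

# Hypothetical snow depths in cm (illustration only).
ground = [10.0, 20.0, 30.0]
retrieved = [12.0, 20.0, 34.0]
print(unbiased_rmse(ground, retrieved))
```

Under this definition a retrieval that is offset by a constant amount everywhere has an unRMSE of zero, so the metric isolates random error from systematic bias.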

  • Research Article
  • Cited by 2
  • 10.1007/s44187-024-00253-x
Leveraging machine learning techniques to analyze nutritional content in processed foods
  • Dec 19, 2024
  • Discover Food
  • K A Muthukumar + 2 more

The global shift towards plant-based diets, particularly in India, is driven by environmental and ethical considerations. While plant foods are often regarded as more sustainable, concerns persist regarding protein quality, especially after processing. With protein deficiencies being prevalent among Indians, it is crucial to understand the impact of food processing on nutrient retention. This research integrates machine learning with food science to develop a comprehensive AI framework for forecasting the protein content of various plant-based sources following both traditional and non-conventional processing methods. A robust database was compiled using sources such as Web of Science, Scopus, PubMed, and Google Scholar, covering a wide range of plant-based foods and their protein content before and after processing. After data preprocessing, two primary machine learning algorithms were employed: Support Vector Regression (SVR) and Random Forest (RF), both implemented using Scikit-learn. The SVR model was optimized to identify the best-fitting hyperplane in high-dimensional space, while the RF model utilized GridSearchCV for hyperparameter tuning and performed a “Feature Importance Analysis” to identify key factors influencing the outcomes. Model performance was evaluated using Normalized Mean Squared Error (NMSE) as the evaluation metric. The results indicated that the RF model achieved an NMSE of approximately 0.35, reflecting a moderate level of prediction error relative to data variance. In contrast, the SVR model significantly outperformed the RF model, with an NMSE of approximately 0.03, demonstrating superior accuracy and efficiency in predicting nutrient retention. This study leverages machine learning to bridge a critical gap in understanding nutrient retention in plant-based foods during processing. The findings reveal that the SVR model is particularly effective in predicting nutrient retention, outperforming the RF model. 
This novel approach holds significant potential to optimize nutrient retention in plant-based food products, offering important implications for public health and food quality.
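The abstract describes NMSE as prediction error relative to data variance. One common reading, assumed here since the abstract gives no formula, divides the mean squared error by the variance of the observed values. The function name and sample values are hypothetical.

```python
def nmse(y_true, y_pred):
    """Normalized MSE: mean squared error divided by the variance of y_true.

    Assumed definition; papers differ in how they normalize.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    variance = sum((t - mean_t) ** 2 for t in y_true) / n
    return mse / variance

# Hypothetical protein contents (g per 100 g), illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
print(nmse(measured, [2.5, 2.5, 2.5, 2.5]))  # predicting the mean everywhere
print(nmse(measured, measured))              # perfect prediction
```

With this normalization, NMSE equals 1 - R2, so a model that always predicts the mean scores 1 and a perfect model scores 0.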

  • Research Article
  • Cited by 7
  • 10.3390/agronomy13041079
A Study of a Model for Predicting Pneumatic Subsoiling Resistance Based on Machine Learning Techniques
  • Apr 7, 2023
  • Agronomy
  • Xia Li + 5 more

In order to explore the drag reduction mechanism of pneumatic subsoiling and study the influence of pneumatic subsoiling on the soil, this study used machine learning models to predict the working resistance of a pneumatic subsoiler and adopted random forest (RF), error back-propagation (BP), eXtreme gradient boosting (XGBoost) and support vector regression (SVR) to analyze and compare the predictions of these four models. Field experiments were carried out in two fields with different bulk densities and moisture content. The effects of these parameters on the resistance of pneumatic subsoiling were studied by changing the working air pressure, depth and forward speed. In the RF, SVR, XGBoost and BP models, five parameters (working air pressure, working depth, forward speed, bulk density and moisture content) were inputted as independent variables, and the operating resistance of pneumatic subsoiling was used as the predicted value. After training the four models, the results showed that the R2 value of the RF model was the highest and the error was the smallest, which made it better than the SVR, XGBoost and BP models. The values of MAPE, R2 and RMSE for the RF model’s test set were 0.01, 0.99, and 3.61 N, respectively, indicating that the RF model could predict the resistance value of subsoiling well. When the RF model was used to analyze the five input parameters, the experimental results showed that the contribution of working air pressure to reducing the resistance of subsoiling reached 29%, indicating that pneumatic subsoiling can reduce the resistance, drag and consumption.

  • Research Article
  • Cited by 16
  • 10.1016/j.conbuildmat.2023.130321
Optimized machine learning approaches for identifying vertical temperature gradient on ballastless track in natural environments
  • Jan 16, 2023
  • Construction and Building Materials
  • Tao Shi + 1 more


  • Research Article
  • Cited by 14
  • 10.5664/jcsm.9630
Obstructive sleep apnea predicts 10-year cardiovascular disease-related mortality in the Sleep Heart Health Study: a machine learning approach.
  • Aug 26, 2021
  • Journal of clinical sleep medicine : JCSM : official publication of the American Academy of Sleep Medicine
  • Ao Li + 3 more

Obstructive sleep apnea (OSA) is considered to be an important risk factor for the development of cardiovascular disease (CVD). This study aimed to develop and evaluate a machine learning approach with a set of features for assessing the 10-year CVD mortality risk of the OSA population. This study included 2,464 patients with OSA who met study inclusion criteria and were selected from the Sleep Heart Health Study. We evaluated the importance of potential features by mutual information. The top 9 features were selected to develop a random forest model. We evaluated the model performance on a test set (n = 493) using the area under the receiver operating curve with 95% confidence interval and confusion matrix. A random forest model achieved the highest area under the receiver operating curve of 0.84 (95% confidence interval: 0.78-0.89). The specificity and sensitivity were 73.94% and 81.82%, respectively. An age of 63 years was a threshold for increased risk of 10-year CVD mortality. Persons with severe OSA had higher risk than those with mild OSA. This study demonstrated that a random forest model can provide a quick assessment of the risk of 10-year CVD mortality. Our model may be more informative for patients with OSA in determining their future CVD mortality risk. Li A, Roveda JM, Powers LS, Quan SF. Obstructive sleep apnea predicts 10-year cardiovascular disease-related mortality in the Sleep Heart Health Study: a machine learning approach. J Clin Sleep Med. 2022;18(2):497-504.
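Sensitivity and specificity, as quoted in this abstract, are read directly off a binary confusion matrix. The sketch below shows the arithmetic only: the function name is hypothetical and the counts are made up, not the study's test-set counts.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from binary confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true cases that were flagged.
    specificity = TN / (TN + FP): share of non-cases that were cleared.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical counts (illustration only).
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```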

  • Research Article
  • Cited by 1
  • 10.20879/acr.2022.19.3.101
When Machine Learning Meets Social Science: A Comparative Study of Ordinary Least Square, Stochastic Gradient Descent, and Support Vector Regression for Exploring the Determinants of Behavioral Intentions to Tuberculosis Screening
  • Dec 30, 2022
  • Asian Communication Research
  • Dayeoun Jang + 1 more

Regression analysis is one of the most widely utilized methods because of its adaptability and simplicity. Recently, the machine learning (ML) approach, which is one aspect of regression methods, has been gaining attention from researchers, including social science, but there are only a few studies that compared the traditional approaches with the ML approach. This study was conducted to explore the usefulness of the ML approach by comparing the ordinary least square estimate (OLS), the stochastic gradient descent algorithm (SGD), and the support vector regression (SVR) with a model predicting and explaining the tuberculosis screening intention. The optimized models were evaluated by four aspects: computational speed, effect and importance of individual predictor, and model performance. The result demonstrated that each model yielded a similar direction of effect and importance in each predictor, and the SVR with the radial kernel had the finest model performance compared to its computational speed. Finally, this study discussed the usefulness and attentive points of the ML approach when a researcher utilizes it in the field of communication.

  • Research Article
  • Cited by 3
  • 10.38016/jista.922663
Estimation of High School Entrance Examination Success Rates Using Machine Learning and Beta Regression Models
  • Mar 15, 2022
  • Journal of Intelligent Systems: Theory and Applications
  • Tuba Koc + 1 more

Education is the foundation of economic, social, and cultural development for every individual and society as a whole. Students are accepted to secondary education institutions with the high school entrance examination made by the Ministry of National Education in Turkey. In this study, the success rates of the students who took the high school entrance examination in Turkey's 81 provinces in 2019 were handled with the machine learning regression and beta regression model. The present paper aimed to model, predict, and explain students' success rates using variables such as divorce rate, gross domestic product, illiteracy, and higher education populations. Support vector regression, random forest, decision tree, and beta regression model were applied to estimate success rates. Two models with the highest R2 value were found to be beta regression and random forest models. When the prediction errors of beta regression and random forest model were examined, it seemed to be that the random forest model is relatively superior to the beta regression model in predicting the success rates. While the beta regression model was the best predictor of the success rates of Çanakkale province, the random forest model predicted the success rates of Ankara well. Also, it was seen that the variables found to be significant in the beta regression model for success rates were also crucial in the random forest model. It is recommended to use both the beta and random forest models to estimate the students' success rates.

  • Research Article
  • Cited by 36
  • 10.3389/fpsyt.2021.626677
Improving Individual Brain Age Prediction Using an Ensemble Deep Learning Framework
  • Mar 23, 2021
  • Frontiers in Psychiatry
  • Chen-Yuan Kuo + 9 more

Brain age is an imaging-based biomarker with excellent feasibility for characterizing individual brain health and may serve as a single quantitative index for clinical and domain-specific usage. Brain age has been successfully estimated using extensive neuroimaging data from healthy participants with various feature extraction and conventional machine learning (ML) approaches. Recently, several end-to-end deep learning (DL) analytical frameworks have been proposed as alternative approaches to predict individual brain age with higher accuracy. However, the optimal approach to select and assemble appropriate input feature sets for DL analytical frameworks remains to be determined. In the Predictive Analytics Competition 2019, we proposed a hierarchical analytical framework which first used ML algorithms to investigate the potential contribution of different input features for predicting individual brain age. The obtained information then served as a priori knowledge for determining the input feature sets of the final ensemble DL prediction model. Systematic evaluation revealed that ML approaches with multiple concurrent input features, including tissue volume and density, achieved higher prediction accuracy when compared with approaches with a single input feature set [Ridge regression: mean absolute error (MAE) = 4.51 years, R2 = 0.88; support vector regression, MAE = 4.42 years, R2 = 0.88]. Based on this evaluation, a final ensemble DL brain age prediction model integrating multiple feature sets was constructed with reasonable computation capacity and achieved higher prediction accuracy when compared with ML approaches in the training dataset (MAE = 3.77 years; R2 = 0.90). Furthermore, the proposed ensemble DL brain age prediction model also demonstrated sufficient generalizability in the testing dataset (MAE = 3.33 years). 
In summary, this study provides initial evidence of how-to efficiency for integrating ML and advanced DL approaches into a unified analytical framework for predicting individual brain age with higher accuracy. With the increase in large open multiple-modality neuroimaging datasets, ensemble DL strategies with appropriate input feature sets serve as a candidate approach for predicting individual brain age in the future.

  • Research Article
  • Cited by 41
  • 10.3390/rs12091470
A Comparison of Estimating Crop Residue Cover from Sentinel-2 Data Using Empirical Regressions and Machine Learning Methods
  • May 6, 2020
  • Remote Sensing
  • Yanling Ding + 6 more

Quantifying crop residue cover (CRC) on field surfaces is important for monitoring the tillage intensity and promoting sustainable management. Remote-sensing-based techniques have proven practical for determining CRC; however, the methods used are primarily limited to empirical regression based on crop residue indices (CRIs). This study provides a systematic evaluation of empirical regressions and machine learning (ML) algorithms based on their ability to estimate CRC using Sentinel-2 Multispectral Instrument (MSI) data. Unmanned aerial vehicle orthomosaics were used to extract ground CRC for training Sentinel-2 data-based CRC models. For empirical regression, nine MSI bands, 10 published CRIs, three proposed CRIs, and four mean textural features were evaluated using univariate linear regression. The best performance was obtained by a three-band index calculated using (B2 − B4)/(B2 − B12), with an R2cv of 0.63 and RMSEcv of 6.509%, using a 10-fold cross-validation. The methodologies of partial least squares regression (PLSR), artificial neural network (ANN), Gaussian process regression (GPR), support vector regression (SVR), and random forest (RF) were compared with four groups of predictors, including nine MSI bands, 13 CRIs, a combination of MSI bands and mean textural features, and a combination of CRIs and textural features. In general, ML approaches achieved high accuracy. A PLSR model with 13 CRIs and textural features resulted in an accuracy of R2cv = 0.66 and RMSEcv = 6.427%. An RF model with predictors of MSI bands and textural features estimated CRC with an R2cv = 0.61 and RMSEcv = 6.415%. The estimation was improved by an SVR model with the same input predictors (R2cv = 0.67, RMSEcv = 6.343%), followed by a GPR model based on CRIs and textural features. The performance of GPR models was further improved by optimal input variables. 
A GPR model with six input variables, three MSI bands and three textural features, performed the best, with R2cv = 0.69 and RMSEcv = 6.149%. This study provides a reference for estimating CRC from Sentinel-2 imagery using ML approaches. The GPR approach is recommended. A combination of spectral information and textural features leads to an improvement in the retrieval of CRC.
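The three-band index quoted in this abstract, (B2 − B4)/(B2 − B12), is a simple ratio of band differences. The sketch below computes it for one pixel. The function name and the reflectance values are hypothetical, and the zero-denominator check is an added safeguard rather than something the abstract describes.

```python
def three_band_index(b2, b4, b12):
    """Compute (B2 - B4) / (B2 - B12) from Sentinel-2 band reflectances."""
    denominator = b2 - b12
    if denominator == 0:
        # The index is undefined when B2 equals B12.
        raise ValueError("B2 and B12 are equal; index is undefined")
    return (b2 - b4) / denominator

# Hypothetical surface reflectances for one pixel (illustration only).
print(three_band_index(b2=0.10, b4=0.15, b12=0.30))
```

In the study this index served as the single predictor in a univariate linear regression against ground CRC; the regression coefficients are not given in the abstract, so they are not reproduced here.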

  • Preprint Article
  • Cited by 1
  • 10.5194/egusphere-egu2020-4233
Using a boundary-corrected wavelet transform coupled with machine learning and hybrid deep learning approaches for multi-step water level forecasting in Lakes Michigan and Ontario
  • Mar 23, 2020
  • Rahim Barzegar + 3 more

Accurate water level (WL) forecasting is important for water resources management and planning purposes in the Great Lakes. The objectives of this research are two-fold. The first objective is to apply machine learning (ML) (i.e., random forest (RF) and support vector regression (SVR)) and hybrid convolutional neural network (CNN)-long-short term memory (LSTM) deep learning (DL) models for multi-step (i.e., one-, two- and three-monthly step ahead) WL forecasting in the Great Lakes (Michigan and Ontario). The second objective is to integrate the boundary corrected (BC) maximal overlap discrete wavelet transform (MODWT) with SVR, RF, and CNN-LSTM models to improve the performance of the individual models. By employing a BC-wavelet decomposition method, the ‘future data’ issue (i.e., data from the future that is not available), often overlooked in the literature and a major barrier to achieving realistic forecasting performance, is overcome.
For Lakes Michigan and Ontario, 1212 monthly WL (m) records (spanning Jan 1918–Dec 2018) were used to develop the models. For the non-wavelet-based models (SVR, RF, and CNN-LSTM), candidate model inputs included the WL recorded over the previous 12 months. For the BC-MODWT-based models (BC-MODWT-SVR, BC-MODWT-RF, and BC-MODWT-CNN-LSTM), the lagged input time series were decomposed into BC-wavelet and scaling coefficients by using different mother wavelets (Haar, Daubechies, Symlets, Fejer-Korovkin and Coiflets), filter lengths (from two up to 12) and decomposition levels (from one up to seven). For each method (SVR, RF, and CNN-LSTM), mother wavelet, and decomposition level a model was generated. For both wavelet- and non-wavelet-based models, the particle swarm optimization (PSO) method was used to select the most appropriate inputs to include in the proposed multi-step WL forecasting models.
The datasets were partitioned into calibration and validation subsets. After calibrating the models, various performance evaluation metrics, e.g., coefficient of determination (R2), root mean square error (RMSE), mean absolute error (MAE), root mean square percentage error (RMSPE), mean absolute percentage error (MAPE) and the Nash-Sutcliffe efficiency coefficient (NSC) were used to assess model accuracy.
Of the ML models, the SVR outperformed RF while the DL models outperformed the ML models for each forecast lead time (one-, two-, and three-step(s) ahead). Results from this case study indicate that not all wavelet families and decomposition levels perform equally and in some cases, the wavelet-based models do not improve performance over the non-wavelet-based models. However, the BC-MODWT-CNN-LSTM using suitable mother wavelets (e.g., Haar) outperforms the individual ML and BC-MODWT-ML-based models. More accurate forecasts were obtained for Lake Michigan although the performance in both Great Lakes was accurate. The outcomes of this research indicate that the BC-MODWT-CNN-LSTM model is a promising tool for generating accurate WL forecasts.
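The non-wavelet models in this abstract take the water level over the previous 12 months as candidate inputs and forecast one to three months ahead. A minimal sketch of how such lagged samples could be built from a monthly series follows. The function name, its parameters, and the toy series are hypothetical, and the sketch does not cover the wavelet decomposition or the PSO input selection.

```python
def make_lagged_samples(series, n_lags=12, horizon=1):
    """Build (inputs, targets) pairs from a time series.

    Each input holds the n_lags values immediately before time t; the
    target is the value horizon steps after the last input value.
    """
    inputs, targets = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        inputs.append(series[t - n_lags:t])
        targets.append(series[t + horizon - 1])
    return inputs, targets

# Toy series standing in for monthly water levels (illustration only).
toy_series = list(range(20))
x_one, y_one = make_lagged_samples(toy_series, n_lags=12, horizon=1)
x_three, y_three = make_lagged_samples(toy_series, n_lags=12, horizon=3)
print(len(x_one), y_one[0], len(x_three), y_three[0])
```

Because every input window uses only values recorded before the forecast target, this construction avoids leaking future observations into the predictors.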

  • Conference Article
  • Cited by 2
  • 10.2523/iptc-23110-ea
Improved Reservoir Rock Porosity Prediction from Advanced Mud Gas Data
  • Feb 12, 2024
  • S Badawood + 1 more

In our continued effort to extend the utility of advanced mud gas (AMG) data from the traditional fluid typing to reservoir rock properties prediction, this study investigates the feasibility of predicting the full porosity log for the hydrocarbon-bearing zones of wells ahead of wireline logging and core analysis processes. Our previous incremental results have confirmed the successful prediction of missing porosity logs in an interval within the borehole and in a section of a field. We established the linear correlation between porosity and Total Gas (TG) to confirm the hypothesis. Leveraging the capability of machine learning (ML) algorithms to recognize hidden patterns in data, we developed artificial neural network (ANN), decision trees (DT), and random forest (RF) models. We collected over 20,000 data points from representative wells in the study area, used 90% for training and optimizing the models, 10% for testing, and five wells for blind validation. A cut-off of 500 ppm was applied on the total gas to remove background gas effects and focus on the hydrocarbon-bearing zones. Using statistical model performance evaluation metrics comprising correlation coefficient (R2) and mean squared error (MSE), we compared the results of the ANN, DT, and RF models. The RF model consistently outperformed the others based on the training, testing, and validation metrics. Using the original AMG data, the least-performing RF model gave an R2 value of 0.78 for training, 0.76 for validation, and MSE of 0.014 for full-well blind testing. After applying the cut-off, the performance of all the models improved significantly, while the RF model maintained its best performance. With this improvement, the least-performing RF model gave an R2 value of 0.98 for training, 0.89 for validation, and MSE of 0.003 for full-well blind testing. 
Considering the outcome of our previous studies, these results have further confirmed the robustness of nonlinear solutions based on the ML methodology. It can be concluded that the ML approach for predicting reservoir rock porosity from AMG data acquired in real time is feasible, though with room for improvement. This study also confirms the benefit of focusing on the productive zone by applying a cut-off on the TG. This study will contribute to the objectives of digital transformation in the petroleum exploration industry by (1) expanding the utility of existing data without extra cost, (2) utilizing real-time data such as the AMG to predict rock properties in real time for better decision, (3) providing more information to optimize reservoir contact while drilling, and (4) providing information to determine reservoir quality at the early stage of well development.

  • Research Article
  • Cited by 1
  • 10.3390/w17030359
Predicting Post-Wildfire Stream Temperature and Turbidity: A Machine Learning Approach in Western U.S. Watersheds
  • Jan 27, 2025
  • Water
  • Junjie Chen + 1 more

Wildfires significantly impact water quality in the Western United States, posing challenges for water resource management. However, limited research quantifies post-wildfire stream temperature and turbidity changes across diverse climatic zones. This study addresses this gap by using Random Forest (RF) and Support Vector Regression (SVR) models to predict post-wildfire stream temperature and turbidity based on climate, streamflow, and fire data from the Clackamas and Russian River Watersheds. We selected Random Forest (RF) and Support Vector Regression (SVR) because they handle non-linear, high-dimensional data, balance accuracy with efficiency, and capture complex post-wildfire stream temperature and turbidity dynamics with minimal assumptions. The primary objectives were to evaluate model performance, conduct sensitivity analyses, and project mid-21st century water quality changes under Representative Concentration Pathway (RCP) 4.5 and 8.5 scenarios. Sensitivity analyses indicated that 7-day maximum air temperature and discharge were the most influential predictors. Results show that RF outperformed SVR, achieving an R2 of 0.98 and root mean square error of 0.88 °C for stream temperature predictions. Post-wildfire turbidity increased up to 70 NTU during storm events in highly burned subwatersheds. Under RCP 8.5, stream temperatures are projected to rise by 2.2 °C by 2050. RF’s ensemble approach captured non-linear relationships effectively, while SVR excelled in high-dimensional datasets but struggled with temporal variability. These findings underscore the importance of using machine learning for understanding complex post-fire hydrology. We recommend adaptive reservoir operations and targeted riparian restoration to mitigate warming trends. This research highlights machine learning’s utility for predicting post-wildfire impacts and informing climate-resilient water management strategies.

  • Research Article
  • Cited by 1
  • 10.46717/igj.57.2e.2ms-2024-11-11
Random Forest and Decision Tree Facies Classification Models for Well Log Data of the Mishrif Formation from Basrah Oil Company, Southern Iraq
  • Nov 29, 2024
  • The Iraqi Geological Journal
  • Ahmed Bichan + 1 more

Facies from wells drilled in the study area were interpreted manually from cores taken every 10 meters of depth during drilling. This core spacing does not give the true facies of all wells, because a 10-meter interval is very coarse. Extracting cores is also expensive and time-consuming. The machine learning methodology consists of four steps: data gathering, data preprocessing, model training, and model evaluation. This work applies two supervised machine learning techniques, the random forest and decision tree models. (1) Data gathering: the dataset was collected from the Basrah Oil Company. It consists of ten wells (B-3, B-4, B-5, B-15, B-17, B-18, B-19, B-34, B-39, and B-40). Every well has six features (logs): Sonic log, Resistivity Deep, Micro Spherically Focused Log, Neutron porosity, Density log, and Gamma-ray. The dataset also contains ten facies labels: mudstone, wack stone, packstone, roundstone, floatstone, shale, mud and wack stone, wack and pack stone, pack and grain stone and pack and float stone. These logs cover the full thickness of the Mishrif Formation, which is the target of the study. (2) Preprocessing: the data must be cleaned of outlier values, which reduce the accuracy of the model during training. At this stage it is also necessary to understand the relationships among all the features, because the more highly correlated two features are, the less useful they are for machine learning. Well B-5 was held out from training as a blind well to demonstrate the ability of the machine learning models to predict lithofacies; the dataset was then split randomly into 70% for training, 10% for validation, and 20% for testing to verify performance. (3) Training: two machine learning models, decision tree and random forest, were trained.
(4) Evaluation: four statistics (classification accuracy, recall, precision, and F1-score) were computed from the confusion matrix for both models, showing that the random forest was more accurate than the decision tree because the random forest model deals very well with a dataset of this size. The Receiver Operating Characteristic curves of the random forest model had a larger Area Under the Curve than those of the decision tree and lay above the main diagonal for all lithotypes; the values for all classes exceeded 95%, except class 6 (shale), because of the 100% accuracy of classification. Facies classification by the machine learning approach has two benefits: (1) increased accuracy in describing oil reservoirs and (2) reduced time needed for a geologist to interpret log data.
