Forecasting Bank Sales: The Case of PJSC Sberbank

Abstract

Introduction. This study highlights the relevance of modeling and forecasting Sberbank's sales for effective business management. A sales forecast is an important tool for predicting demand for goods and services and for determining adequate strategies and tactics to achieve the company's goals. The research is distinguished by its use of artificial intelligence methods in the field of marketing. Forecasting methods applied to a proprietary sample of Sberbank's daily sales data give novel results, which reliably support the development of adequate strategies and tactics for successful business management. The key hypothesis of the study is that machine learning methods have greater prognostic potential than traditional econometric approaches to modeling Sberbank's sales. The purpose of the study is to develop sales forecasting models for multifunctional products and practical instruments based on them for Sberbank's Sales Network Block. Materials and Methods. The study relies on system-oriented analysis and on statistical and economic-mathematical methods of data analysis and processing. Collected and pre-processed sales data for Sberbank's phantom products, reflecting the dynamics of bank sales, were used in computational experiments to build several forecasting models and to justify the choice of the best among them. Results. Random Forest and Gradient Boosting (XGBRegressor) models, fitted and evaluated on training and test samples, gave forecasts with accuracy significantly higher than that of the ARIMA model and linear regression. Conclusions. The results of the analysis reliably confirm that machine learning methods are currently promising for forecasting bank sales and can be the subject of further research in this area. Machine learning techniques introduced into banking practice have the potential to significantly improve the effectiveness of sales and risk management.
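The comparison the abstract describes can be sketched in a few lines: lag features are built from a daily sales series, then tree ensembles are scored against a linear baseline. The data below are synthetic and the feature construction is a hypothetical stand-in for the paper's proprietary sample; scikit-learn's GradientBoostingRegressor is used in place of XGBRegressor to keep the sketch dependency-free.

```python
# Sketch: tree ensembles vs. a linear baseline on a synthetic daily-sales series.
# Synthetic data only; GradientBoostingRegressor stands in for XGBRegressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
t = np.arange(730)  # two years of daily observations
# Mild upward trend, weekly seasonality, and noise.
sales = 100 + 0.05 * t + 20 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, t.size)

# Lag features: yesterday's sales and sales one week earlier predict today's.
lag1, lag7 = sales[6:-1], sales[:-7]
X = np.column_stack([lag1, lag7])
y = sales[7:]

# Chronological split: train on the first 600 days, test on the rest.
split = 600
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

mae = {}
for name, model in {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}.items():
    model.fit(X_tr, y_tr)
    mae[name] = mean_absolute_error(y_te, model.predict(X_te))
```

With real bank data the feature set would be richer (calendar effects, promotions, product attributes), but the chronological split and per-model MAE comparison are the core of the experimental design the abstract reports.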

Similar Papers
  • Research Article
  • Citations: 28
  • 10.1097/tp.0000000000002923
Seeing the Forest for the Trees: Random Forest Models for Predicting Survival in Kidney Transplant Recipients.
  • May 1, 2020
  • Transplantation
  • Ruth Sapir-Pichhadze + 1 more

Risk prediction plays an important role in clinical transplantation research. Traditionally, most risk models have been based on regression models.1 Although useful to help understand relationships between predictors and outcomes, these statistical methods can typically evaluate only a small number of predictors, which are assumed to affect everyone in the same way, and uniformly throughout the participants' lifespan. These methods have several limitations,2 including the inability to analyze nonlinear relationships, the requirement of setting a level of binary significance, impracticality for analyzing large datasets, and vulnerability to bias secondary to variable selection and/or omission of relevant confounders. With the emergence of P4 (Predictive, Preventive, Personalized, and Participatory) and Precision Medicine, artificial intelligence and machine learning methods have come to attention as methods aimed at solving the challenges in analysis not well addressed by regression approaches. Machine learning methods provide algorithms to understand patterns from large, complex, and heterogeneous data.3 Of the machine learning methods, recursive partitioning, and especially random forests, can deal with large numbers of predictor variables even in the presence of complex interactions.2,4 These methods have been applied successfully in genetics, clinical research, and bioinformatics. In this issue of Transplantation, Scheffner et al report on the development and internal validation of a random forest prediction model for patient survival.5 Random forest models are composed of a collection of decision trees. In the process of building each decision tree, different random subsets of the variables from the training dataset are selected to establish how best to partition the dataset at each node.6 Random forest models are considered less vulnerable to overfitting the training dataset given the large number of trees built, making each tree an independent model. 
The lower likelihood of bias is a result of bootstrapping several trees over randomly selected subsets of variables and subsamples of data.6 Random forest models require little preprocessing of data; the data need not be normalized; and the approach is resilient to outliers. While missing data will be a challenge when trying to draw clinical inferences from standard statistical models, machine learning methods tend to make fewer assumptions about the underlying data and, thus, are less vulnerable to the challenges associated with violation of those assumptions. Relying on fewer assumptions than regression analysis, machine learning methods have been shown to deliver more robust predictions. Scheffner and colleagues5 split a retrospective cohort of kidney transplant recipients with posttransplantation protocol biopsies into training and validation datasets (Figure 2A and B). Using all pretransplant and 3- and 12-month posttransplant variables, the models obtained showed good performance in predicting death (concordance index: 0.77–0.78). Validation showed a concordance index of 0.76 and good discrimination of risks by the models, despite substantial differences in clinical variables and the derivation dataset representing an earlier era (2000–2007) than the validation dataset (2008–2013). 
To contrast with outputs of multivariable regression models using the same datasets, see Tables 2 and 3 and nomograms predicting mortality risk using estimators from multivariable Cox models (Figure 3) in Abeling et al.7 Random survival forests also inform on the importance of descriptive variables.6 Scheffner found the potentially modifiable (and highly correlated) graft rejection treatment and urinary tract infection to be important predictors of patient survival in addition to established factors like age, cardiovascular disease, diabetes, and graft function (Figure 3A and B).5 Many of the predictors retained in multivariable regression models7 were also deemed important in random forest survival analyses.5 To validate selected predictors and model construction, it is important to pursue external validation with independent datasets. Random survival forests may complement regression analyses when handling highly correlated complex survival data. Opportunities for application (and limitations) of each of the regression and random survival forests for prediction are summarized in Table 1 (Regression and random survival forests for survival analysis). Predictive models in transplantation and donation help risk stratify patients and could improve quality of healthcare delivery as well as patient outcomes. The increasing interest in these tools warrants a better understanding of their challenges and limitations.8 First, highly predictive variables may not necessarily be causally related to the outcomes of interest. Second, the success of machine learning models depends on the relationship between predictors and outcome being represented in training/validation datasets, the number of observations and features, selection and parameterization of features, and the algorithm chosen for the model. Careful variable definition (eg, urinary tract infection) is necessary. 
Presence of highly correlated linear and nonlinear relationships between independent variables may warrant mechanisms for removal of the correlated variables. Model performance may also be compromised when studying rare outcomes.4 Inevitably, generalizability of machine learning models may be limited when the clinical context, local factors (including patient/physician preferences, health systems, and care standards), and therapeutic strategies vary. To enable assessment of model validity, correct interpretation of model outputs, replication, and future knowledge synthesis, it is vital that the transplantation and donation community promote adherence to guidelines on the dissemination and reporting of machine learning models.8,9 Authors should be encouraged to report all model parameters, transformations applied to raw data, sampling methods, and random number generator seeds. Whenever possible, algorithms and associated code should be released in public software archive domains. There is a need for new models of health data ownership with rights to the individual, highly secure data repositories, government legislation for data sharing, and usage policies to ensure privacy and data security. Moreover, with wide uptake of machine learning and artificial intelligence tools, the scale of iatrogenic risks and liabilities related to their application, in contrast to the implications of a single doctor's mistake for a given patient, also warrant assessment.10 Most practice guidelines are geared toward the "average patient." Machine learning tools can capture the complexity of individual patients' characteristics and aid transplant clinicians with patient-specific care decisions. As these tools become more prevalent, it is important to develop best practice guidelines and ensure there is regulatory oversight on their development and application.
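The mechanics this editorial describes (many bootstrapped trees over random variable subsets, built-in variable importance) can be illustrated with a small synthetic example. The feature names and data below are hypothetical, for illustration only; they do not reproduce the Scheffner et al cohort.

```python
# Sketch: random-forest bootstrapping and feature importance on synthetic
# "clinical" data. Variable names (age, eGFR) are hypothetical illustrations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 1000
age = rng.normal(55, 10, n)
egfr = rng.normal(60, 15, n)       # crude proxy for graft function
noise = rng.normal(0, 1, (n, 3))   # three uninformative covariates

# Outcome driven only by age and graft function.
risk = 0.06 * (age - 55) - 0.05 * (egfr - 60)
y = (risk + rng.normal(0, 1, n) > 0).astype(int)

X = np.column_stack([age, egfr, noise])
# oob_score=True uses the out-of-bootstrap samples of each tree for an
# internal accuracy estimate, a byproduct of the bagging described above.
clf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
clf.fit(X, y)

importances = dict(zip(["age", "egfr", "noise1", "noise2", "noise3"],
                       clf.feature_importances_))
```

The two informative variables receive clearly higher importance scores than the noise columns, which is the same mechanism random survival forests use to rank descriptive variables in the transplant analyses discussed here.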

  • Research Article
  • Citations: 2
  • 10.1080/00015385.2025.2481662
Application of an interpretable machine learning method to predict the risk of death during hospitalization in patients with acute myocardial infarction combined with diabetes mellitus
  • Apr 7, 2025
  • Acta Cardiologica
  • Zhijun Bu + 12 more

Background Predicting the prognosis of patients with acute myocardial infarction (AMI) combined with diabetes mellitus (DM) is crucial due to high in-hospital mortality rates. This study aims to develop and validate a mortality risk prediction model for these patients by interpretable machine learning (ML) methods. Methods Data were sourced from the Medical Information Mart for Intensive Care IV (MIMIC-IV, version 2.2). Predictors were selected by Least absolute shrinkage and selection operator (LASSO) regression and checked for multicollinearity with Spearman’s correlation. Patients were randomly assigned to training and validation sets in an 8:2 ratio. Seven ML algorithms were used to construct models in the training set. Model performance was evaluated in the validation set using metrics such as area under the curve (AUC) with 95% confidence interval (CI), calibration curves, precision, recall, F1 score, accuracy, negative predictive value (NPV), and positive predictive value (PPV). The significance of differences in predictive performance among models was assessed utilising the permutation test, and 10-fold cross-validation further validated the model’s performance. SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) were applied to interpret the models. Results The study included 2,828 patients with AMI combined with DM. Nineteen predictors were identified through LASSO regression and Spearman’s correlation. The Random Forest (RF) model demonstrated the best performance, with an AUC of 0.823 (95% CI: 0.774–0.872), high precision (0.867), accuracy (0.873), and PPV (0.867). The RF model showed significant differences (p < 0.05) compared to the K-Nearest Neighbours and Decision Tree models. Calibration curves indicated that the RF model’s predicted risk aligned well with actual outcomes. 10-fold cross-validation confirmed the superior performance of RF model, with an average AUC of 0.828 (95% CI: 0.800–0.842). 
Significant Variables in RF model indicated that the top eight significant predictors were urine output, maximum anion gap, maximum urea nitrogen, age, minimum pH, maximum international normalised ratio (INR), mean respiratory rate, and mean systolic blood pressure. Conclusion This study demonstrates the potential of ML methods, particularly the RF model, in predicting in-hospital mortality risk for AMI patients with DM. The SHAP and LIME methods enhance the interpretability of ML models.
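The core pipeline of this abstract (LASSO-based predictor selection, an 8:2 split, a random-forest classifier scored by AUC) can be sketched as follows. The data are synthetic; none of the MIMIC-IV variables are reproduced, and the SHAP/LIME interpretation step is omitted to keep dependencies minimal.

```python
# Sketch: LASSO feature selection followed by a random-forest classifier
# evaluated with AUC, on synthetic data (not the MIMIC-IV cohort).
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n, p = 800, 20
X = rng.normal(size=(n, p))
# Only the first five covariates carry signal in this toy setup.
logit = X[:, :5] @ np.array([1.0, -0.8, 0.6, 0.5, -0.4])
y = (logit + rng.normal(0, 1, n) > 0).astype(int)

# Step 1: LASSO keeps the predictors with nonzero coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# Step 2: 8:2 split and RF on the selected predictors, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(
    X[:, selected], y, test_size=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```

In practice a multicollinearity check (the paper uses Spearman's correlation) would be run between steps 1 and 2, dropping one member of each highly correlated pair before model fitting.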

  • Research Article
  • Citations: 4
  • 10.1186/s41043-024-00647-8
Prediction and feature selection of low birth weight using machine learning algorithms
  • Oct 12, 2024
  • Journal of Health, Population and Nutrition
  • Tasneem Binte Reza + 1 more

Background and aims: The birth weight of a newborn is a crucial factor that affects their overall health and future well-being. Low birth weight (LBW) is a widespread global issue, which the World Health Organization defines as weighing less than 2,500 g. LBW can have severe negative consequences on an individual’s health, including neonatal mortality and various health concerns throughout their life. To address this problem, this study was conducted using BDHS 2017–2018 data to uncover important aspects of LBW using a variety of machine learning (ML) approaches and to determine the best feature selection technique and best predictive ML model. Methods: To pick out the key features, the Boruta algorithm and wrapper method were used. Logistic Regression (LR) was used as the traditional method, and several machine learning classifiers were then applied, including DT (Decision Tree), SVM (Support Vector Machine), NB (Naïve Bayes), RF (Random Forest), XGBoost (eXtreme Gradient Boosting), and AdaBoost (Adaptive Boosting), to determine the best model for predicting LBW. The models’ performance was evaluated based on specificity, sensitivity, accuracy, F1 score and AUC value. Results: The Boruta algorithm identified eleven significant features, including respondent’s age, highest education level, educational attainment, wealth index, age at first birth, weight, height, BMI, age at first sexual intercourse, birth order number, and whether the child is a twin. Incorporating the Boruta algorithm’s significant features, the performance of traditional LR and the ML methods DT, SVM, NB, RF, XGBoost, and AB was evaluated; LR had a specificity, sensitivity, accuracy and F1 score of 0.85, 0.5, 85.15% and 0.915, while the ML models’ respective accuracy values were 85.35%, 85.15%, 84.54%, 81.18%, and 84.41%. 
Based on specificity, sensitivity, accuracy, F1 score and AUC, RF (specificity = 0.99, sensitivity = 0.58, accuracy = 85.86%, F1 score = 0.9243, AUC = 0.549) outperformed the other methods. The performance of both the classical (LR) and machine learning (ML) models improved dramatically when important characteristics were extracted using the wrapper method. The LR method identified five significant features with a specificity, sensitivity, accuracy and F1 score of 0.87, 0.33, 87.12% and 0.9309. The region, whether the infant is a twin, and cesarean delivery were the three key features discovered by the DT and RF models implemented using the wrapper technique. All three models had an identical F1 score of 0.9318. However, “child is twin” was recognized as a significant feature by the SVM, NB, and AB models, with an F1 score of 0.9315. Ultimately, with an F1 score of 0.9315, the XGBoost model recognized “child is twin” and “age at first sex” as relevant features. Random Forest again beat the other approaches in this instance. Conclusions: The study reveals the wrapper method as the optimal feature selection technique. The ML methods outperform traditional methods, with Random Forest (RF) being the most effective predictive model for low-birth-weight prediction. The study suggests that policymakers in Bangladesh can reduce the incidence of low-birth-weight newborns by considering the identified risk factors.
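Wrapper-style feature selection, as used in this study, repeatedly fits a model and keeps only the features it relies on. A minimal sketch under stated assumptions: recursive feature elimination (RFE) stands in for the paper's wrapper method, the Boruta package is not used to keep dependencies minimal, and the data and the "only three covariates matter" setup are entirely synthetic.

```python
# Sketch: wrapper-style feature selection (RFE around logistic regression)
# followed by random-forest classification. Synthetic data only.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 600
X = rng.normal(size=(n, 10))
# Hypothetical: only the first three covariates drive the outcome.
y = (X[:, 0] + 0.8 * X[:, 1] - 0.9 * X[:, 2]
     + rng.normal(0, 1, n) > 0).astype(int)

# RFE refits the inner model, dropping the weakest feature each round.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
kept = np.flatnonzero(selector.support_)

# Final model on the selected features, scored by cross-validated accuracy.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(rf, X[:, kept], y, cv=5, scoring="accuracy").mean()
```

Because the wrapper loop scores feature subsets with the model itself rather than with univariate statistics, it tends to find compact subsets like the three-feature sets the DT and RF models settled on in this study.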

  • Research Article
  • Citations: 4
  • 10.1021/acs.jctc.9b01246
Pair Potentials as Machine Learning Features.
  • Jun 19, 2020
  • Journal of chemical theory and computation
  • Jun Pei + 2 more

Atom pairwise potential functions make up an essential part of many scoring functions for protein decoy detection. With the development of machine learning (ML) tools, there are multiple ways to combine potential functions to create novel ML models and methods. Potential function parameters can be easily extracted; however, it is usually hard to directly obtain the calculated atom pairwise energies from scoring functions. Amber, as one of the most popular suites of modeling programs, has an extensive history and library of force field potential functions. In this work, we directly used the force field parameters in ff94 and ff14SB from Amber and encoded them to calculate atom pairwise energies for different interactions. Two sets of structures (single amino acid set and a dipeptide set) were used to evaluate the performance of our encoded Amber potentials. From the comparison results between energy terms obtained from our encoding and Amber, we find energy differences within ±0.06 kcal/mol for all tested structures. Previously we have shown that the Random Forest (RF) model can help to emphasize more important atom pairwise interactions and ignore insignificant ones [Pei, J.; Zheng, Z.; Merz, K. M. J. Chem. Inf. Model. 2019, 59, 1919-1929]. Here, as an example of combining ML methods with traditional potential functions, we followed the same workflow to combine the RF models with force field potential functions from Amber. To determine the performance of our RF models with force field potential functions, 224 different protein native-decoy systems were used as our training and testing sets. We find that the RF models with ff94 and ff14SB force field parameters outperformed all other scoring functions (RF models with KECSA2, RWplus, DFIRE, dDFIRE, and GOAP) considered in this work for native structure detection, and they performed similarly in detecting the best decoy. 
Through inclusion of best decoy to decoy comparisons in building our RF models, we were able to generate models that outperformed the score functions tested herein both on accuracy and best decoy detection, again showing the performance and flexibility of our RF models to tackle this problem. Finally, the importance of the RF algorithm and force field parameters were also tested and the comparison results suggest that both the RF algorithm and force field potentials are important with the ML scoring function achieving its best performance only by combining them together. All code and data used in this work are available at https://github.com/JunPei000/FFENCODER_for_Protein_Folding_Pose_Selection.

  • Research Article
  • 10.3390/agriculture15010036
UAV-Multispectral Based Maize Lodging Stress Assessment with Machine and Deep Learning Methods
  • Dec 26, 2024
  • Agriculture
  • Minghu Zhao + 4 more

Maize lodging is a prevalent stress that can significantly diminish corn yield and quality. Unmanned aerial vehicle (UAV) remote sensing is a practical means to quickly obtain lodging information at field scale, such as area, severity, and distribution. However, existing studies primarily use machine learning (ML) methods to qualitatively analyze maize lodging (lodging and non-lodging) or estimate the maize lodging percentage, while there is less research using deep learning (DL) to quantitatively estimate maize lodging parameters (type, severity, and direction). This study aims to introduce advanced DL algorithms into the maize lodging classification task using UAV-multispectral images and investigate the advantages of DL compared with traditional ML methods. This study collected a UAV-multispectral dataset containing non-lodging maize and lodging maize with different lodging types, severities, and directions. Additionally, 22 vegetation indices (VIs) were extracted from multispectral data, followed by spatial aggregation and image cropping. Five ML classifiers and three DL models were trained to classify the maize lodging parameters. Finally, we compared the performance of ML and DL models in evaluating maize lodging parameters. The results indicate that the Random Forest (RF) model outperforms the other four ML algorithms, achieving an overall accuracy (OA) of 89.29% and a Kappa coefficient of 0.8852. However, the maize lodging classification performance of DL models is significantly better than that of ML methods. Specifically, Swin-T performs better than ResNet-50 and ConvNeXt-T, with an OA reaching 96.02% and a Kappa coefficient of 0.9574. This can be attributed to the fact that Swin-T can more effectively extract detailed information that accurately characterizes maize lodging traits from UAV-multispectral data. 
This study demonstrates that combining DL with UAV-multispectral data enables a more comprehensive understanding of maize lodging type, severity, and direction, which is essential for post-disaster rescue operations and agricultural insurance claims.

  • Research Article
  • Citations: 5
  • 10.54097/hset.v49i.8513
Sales Prediction of Walmart Sales Based on OLS, Random Forest, and XGBoost Models
  • May 21, 2023
  • Highlights in Science, Engineering and Technology
  • Tian Yang

The technique of estimating future sales levels for a good or service is known as sales forecasting. The corresponding forecasting methods range from initially qualitative analysis to later time series methods, regression analysis and econometric models, as well as machine learning methods that have emerged in recent decades. This paper compares the performance of OLS, Random Forest and XGBoost machine learning models in predicting the sales of Walmart stores. According to the analysis, the XGBoost model has the best sales forecasting ability. In the case of logarithmic sales, the R2 of the XGBoost model is as high as 0.984, while the MSE and MAE are only 0.065 and 0.124, respectively. The XGBoost model is therefore an option when making sales forecasts. These results compare different types of models, identify the best-performing one, and offer guidance for future model selection.
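The evaluation this abstract quotes (R2, MSE and MAE on log-transformed sales) can be sketched as follows. The data are synthetic, the feature names are hypothetical, and scikit-learn's GradientBoostingRegressor stands in for XGBoost to avoid an extra dependency.

```python
# Sketch: scoring a boosting model on log-transformed sales with the three
# metrics quoted above (R2, MSE, MAE). Synthetic data; GradientBoostingRegressor
# stands in for XGBoost.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 5))  # e.g. store, promotion, seasonality features
# Log-sales with one linear and one nonlinear driver plus noise.
log_sales = 3 + X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.2, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, log_sales, test_size=0.25,
                                          random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

r2 = r2_score(y_te, pred)
mse = mean_squared_error(y_te, pred)
mae = mean_absolute_error(y_te, pred)
```

Working on the logarithm of sales, as the paper does, stabilizes the variance of a right-skewed sales distribution; note that the reported errors are then in log units, and predictions must be exponentiated to recover sales levels.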

  • Research Article
  • Citations: 3
  • 10.1007/s10489-022-04327-0
Rapid extraction of skin physiological parameters from hyperspectral images using machine learning
  • Dec 10, 2022
  • Applied Intelligence
  • Teo Manojlović + 3 more

Noninvasive assessment of skin structure using hyperspectral images has been intensively studied in recent years. Due to the high computational cost of the classical methods, such as the inverse Monte Carlo (IMC), much research has been done with the aim of using machine learning (ML) methods to reduce the time required for estimating parameters. This study aims to evaluate the accuracy and the estimation speed of the ML methods for this purpose and compare them to the traditionally used inverse adding-doubling (IAD) algorithm. We trained three models – an artificial neural network (ANN), a 1D convolutional neural network (CNN), and a random forest (RF) model – to predict seven skin parameters. The models were trained on simulated data computed using the adding-doubling algorithm. To improve predictive performance, we introduced a stacked dynamic weighting (SDW) model combining the predictions of all three individually trained models. The SDW model was trained using only a handful of real-world spectra on top of the ANN, CNN and RF models that were trained using simulated data. Models were evaluated based on the estimated parameters’ mean absolute error (MAE), considering the surface inclination angle and comparing skin spectra with spectra fitted by the IAD algorithm. On simulated data, the lowest MAE was achieved by the RF model (0.0030), while the SDW model achieved the lowest MAE on in vivo measured spectra (0.0113). The shortest time to estimate parameters for a single spectrum was 93.05 μs. Results suggest that ML algorithms can produce accurate estimates of human skin optical parameters in near real-time.

  • Research Article
  • 10.1080/22797254.2025.2455940
Modeling of winter wheat yield prediction based on solar-induced chlorophyll fluorescence by machine learning methods
  • Jan 24, 2025
  • European Journal of Remote Sensing
  • Minxue Zheng + 5 more

Timely and accurate prediction of large-scale crop yields is critical for national food security. Solar-induced chlorophyll fluorescence (SIF), an indicator of photosynthesis, has emerged as a promising predictor of crop yields. However, it remains unclear to what extent satellite-based SIF data can predict crop yields at the regional scale compared to the newly proposed Near-Infrared Reflectance of Vegetation (NIRv). Using multiple statistical machine learning (ML) methods, this study investigated the predictive abilities of SIF and NIRv by combining climate data to predict winter wheat yields in five provinces in the North China Plain (NCP). Results showed that: (a) SIF outperformed NIRv in predicting winter wheat yields. However, in the Extreme Gradient Boosting (XGB) model, SIF’s predictive performance was better than that of the combination of SIF and NIRv, indicating that combining SIF and NIRv could not completely enhance SIF’s predictive performance. (b) Random Forest (RF) and XGB models were significantly better than the other models in yield prediction; specifically, the RF model had high stability. The results highlighted the benefits of combining multiple sources of data and revealed the advantages of RF and XGB models in crop yield prediction in the major grain production region.

  • Research Article
  • Citations: 8
  • 10.1029/2022jc018980
Oceanic Primary Production Estimation Based On Machine Learning
  • May 1, 2023
  • Journal of Geophysical Research: Oceans
  • Bo Ping + 3 more

Oceanic primary production (OPP) is crucial for ecosystem services and the global carbon cycle. However, sensitivity to geographic and environmental characteristics limits the application of semi-empirical OPP estimate models, such as the vertically generalized productivity model (VGPM) and its modified version, particularly in coastal regions. In addition, the difficulty in collecting necessary parameters also hampers long-term OPP estimates. Data-driven machine learning (ML) methods can automatically capture the relationships between the input parameters and the objective; hence, they may become new methods for global OPP estimates. In this study, the effectiveness of ML methods to estimate OPP and the key attributes influencing ML performance in different regions and seasons are discussed. First, the ML models obtain a lower root mean square error than the VGPM and Eppley-VGPM. In addition, the random forest (RF) model achieves the best performance among the four selected ML models. The enhancement in the accuracy of OPP estimates based on the RF model is more obvious in coastal regions than in the open ocean. In the four seasons, the RF model obtains better estimates of OPP than the Eppley-VGPM, especially for summer. Moreover, input attributes including sea surface temperature (SST), photosynthetic active radiation (PAR), and chlorophyll-a concentration (Chlor-a) achieve the best performance. The suitable alternative input attributes are SST/Chlor-a in the coastal regions, and single Chlor-a, SST/Chlor-a, and PAR/Chlor-a in the open ocean. Except for the SST/PAR/Chlor-a combination, inputs with Chlor-a, that is, SST/Chlor-a and PAR/Chlor-a, result in relatively acceptable performance in the four seasons.

  • Research Article
  • Citations: 2
  • 10.31035/cg2023056
Comparative study of different machine learning models in landslide susceptibility assessment: A case study of Conghua District, Guangzhou, China
  • Feb 6, 2024
  • China Geology
  • Ao Zhang + 10 more

  • Research Article
  • 10.56038/oprd.v1i1.136
Sales Forecasting System for Van-Sales Channel for FMCG Industry
  • Dec 31, 2022
  • Orclever Proceedings of Research and Development
  • Seza Dursun + 3 more

In the Fast Moving Consumer Goods (FMCG) sector, the availability of sufficient product inventory on the delivery vehicle is directly related to the accuracy of the sales forecasts. Insufficient accuracy of the estimations leads to loss of income and increases secondary costs such as transportation and labor costs. In the current situation, sales forecasts are based on the sales personnel's delivery route, knowledge, experience, and relationships. Since the knowledge and experience of the personnel are not brought into the institutional memory, this information is lost with personnel change, and the new person needs to develop their own experience of the route. Currently, the sales forecasting accuracy rate is calculated as 70%. It has been determined that daily losses of 15% on a product basis and 5% in total occur. In the study carried out within the scope of this research, advanced analytical and machine learning methods that can capture the dynamics of the FMCG industry and effectively analyze the extensive data generated are studied to increase the accuracy and consistency of sales forecasts. Within the scope of the research, machine learning models to be used for sales forecasts were developed using artificial neural network methods. We evaluated the models' performance according to the recall, precision, and accuracy metrics based on the route, point of sale, and product. It was determined that artificial neural networks perform well for sales forecasting. Using artificial neural networks in the experimental study, we achieved an average 5% revenue increase for the three route groups selected as pilots. The sales forecast accuracy rate increased from 78% to 82%.
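The setup this abstract describes, a neural network scored with recall, precision, and accuracy per route/outlet/product, can be sketched as a binary sell/no-sell classifier. Everything below is synthetic and hypothetical: the feature set, the network size, and the data are illustrative stand-ins, not the system the authors built.

```python
# Sketch: an artificial-neural-network classifier for a sell/no-sell decision,
# scored with precision, recall, and accuracy. Synthetic data only.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score

rng = np.random.default_rng(5)
n = 1500
X = rng.normal(size=(n, 6))  # hypothetical route / outlet / product features
# Synthetic target: whether the product sells at this stop on this day.
y = (X[:, 0] - 0.7 * X[:, 1] + rng.normal(0, 0.8, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)
pred = net.predict(X_te)

scores = {"precision": precision_score(y_te, pred),
          "recall": recall_score(y_te, pred),
          "accuracy": accuracy_score(y_te, pred)}
```

In a van-sales deployment the same three metrics would be aggregated per route and per point of sale, since a forecast that is accurate overall can still misload individual vehicles.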

  • Abstract
  • 10.1016/j.jval.2020.04.1006
PND117 IDENTIFYING PREDICTORS OF HIGH-COST MULTIPLE SCLEROSIS PATIENTS: A MACHINE LEARNING APPROACH
  • May 1, 2020
  • Value in Health
  • S.M Burns + 2 more

  • Research Article
  • Citations: 1
  • 10.1016/j.psj.2024.104489
An investigation of machine learning methods applied to genomic prediction in yellow-feathered broilers
  • Nov 1, 2024
  • Poultry Science
  • Bogong Liu + 6 more

  • Research Article
  • Citations: 20
  • 10.1038/s41598-024-56466-8
Machine learning study using 2020 SDHS data to determine poverty determinants in Somalia
  • Mar 12, 2024
  • Scientific Reports
  • Abdirizak A Hassan + 2 more

Extensive research has been conducted on poverty in developing countries using conventional regression analysis, which has limited prediction capability. This study aims to address this gap by applying advanced machine learning (ML) methods to predict poverty in Somalia. Utilizing data from the first-ever 2020 Somalia Demographic and Health Survey (SDHS), a cross-sectional study design is considered. ML methods, including random forest (RF), decision tree (DT), support vector machine (SVM), and logistic regression, are tested and applied using R software version 4.1.2, while conventional methods are analyzed using STATA version 17. Evaluation metrics, such as confusion matrix, accuracy, precision, sensitivity, specificity, recall, F1 score, and area under the receiver operating characteristic (AUROC), are employed to assess the performance of predictive models. The prevalence of poverty in Somalia is notable, with approximately seven out of ten Somalis living in poverty, making it one of the highest rates in the region. Among nomadic pastoralists, agro-pastoralists, and internally displaced persons (IDPs), the poverty average stands at 69%, while urban areas have a lower poverty rate of 60%. The accuracy of prediction ranged between 67.21% and 98.36% for the advanced ML methods, with the RF model demonstrating the best performance. The results reveal geographical region, household size, respondent age group, husband employment status, age of household head, and place of residence as the top six predictors of poverty in Somalia. The findings highlight the potential of ML methods to predict poverty and uncover hidden information that traditional statistical methods cannot detect, with the RF model identified as the best classifier for predicting poverty in Somalia.

  • Research Article
  • Citations: 3
  • 10.1016/j.fbio.2022.102216
Application of multivariate machine learning methods to investigate organic compound content of different pepper spices
  • Nov 22, 2022
  • Food Bioscience
  • Yusuf Durmuş + 1 more
