Cost‐sensitive tree SHAP for explaining cost‐sensitive tree‐based models
Abstract Cost‐sensitive ensemble learning, a combination of ensemble learning and cost‐sensitive learning, enables the generation of cost‐sensitive tree‐based ensemble models using the cost‐sensitive decision tree (CSDT) learning algorithm. In general, tree‐based models offer an intuitive graphical representation that can explain a model's decision‐making process. However, the depth of the tree and the number of base models in the ensemble can be limiting factors in comprehending the model's decision for each sample. CSDT models are widely used in finance (e.g., credit scoring and fraud detection) but lack effective explanation methods. We previously addressed this gap with the cost‐sensitive tree Shapley Additive Explanation method (CSTreeSHAP), a cost‐sensitive tree explanation method for the single‐tree CSDT model. Here, we extend the introduced methodology to cost‐sensitive ensemble models, particularly cost‐sensitive random forest models. The paper details the theoretical foundation and implementation of CSTreeSHAP for both single CSDT and ensemble models. The usefulness of the proposed method is demonstrated by providing explanations for single and ensemble CSDT models trained on well‐known benchmark credit scoring datasets. Finally, we apply our methodology and analyze the stability of explanations for those models compared to the cost‐insensitive tree‐based models. Our analysis reveals statistically significant differences between SHAP values despite seemingly similar global feature importance plots of the models. This highlights the value of our methodology as a comprehensive tool for explaining CSDT models.
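For readers unfamiliar with the quantity that SHAP-style methods attribute to each feature, the following is a minimal sketch of exact Shapley values computed by brute force over all feature orderings. It is not CSTreeSHAP and does not use any tree structure; the function `toy_value` and the feature names are hypothetical stand-ins for a model's expected output given a subset of known features.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical value function: payoff when only these features are known."""
    v = 0.0
    if "income" in coalition:
        v += 2.0
    if "age" in coalition:
        v += 1.0
    if "income" in coalition and "age" in coalition:
        v += 1.0  # interaction term, split evenly between the two features
    return v


phi = shapley_values(["income", "age", "debt"], toy_value)
print(phi)
```

The attributions sum to the difference between the value of the full feature set and the empty set, which is the additivity property that SHAP explanations rely on. Brute-force enumeration grows factorially with the number of features, which is why tree-specific algorithms exist.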
- Research Article
7
- 10.3390/e21020198
- Feb 19, 2019
- Entropy
Uncertainty evaluation based on statistical probabilistic information entropy is a commonly used mechanism for constructing heuristics in decision tree learning. The entropy kernel potentially links its deviation and decision tree classification performance. This paper presents a decision tree learning algorithm based on constrained gain and depth induction optimization. Firstly, the calculation and analysis of single- and multi-value event uncertainty distributions of information entropy is followed by an enhanced property of single-value event entropy kernel and multi-value event entropy peaks as well as a reciprocal relationship between peak location and the number of possible events. Secondly, this study proposes an estimation method for information entropy whose entropy kernel is replaced with a peak-shift sine function to establish a decision tree learning (CGDT) algorithm on the basis of constraint gain. Finally, by combining branch convergence and fan-out indices under an inductive depth of a decision tree, we built a constraint gained and depth inductive improved decision tree (CGDIDT) learning algorithm. Results show the benefits of the CGDT and CGDIDT algorithms.
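As background for the entropy-based split heuristic mentioned above, here is a minimal sketch of standard Shannon entropy and information gain. It illustrates the conventional baseline only; the peak-shift sine approximation and the CGDT/CGDIDT algorithms from the abstract are not reproduced here, and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction obtained by partitioning `labels` into `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder


parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers the full parent entropy as gain, while a split that leaves each branch as mixed as the parent gains nothing.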
- Research Article
2
- 10.1029/2023ea003311
- Feb 1, 2024
- Earth and Space Science
Satellite infrared detectors cannot penetrate clouds, especially precipitating clouds. Improving precipitation estimation accuracy based on infrared brightness temperature has always been important but challenging. In this paper, based on the infrared brightness temperature of the Advanced Geosynchronous Radiation Imager (AGRI) onboard China's Feng‐Yun 4A satellite, we develop and evaluate a new precipitation estimation method. First, using static data, physical characteristics of clouds, cloud image texture features, temporal motion features, and AGRI infrared channel brightness temperature, we construct features for a machine learning model. Then, we develop precipitation estimation methods. Precipitation is estimated in two steps: classification and regression. We employ a random forest classification model to identify whether there is precipitation in a given field of view. If there is precipitation, a multi‐model ensemble regression learning method is used to estimate the areas with this precipitation. The ensemble learning method uses convex optimization to integrate prediction results based on the optimization of hyperparameters of five basic models (i.e., those of random forest, XGBoost, LightGBM, decision tree, and extra tree models). Furthermore, two regression stacking ensemble models—the Least Absolute Shrinkage and Selection Operator (herein referred to as Stacking1‐LASSO) and K‐nearest neighbor (herein referred to as Stacking2‐KNN)—are used to predict the results of the aforementioned basic models. The results of basic models are used as inputs of these two stacking models. Finally, based on the Integrated Multi‐satellitE Retrievals for GPM (IMERG) precipitation product and rain gauge precipitation data, we conduct precipitation estimation experiments and evaluate our methods. The results show that ensemble learning models have greater accuracy in estimating precipitation than the basic models. 
When using IMERG precipitation as the target precipitation, ensemble learning models can estimate the central area of heavy precipitation during typhoons Ampil and Maria. The ensemble learning estimation effect is better than that of Stacking2‐KNN. Moreover, when rain gauge data is used as the target precipitation, ensemble learning can also estimate the center of heavy precipitation, in good agreement with recorded satellite brightness temperature data.
- Research Article
1
- 10.16250/j.32.1915.2024136
- Dec 12, 2024
- Zhongguo xue xi chong bing fang zhi za zhi = Chinese journal of schistosomiasis control
To predict the potential geographic distribution of Oncomelania hupensis in Yunnan Province using random forest (RF) and maximum entropy (MaxEnt) models, so as to provide insights into O. hupensis surveillance and control in Yunnan Province. The O. hupensis snail survey data in Yunnan Province from 2015 to 2016 were collected and converted into O. hupensis snail distribution site data. Data of 22 environmental variables in Yunnan Province were collected, including twelve climate variables (annual potential evapotranspiration, annual mean ground surface temperature, annual precipitation, annual mean air pressure, annual mean relative humidity, annual sunshine duration, annual mean air temperature, annual mean wind speed, ≥ 0 ℃ annual accumulated temperature, ≥ 10 ℃ annual accumulated temperature, aridity and index of moisture), eight geographical variables (normalized difference vegetation index, landform type, land use type, altitude, soil type, soil texture-clay content, soil texture-sand content and soil texture-silt content) and two population and economic variables (gross domestic product and population). Variables were screened with Pearson correlation test and variance inflation factor (VIF) test. The RF and MaxEnt models and the ensemble model were created using the biomod2 package of the software R 4.2.1, and the potential distribution of O. hupensis snails after 2016 was predicted in Yunnan Province. The predictive effects of models were evaluated through cross-validation and independent tests, and the area under the receiver operating characteristic curve (AUC), true skill statistics (TSS) and Kappa statistics were used for model evaluation. 
In addition, the importance of environmental variables was analyzed, the contribution of environmental variables output by the models with AUC values of > 0.950 and TSS values of > 0.850 were selected for normalization processing, and the importance percentage of environmental variables was obtained to analyze the importance of environmental variables. Data of 148 O. hupensis snail distribution sites and 15 environmental variables were included in training sets of RF and MaxEnt models, and both RF and MaxEnt models had high predictive performance, with both mean AUC values of > 0.900 and all mean TSS values and Kappa values of > 0.800, and significant differences in the AUC (t = 19.862, P < 0.05), TSS (t = 10.140, P < 0.05) and Kappa values (t = 10.237, P < 0.05) between two models. The AUC, TSS and Kappa values of the ensemble model were 0.996, 0.954 and 0.920, respectively. Independent data verification showed that the AUC, TSS and Kappa values of the RF model and the ensemble model were all 1, which still showed high performance in unknown data modeling, and the MaxEnt model showed poor performance, with TSS and Kappa values of 0 for 24%(24/100) of the modeling results. The modeling results of 79 RF models, 38 MaxEnt models and their ensemble models with AUC values of > 0.950 and TSS values of > 0.850 were included in the evaluation of importance of environmental variables. The importance of annual sunshine duration (SSD) was 32.989%, 37.847% and 46.315% in the RF model, the MaxEnt model and their ensemble model, while the importance of annual mean relative humidity (RHU) was 30.947%, 15.921% and 28.121%, respectively. Important environment variables were concentrated in modeling results of the RF model, dispersed in modeling results of the MaxEnt model, and most concentrated in modeling results of the ensemble model. The potential distribution of O. 
hupensis snails after 2016 was predicted to be relatively concentrated in Yunnan Province by the RF model and relatively large by the MaxEnt model, and the distribution of O. hupensis snails predicted by the ensemble model was mostly the joint distribution of O. hupensis snails predicted by RF and MaxEnt models. Both RF and MaxEnt models are effective to predict the potential distribution of O. hupensis snails in Yunnan Province, which facilitates targeted O. hupensis snail control.
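The TSS and Kappa statistics used for model evaluation above can both be derived from a binary confusion matrix. The sketch below uses the standard definitions (TSS = sensitivity + specificity - 1, and Cohen's kappa); the counts are hypothetical and are not taken from the study.

```python
def tss_and_kappa(tp, fp, fn, tn):
    """Return (TSS, Cohen's kappa) from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    tss = sensitivity + specificity - 1

    observed = (tp + tn) / n  # observed agreement
    # Agreement expected by chance from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return tss, kappa


# Hypothetical counts: 45 presence sites and 55 absence sites.
tss, kappa = tss_and_kappa(tp=40, fp=10, fn=5, tn=45)
print(round(tss, 3), round(kappa, 3))
```

Both statistics equal 1 for a perfect classifier and are near 0 for one that performs no better than chance, but only Kappa depends on prevalence through the chance-agreement term.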
- Research Article
18
- 10.1007/s11063-016-9528-8
- Jun 8, 2016
- Neural Processing Letters
Learning from data streams is a challenging task which demands a learning algorithm with several high quality features. In addition to space complexity and speed requirements needed for processing the huge volume of data which arrives at high speed, the learning algorithm must have a good balance between stability and plasticity. This paper presents a new approach to induce incremental decision trees on streaming data. In this approach, the internal nodes contain trainable split tests. In contrast with traditional decision trees in which a single attribute is selected as the split test, each internal node of the proposed approach contains a trainable function based on multiple attributes, which not only provides the flexibility needed in the stream context, but also improves stability. Based on this approach, we propose evolving fuzzy min–max decision tree (EFMMDT) learning algorithm in which each internal node of the decision tree contains an evolving fuzzy min–max neural network. EFMMDT splits the instance space non-linearly based on multiple attributes which results in much smaller and shallower decision trees. The extensive experiments reveal that the proposed algorithm achieves much better precision in comparison with the state-of-the-art decision tree learning algorithms on the benchmark data streams, especially in the presence of concept drift.
- Research Article
- 10.1155/acis/5211419
- Jan 1, 2025
- Applied Computational Intelligence and Soft Computing
This study presents ensemble machine learning (ML) models for predicting residential energy consumption in South Africa. By combining the best features of individual ML models, ensemble models reduce the drawbacks of each model and improve prediction accuracy. We present four ensemble models: ensemble by averaging (EA), ensemble by stacking each estimator (ESE), ensemble by boosting (EB), and ensemble by voting estimator (EVE). These models are built on top of Random Forest (RF) and Decision Tree (DT). These base predictor models leverage historical energy consumption patterns to capture temporal intricacies, including seasonal variations and rolling averages. In addition, we employed feature engineering methodologies to further enhance their predictive abilities. The accuracy of each ensemble model was evaluated by assessing various performance indicators, including the mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination R2. Overall, the findings illustrate the efficiency of ensemble learning models in providing accurate predictions for residential energy consumption. This study provides valuable insights for researchers and practitioners in predicting energy consumption in residential buildings and the benefits of using ensemble learning models in the building and energy research domains.
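The four performance indicators named above have standard definitions, sketched below in plain Python. The consumption values are hypothetical and only illustrate the arithmetic; MAPE as written assumes no true value is zero.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, MAPE (in percent) and R^2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        # Assumes every true value is non-zero.
        "mape": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical monthly consumption (kWh) and model predictions.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MSE and MAE are expressed in the units of the target (squared and linear respectively), whereas MAPE and R^2 are scale-free, which is why studies often report several of them together.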
- Research Article
20
- 10.1080/10106049.2022.2152493
- Nov 24, 2022
- Geocarto International
Machine learning models are gradually replacing traditional techniques used for landslide susceptibility assessment. This study aims to comprehensively compare multiple models, including linear, nonlinear, and ensemble models, based on 5281 historical landslides in southwest China, the area most severely affected by the landslide disaster. Linear models represented by logistic regression (LR), nonlinear models represented by support vector machine (SVM), artificial neural network (ANN) and classification 5.0 decision tree (C5.0 DT), and ensemble models represented by random forest (RF) and categorical boosting (Catboost) were selected. The correlation coefficient, variance inflation factor (VIF), and relative importance analysis were used to select the dominant landslide conditioning factors. Multiple statistical indicators (e.g. Area Under the Receiver Operating Characteristic curve (AUC) and Kappa), cross-validation and qualitative methods were used to evaluate the models’ performance. The findings are: (1) Regarding the model predictive performance, the best predictive performance was demonstrated by the ensemble models Catboost (AUC = 0.823 and Kappa = 0.593) and RF (AUC = 0.821 and Kappa = 0.582), followed by the nonlinear models SVM (AUC = 0.775 and Kappa = 0.520), ANN (AUC = 0.770 and Kappa = 0.486) and C5.0 DT (AUC = 0.751 and Kappa = 0.497), while the linear model LR (AUC = 0.756 and Kappa = 0.456) had a more limited performance. The ensemble model, which uses a tree as its baseline classifier, has a lot of potential for studies into the landslide susceptibility. (2) Regarding the model robustness, the three types of models in nonspatial cross-validation (CV) performed relatively similarly in terms of predictive power, while in spatial cross-validation (SPCV), the linear model LR (median AUC = 0.714) achieved better results than the ensemble and nonlinear models. 
It implies that when the distribution of landslides is not homogeneous, linear models may be the most robust. It is advisable to consider various evaluation metrics from different perspectives and integrate them with specialist qualitative geomorphological empirical knowledge to determine the best model. (3) The Gini index-based RF model suggests that road density was the dominant factor in the frequency of landslides in the study area.
- Research Article
145
- 10.1016/j.gsf.2023.101645
- Jun 7, 2023
- Geoscience Frontiers
Ensemble learning framework for landslide susceptibility mapping: Different basic classifier and ensemble strategy
- Abstract
3
- 10.1093/ehjdh/ztac076.2784
- Dec 22, 2022
- European Heart Journal. Digital Health
Background: Robust and accurate risk prediction models are much needed in cardiovascular disease. It is well-known that mental health is associated with the risk of developing cardiovascular disease. It is unknown whether mental health markers can enhance existing risk prediction models for cardiovascular disease. Purpose: The main purpose of this study was to assess capability of mental health factors along with traditional risk factors to be used in cardiovascular predictive machine learning models, and to develop a combined machine learning approach using both traditional risk and psychological factors in 375,145 participants of the UK Biobank. Methods: A comprehensive Pearson correlation analysis is carried out on UK Biobank data. Subsequently, an ensemble model containing decision tree, random forest, XGBoost, support vector machine (SVM), and deep neural network (DNN) classification approaches was built to predict cardiovascular diseases (CVD) in UK Biobank participants. The model was first trained using traditional cardiovascular risk factors, and subsequently trained using a combination of cardiovascular risk and psychological factors. Results: The correlation analysis revealed that there is a correlation between CVD and mental health factors suggesting the potential of mental health application for machine learning models. Our ensemble machine learning model was able to predict CVD with an accuracy of 73.49% using CVD risk factors alone. However, by combining psychological factors with CVD risk factors in the training data, an improved accuracy of 95.70% was achieved. 
The accuracy and robustness of ensemble machine learning model outperformed any of five constituent learning algorithms alone. Conclusions: Our results suggest that mental health assessment data along with traditional risk factors provides a powerful, safe and affordable machine learning model enrichment that can be used for state-of-the-art prediction of CVD. Funding Acknowledgement: Type of funding sources: None. Figure 1. Overview of CVD + Mental risk model. Figure 2. Results from the ensemble model.
- Research Article
4
- 10.17485/ijst/v15i7.1715
- Feb 21, 2021
- Indian Journal of Science and Technology
Background/Objectives: Recent studies have emphasized using ensemble models over single ones to solve credit scoring problems. The objective of this study is to build a heterogeneous ensemble classifier model with an improved classification accuracy. Methods: This study focuses on developing a heterogeneous ensemble classifier using Logistic Regression, K-nearest neighbor, Decision tree, Random Forest, Naïve Bayes and Support vector machine as base classifiers and Random Forest, Logistic Regression and Support vector machine as meta-classifiers. The proposed model is built using these six base classifiers for ensemble aggregation. A feature selection algorithm based on the random forest technique is used for selecting the best features. Stacking and voting methods are used for building the ensemble model. Findings: The ensemble classifier gives better predictive performance than the single classifiers SVM, DT, RF, NB, KNN and LR, with an accuracy of 91.56% for the Australian dataset and 84.35% for the German dataset. Novelty: The proposed model uses stacking and majority voting methods for ensemble classification. Initially, stacking is applied to the base classifiers. This is done in two levels. First the training dataset is split into 10 folds for cross validation. The output of each classifier is taken, and the dataset is updated with the meta-features. In the second level, three meta-classifiers (MC), namely LR, SVM and RF are used. Majority voting is applied to the output of these meta-classifiers for the prediction. Keywords: Credit scoring; ensemble model; SVM; DT; RF; NB; KNN; LR
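The final aggregation step described above, majority voting over the outputs of three meta-classifiers, can be sketched as follows. This covers only the voting stage, not the stacking or cross-validation stages; the prediction lists are hypothetical, and with an odd number of binary voters no tie-breaking rule is needed.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` is a list of equal-length lists, one per classifier.
    """
    voted = []
    for labels in zip(*predictions):  # all votes for one sample
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of the three meta-classifiers for four applicants
# (1 = good credit, 0 = bad credit).
lr_pred = [1, 0, 1, 1]
svm_pred = [1, 1, 0, 1]
rf_pred = [0, 0, 1, 1]

final = majority_vote([lr_pred, svm_pred, rf_pred])
print(final)
```

Each final label is the class predicted by at least two of the three meta-classifiers, so a single dissenting model cannot change the outcome.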
- Research Article
1
- 10.1038/s41598-026-37122-9
- Feb 5, 2026
- Scientific reports
Alzheimer's disease is a progressive neurodegenerative disorder characterized by memory loss and cognitive decline, with no known cure. Early detection of dementia, a primary manifestation of Alzheimer's disease, is critical to enable timely intervention and treatment planning. This study introduces ensemble learning models for predicting Alzheimer's disease and presents a comparative analysis between traditional machine learning and advanced ensemble models. The evaluation is conducted using the "Open Access Series of Imaging Studies" 2 (OASIS-2) dataset. Traditional models, including logistic regression, decision tree, support vector machine, and random forest, are benchmarked against ensemble models such as adaptive boosting, extreme gradient boosting, and a hyperparameter-tuned majority voting ensemble model. Performance is assessed using accuracy, precision, and the area under the receiver operating characteristic curve. Results show that ensemble models, particularly the optimized majority voting classifier, consistently outperform traditional methods. To complement the supervised comparison, exploratory unsupervised methods were applied using multiple correspondence analysis and k-means clustering to uncover latent structures in the dataset. By categorizing all variables, these unsupervised methods highlight patterns of clinical and demographic similarity. Unlike prior studies that focus solely on predictive accuracy, this work integrates supervised classification, ensemble learning, and unsupervised exploratory analysis within a unified framework. This combined approach enables both robust performance comparison and deeper insights into latent data structures relevant to Alzheimer's disease. All computational experiments were conducted using the Python programming language.
- Research Article
76
- 10.1007/s13369-019-03841-7
- Apr 10, 2019
- Arabian Journal for Science and Engineering
This article investigates the competence of ensemble learning techniques in solar irradiance prediction. The literature survey showed that random forests, an ensemble tree model, are studied more frequently than other ensemble models. However, ensembles of support vector regression (SVR) and artificial neural networks (ANN) are also possible. This study is therefore the first detailed evaluation of ensemble models in the solar irradiance estimation domain. Boosting and bagging ensembles of SVR, ANN and decision tree (DT) are developed to estimate solar irradiance on an hourly basis in five cities in Turkey. First, frequently used base models (SVR, ANN, and DT) are created and tested using 5 years of meteorological data. Then boosting and bagging ensembles of the base models are developed and tested with the same data. The base models are compared with their ensemble counterparts in terms of average coefficient of determination (R2) and root mean squared error (RMSE). The comparative results show that boosting and bagging ensemble models improve SVR, ANN, and DT in terms of RMSE by between 4.6 and 14.6% on average. The results show empirically that ensemble models improve the prediction accuracy of various base regression models and that they can be applied to other machine learning models used in solar irradiance prediction.
- Research Article
1
- 10.1108/jfc-08-2024-0264
- Mar 25, 2025
- Journal of Financial Crime
Purpose This study aims to propose a new ensemble learning model and compare its performance with other ensemble models to obtain the best model for detecting financial statement fraud during the COVID-19 pandemic. Design/methodology/approach This study uses a quantitative approach, using secondary data from financial reports, annual reports, regulatory reports and other information on the internet. It focuses on all companies listed on the Indonesia Stock Exchange from 2020 to 2023. The independent variables in this study are financial and nonfinancial variables. In contrast, the target variable for fraudulent financial reports is based on sanctions from regulators and the company’s special supervisory status. Findings The results show that the ensemble blending model performs best in detecting financial statement fraud compared to the ensemble models that constitute it. Research limitations/implications This study sets ensemble learning to default settings. Setting certain conditions can further improve the performance of ensemble learning models. Practical implications This study can broaden the insights of practitioners, academics, investors, regulators, stakeholders and corporate finance experts into detecting financial report fraud. Originality/value This study proposes a new ensemble learning model that previous studies have not discussed. This ensemble learning model performs best compared to other ensemble learning models.
- Research Article
80
- 10.1007/s10661-019-7362-y
- Mar 27, 2019
- Environmental Monitoring and Assessment
Groundwater resources are facing a high pressure due to drought and overexploitation. The main aim of this research is to apply rotation forest (RTF) with decision trees as base classifiers and an improved ensemble methodology based on evidential belief function and tree-based models (EBFTM) for preparing groundwater potential maps (GPM). The performance of these new models is then compared with three previously implemented models, i.e., boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). For this purpose, spring locations in the Meshgin Shahr in Iran were detected. The spring locations were randomly categorized into training (70% of the locations) and validation (30% of the locations) datasets. Furthermore, several groundwater conditioning factors (GCFs) such as hydrogeological, topographical, and land use factors were mapped and regarded as input variables. The tree-based algorithms (i.e., BRT, CART, RF, and RTF) were applied by implementing the input variables and training dataset. The groundwater potential values (i.e., spring occurrence probability) obtained by the BRT, CART, RF, and RTF models for all the pixels of the study area were classified into four potential classes and then used as inputs of the EBF model to construct the new ensemble model (i.e., EBFTM). At last, this paper implemented a receiver operating characteristics (ROC) curve for determining the efficiency of the EBFTM, RTF, BRT, CART, and RF methods. The findings illustrated that the EBFTM had the highest efficacy with an area under the ROC curve (AUC) of 90.4%, followed by the RF, BRT, CART, and RTF models with AUC-ROC values of 90.1, 89.8, 86.9, and 86.2%, respectively. Thus, it could be inferred that the ensemble approach is capable of improving the efficacy of the single tree-based models in GPM production.
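The AUC values compared above can be understood as the probability that a randomly chosen positive location receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that pairwise definition; the scores and labels are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical spring-occurrence probabilities for five validation pixels
# (label 1 = spring present, 0 = absent).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))
```

Because the statistic depends only on the ranking of scores, models whose outputs are on different scales can still be compared by AUC.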
- Research Article
6
- 10.3389/fphys.2024.1357404
- Apr 11, 2024
- Frontiers in Physiology
Objectives: An accurate prediction model for hyperuricemia (HUA) in adults remains unavailable. This study aimed to develop a stacking ensemble prediction model for HUA to identify high-risk groups and explore risk factors. Methods: A prospective health checkup cohort of 40899 subjects was examined and randomly divided into training and validation sets at a ratio of 7:3. LASSO regression was employed to screen out important features and then ROSE sampling was used to handle the imbalanced classes. An ensemble model using a stacking strategy was constructed based on three individual models, including support vector machine, decision tree C5.0, and eXtreme gradient boosting. Model validations were conducted using the area under the receiver operating characteristic curve (AUC) and the calibration curve, as well as metrics including accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score. A model agnostic instance level variable attributions technique (iBreakdown) was used to illustrate the black-box nature of our ensemble model, and to identify contributing risk factors. Results: Fifteen important features were screened out of 23 clinical variables. Our stacking ensemble model, with an AUC of 0.854, outperformed the other three models, support vector machine, decision tree C5.0, and eXtreme gradient boosting, with AUCs of 0.848, 0.851 and 0.849 respectively. Calibration accuracy as well as other metrics including accuracy, specificity, negative predictive value, and F1 score also demonstrated our ensemble model's superiority. The contributing risk factors were estimated using six randomly selected subjects, which showed that being female and relatively younger, together with having higher baseline uric acid, body mass index, γ-glutamyl transpeptidase, total protein, triglycerides, creatinine, and fasting blood glucose can increase the risk of HUA. 
To further validate our model's applicability in the health checkup population, we used another cohort of 8559 subjects that also showed our ensemble prediction model had favorable performances with an AUC of 0.846. Conclusion: In this study, the stacking ensemble prediction model for HUA was developed, and it outperformed three individual models that compose it (support vector machine, decision tree C5.0, and eXtreme gradient boosting). The contributing risk factors were identified with insightful ideas.
- Research Article
10
- 10.1108/ejmbe-08-2023-0244
- Jan 1, 2025
- European Journal of Management and Business Economics
Purpose: Cryptocurrency markets are gaining popularity, with over 23,000 cryptocurrencies in 2023 and a total market valuation of 870.81 billion USD in 2023. With its increasing popularity, cryptocurrencies are also susceptible to volatility. Predicting the price with the least fallacy or more accuracy has become the need of the hour as it significantly influences investment decisions. Design/methodology/approach: This study aims to create a dynamic forecasting model using the ensemble method and test the forecasting accuracy of top 15 cryptocurrencies’ prices. Statistical and econometric model prediction accuracy is examined after hyper tuning the parameters. Drawing inferences from the statistical model, an ensemble model using machine learning (ML) algorithms is developed using gradient-boosted regressor (GBR), random forest regressor (RFR), support vector regression (SVR) and multi-layer perceptron (MLP). Validation curves are utilized to optimize model parameters and boost prediction accuracy. Findings: It is found that when the price movement exhibits autocorrelation, the autoregressive integrated moving average (ARIMA) model and the ensemble model performed better. ARIMA, simple linear regression (SLR), random forest (RF), decision tree (DT), gradient boosting (GB) and multi-model regression (MLR) ensemble models performed well with coins, showing that trends, seasonality and historical price patterns are prominent. Furthermore, the MLR approach produces more accurate predictions for coins with higher volatility and irregular price patterns. Research limitations/implications: Although the dataset includes crisis period data, anomalies or outliers are yet to be explicitly excluded from the analysis. The models employed in this study still demonstrate high accuracy in predicting cryptocurrency prices despite these outliers, suggesting that the models are robust enough to handle unexpected fluctuations or extreme events in the market. 
However, the lack of specific analysis on the impact of outliers on model performance is a limitation of the study, as it needs to fully explore the resilience of the forecasting models under adverse market conditions. Practical implications: The present study contributes to the body of literature on ensemble methods in forecasting crypto price in general, potentially influencing future studies on price forecasting. The study motivates the researchers on empirical testing of our framework on various asset classes. As a result, on the prediction ability of ensemble model, the study will significantly influence the decision-making process of traders and investors. The research benefits the traders and investors to effectively develop a model to forecast cryptocurrency price. The findings highlight the potential of ensemble model in predicting high volatile cryptocurrencies and other financial assets. Investors can design the investment strategies and asset allocation decisions by understanding the relationship between market trends and consumer behavior. Investors can enhance portfolio performance and mitigate risk by incorporating these insights into their decision-making processes. Policymakers can use this information to design more effective regulations and policies promoting economic stability and consumer welfare. The study emphasizes the need for using diversified model to understand the market dynamics and improving trading strategies. Originality/value: This research, to the best of our knowledge, is the first to use the above models to develop an ensemble model on the data for which the outliers have not been adjusted, and the model still outperformed the other statistical, econometric, ML and deep learning (DL) models.