Provable Boolean interaction recovery from tree ensemble obtained via random forests

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
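
As an illustration of the quantity the abstract defines, here is a minimal Python sketch that tallies how often a set of signed features co-occurs along root-to-leaf paths of a fitted scikit-learn forest, weighting each leaf by 2^(-depth). The weighting and traversal are plausible stand-ins, not the paper's exact DWP formula, and `dwp` and `signed_paths` are hypothetical helper names.

```python
# Hedged sketch: a depth-weighted co-occurrence tally, NOT the paper's
# exact DWP definition. A signed feature is a (feature index, sign) pair,
# where "+" means the path took the right (x > threshold) branch.
from sklearn.ensemble import RandomForestRegressor
import numpy as np

def signed_paths(estimator):
    """Yield (depth, frozenset of signed features) for each root-to-leaf path."""
    t = estimator.tree_
    stack = [(0, 0, frozenset())]            # (node id, depth, signed set)
    while stack:
        node, depth, signed = stack.pop()
        if t.children_left[node] == -1:      # leaf node
            yield depth, signed
            continue
        f = int(t.feature[node])
        stack.append((t.children_left[node],  depth + 1, signed | {(f, "-")}))
        stack.append((t.children_right[node], depth + 1, signed | {(f, "+")}))

def dwp(forest, s_pm):
    """Depth-weighted frequency with which all signed features in s_pm
    co-occur on a root-to-leaf path, averaged over the trees."""
    s_pm = frozenset(s_pm)
    scores = []
    for est in forest.estimators_:
        num = den = 0.0
        for depth, signed in signed_paths(est):
            w = 2.0 ** (-depth)              # shallower leaves weigh more
            den += w
            num += w * s_pm.issubset(signed)
        scores.append(num / den)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 10))
y = 4.0 * ((X[:, 0] > 0.5) & (X[:, 1] > 0.5))    # one Boolean interaction
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(dwp(rf, {(0, "+"), (1, "+")}))             # true interaction: large
print(dwp(rf, {(2, "+"), (3, "+")}))             # noise pair: small
```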

Similar Papers
  • Research Article
  • Cited by: 44
  • 10.1016/j.csbj.2022.06.037
Evaluating the performance of random forest and iterative random forest based methods when applied to gene expression data
  • Jan 1, 2022
  • Computational and Structural Biotechnology Journal
  • Angelica M Walker + 7 more


  • Research Article
  • Cited by: 20
  • 10.1371/journal.pone.0190826
A Kolmogorov-Smirnov test for the molecular clock based on Bayesian ensembles of phylogenies.
  • Jan 4, 2018
  • PLOS ONE
  • Fernando Antoneli + 3 more

Divergence date estimates are central to understanding evolutionary processes and depend, in the case of molecular phylogenies, on tests of molecular clocks. Here we propose two non-parametric tests of strict and relaxed molecular clocks built upon a framework that uses the empirical cumulative distribution (ECD) of branch lengths obtained from an ensemble of Bayesian trees and the well-known non-parametric (one-sample and two-sample) Kolmogorov-Smirnov (KS) goodness-of-fit tests. In the strict-clock case, the method consists of using the one-sample KS test to directly test whether the phylogeny is clock-like, in other words, whether it follows a Poisson law. The ECD is computed from the discretized branch lengths, and the parameter λ of the expected Poisson distribution is calculated as the average branch length over the ensemble of trees. To compensate for auto-correlation in the ensemble of trees and for pseudo-replication, we take advantage of thinning and effective sample size, two features provided by Bayesian inference MCMC samplers. Finally, we observe that tree topologies with very long or very short branches lead to Poisson mixtures, and in this case we propose the use of the two-sample KS test with samples from two continuous branch-length distributions, one obtained from an ensemble of clock-constrained trees and the other from an ensemble of unconstrained trees. In this second form, the test can also be applied to relaxed clock models. Using a statistically equivalent ensemble of phylogenies to obtain the branch-length ECD, instead of one consensus tree, considerably reduces small-sample-size effects and provides a gain of power.
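
A minimal sketch of the strict-clock step as the abstract describes it, assuming synthetic branch lengths in place of a Bayesian tree ensemble and omitting the paper's thinning/effective-sample-size corrections: discretize branch lengths, estimate λ as their mean, and run a one-sample KS test against Poisson(λ).

```python
# Hedged sketch under stated assumptions; the bin width and toy data are
# arbitrary choices, not values from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
branch_lengths = rng.gamma(shape=2.0, scale=0.05, size=5000)  # toy ensemble

counts = np.round(branch_lengths / 0.01).astype(int)  # discretize (bin = 0.01)
lam = counts.mean()                                   # Poisson rate estimate
stat, pval = stats.kstest(counts, "poisson", args=(lam,))
print(f"KS statistic = {stat:.3f}, p-value = {pval:.3g}")
```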

  • Research Article
  • Cited by: 332
  • 10.1073/pnas.1711236115
Iterative random forests to discover predictive and stable high-order interactions
  • Jan 19, 2018
  • Proceedings of the National Academy of Sciences of the United States of America
  • Sumanta Basu + 3 more

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as the RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified interesting fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.
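
The outer loop of iRF (fit, extract importances, refit with importance-weighted feature sampling) can be caricatured in a few lines. scikit-learn cannot reweight features inside split selection, so this hedged stand-in instead gives each tree an importance-weighted random subset of columns; `irf_like` is a hypothetical name, and the real iRF additionally runs random intersection trees to extract the interactions themselves.

```python
# Crude, hedged caricature of iRF's iterative reweighting, not the algorithm
# as published: each tree sees an importance-weighted column subset.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def irf_like(X, y, n_iter=3, n_trees=50, m=None, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    m = m or max(1, p // 3)
    w = np.full(p, 1.0 / p)                      # iteration 0: uniform weights
    for _ in range(n_iter):
        imp = np.zeros(p)
        for _ in range(n_trees):
            cols = rng.choice(p, size=m, replace=False, p=w)
            rows = rng.integers(0, len(X), size=len(X))   # bootstrap rows
            tree = DecisionTreeRegressor().fit(X[np.ix_(rows, cols)], y[rows])
            imp[cols] += tree.feature_importances_
        # Feed importances back as weights, smoothed so none hits zero.
        w = 0.9 * imp / imp.sum() + 0.1 / p
    return w

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 20))
y = 3.0 * ((X[:, 2] > 0.5) & (X[:, 7] > 0.5)) + rng.normal(0, 0.1, 1000)
print(irf_like(X, y).round(3))    # weights should concentrate on features 2, 7
```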

  • Research Article
  • Cited by: 116
  • 10.1109/tsg.2010.2052935
Catastrophe Predictors From Ensemble Decision-Tree Learning of Wide-Area Severity Indices
  • Sep 1, 2010
  • IEEE Transactions on Smart Grid
  • Innocent Kamwa + 2 more

Catastrophe precursors are essential prerequisites for response-based remedial action schemes, at both the protective and the operator levels. In this paper, wide-area severity indices (WASI) derived from PMU measurements serve as the basis for building fast catastrophe predictors using random-forest (RF) learning. Given the randomness in the ensemble of decision trees (DTs) stacked in the RF model, it can provide at the recall stage not only an early assessment of the stable/unstable status of an ongoing contingency but also a probability outcome that quantifies the confidence level of the decision. This methodology, which to the best of our knowledge is new to the dynamic security assessment (DSA) of power systems, is also very effective in evaluating the importance of, and the interaction among, the various WASI input features. Our research unexpectedly showed that the ensemble of trees in the RF is very robust in the presence of small changes in the training data and generalizes across widely different network dynamics. Thus, the same RF performed very well on a large database with more than 60,000 instances from a test system (10%) and an actual system (90%) combined. One such general RF (with 210 trees) boosted the reliability of a 9-cycle catastrophe predictor to 99.9%, compared to only 70% when a single conventionally trained DT is used.
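
A hedged sketch of the recall-stage idea on toy data (standing in for the WASI features derived from PMU measurements): the forest returns a class label plus an averaged per-tree probability that can be read as a confidence level, mirroring the 210-tree configuration the abstract mentions.

```python
# Toy stand-in for the paper's WASI/PMU data; only the label-plus-confidence
# output pattern is illustrated here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
rf = RandomForestClassifier(n_estimators=210, random_state=0).fit(X, y)

# predict_proba averages the per-tree class probabilities; read it as a
# confidence level attached to the stable/unstable call.
for p in rf.predict_proba(X[:5])[:, 1]:
    print("unstable" if p >= 0.5 else "stable",
          f"(confidence {max(p, 1 - p):.2f})")
```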

  • Research Article
  • Cited by: 23
  • 10.1186/s12859-019-3104-y
Network inference with ensembles of bi-clustering trees
  • Oct 28, 2019
  • BMC Bioinformatics
  • Konstantinos Pliakos + 1 more

Background: Network inference is crucial for biomedicine and systems biology. Biological entities and their associations are often modelled as interaction networks; examples include drug-protein interaction and gene regulatory networks. Studying and elucidating such networks can lead to the comprehension of complex biological processes. However, we usually have only partial knowledge of those networks, and the experimental identification of all the existing associations between biological entities is very time-consuming and particularly expensive. Many computational approaches have been proposed over the years for network inference; nonetheless, efficiency and accuracy remain open problems. Here, we propose bi-clustering tree ensembles as a new machine learning method for network inference, extending traditional tree-ensemble models to the global network setting. The proposed approach addresses the network inference problem as a multi-label classification task. More specifically, the nodes of a network (e.g., drugs or proteins in a drug-protein interaction network) are modelled as samples described by features (e.g., chemical structure similarities or protein sequence similarities). The labels in our setting represent the presence or absence of links connecting the nodes of the interaction network (e.g., drug-protein interactions in a drug-protein interaction network). Results: We extended traditional tree-ensemble methods, such as extremely randomized trees (ERT) and random forests (RF), to ensembles of bi-clustering trees, integrating background information from both node sets of a heterogeneous network into the same learning framework. We performed an empirical evaluation, comparing the proposed approach to currently used tree-ensemble based approaches as well as other approaches from the literature, and demonstrated the effectiveness of our approach in different interaction prediction (network inference) settings. For evaluation purposes, we used several benchmark datasets that represent drug-protein and gene regulatory networks. We also applied our proposed method to two versions of a chemical-protein association network extracted from the STITCH database, demonstrating the potential of our model in predicting non-reported interactions. Conclusions: Bi-clustering trees outperform existing tree-based strategies as well as machine learning methods based on other algorithms. Since our approach is based on tree ensembles, it inherits the advantages of tree-ensemble learning, such as handling of missing values, scalability, and interpretability.
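
A minimal sketch of the multi-label framing described above, not of bi-clustering trees themselves (which additionally split on both node sets of the heterogeneous network): rows are drugs with feature vectors, and the label matrix marks which proteins each drug interacts with. All data here are synthetic.

```python
# Hedged sketch of the multi-label network-inference formulation using
# scikit-learn's native multi-label support; synthetic stand-in data.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n_drugs, n_proteins, n_feat = 200, 30, 16
X = rng.normal(size=(n_drugs, n_feat))                     # e.g. chemical similarities
Y = (rng.random((n_drugs, n_proteins)) < 0.1).astype(int)  # interaction labels

ert = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, Y)
Y_hat = ert.predict(X[:3])          # predicted interaction rows for 3 drugs
print(Y_hat.shape)                  # (3, 30): one label per candidate protein
```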

  • Research Article
  • Cited by: 12
  • 10.5846/stxb201306031292
Basic principles of the random forest algorithm and its application in ecology: a case study of modelling the distribution of Yunnan pine (Pinus yunnanensis)
  • Jan 1, 2014
  • Acta Ecologica Sinica
  • Zhang Lei + 5 more

Ecological data are often complex. The explanatory and the response variables may be categorical or numerical. The ecological relationships to be characterized are often nonlinear and involve high-order interactions between explanatory variables. Missing values for both response and predictor variables are very common, and outliers almost always exist. Random forest (RF), a novel machine learning technique, is ideally suited for the analysis of complex ecological data. RF predictors are an ensemble-learning approach based on regression or classification trees. Instead of building one classification tree (classifier), the RF algorithm builds multiple classifiers using randomly selected subsets of the observations and random subsets of the predictor variables. The predictions from the ensemble of trees are then averaged in the case of regression trees, or tallied using a voting system for classification trees. RF efficiently supports flexible modelling strategies and is capable of detecting and making use of complex relationships among the variables. RF is among the most accurate of current algorithms and is resistant to overfitting. It also generates an internal unbiased estimate of the generalization error as the forest building progresses. Potential applications of RF in ecology include classification and regression analysis, survival analysis, variable importance estimation, and data proximities. Proximities can be used for clustering, detecting outliers, multi-dimensional scaling, and unsupervised classification. RF can interpolate missing values and maintain high accuracy even when a large proportion of the data are missing. RF can handle thousands of input variables without variable exclusion and runs efficiently on large databases. It can also handle a spectrum of response types, including categorical, numeric, ratings, and survival data. Another advantage of RF is that it requires only two user-defined parameters (the number of trees and the number of randomly selected predictor variables used to split the nodes), which should be optimized to improve predictive accuracy. In recent years, RF has been widely used by ecologists to model complex ecological relationships because it is easy to implement and easy to interpret. To understand and use RF, further information about how it is computed is useful. Here, we summarize the basic principles of RF and show how RF handles complex data by modelling the geographical distribution of Yunnan pine (Pinus yunnanensis) in China. RF is a robust and widely used technique in the field of species distribution modelling (SDM), since it meets the basic needs of SDM: simulating species distributions and identifying their main drivers. In this work, RF showed high predictive performance in simulating the distribution of Yunnan pine, consistent with the multi-dimensional scaling plot showing that the presences could be separated from the absences. We also estimated the relative importance of the predictor variables and produced partial dependence plots for selected predictors in the RF predictions of Yunnan pine presence. The main aim of the article is to familiarize the reader with the general concepts, terminology, and basic principles behind RF. We believe RF will see wider application and development in ecology.
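
A toy sketch of the SDM workflow the abstract walks through, with synthetic environmental layers standing in for the Pinus yunnanensis data: fit an RF on presence/absence records, read the out-of-bag score as the internal generalization-error estimate, and inspect variable importances.

```python
# Hedged sketch; the environmental predictors and presence rule here are
# invented for illustration, not the article's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
env = rng.uniform(size=(1000, 5))     # e.g. temperature, precipitation, ...
presence = ((env[:, 0] > 0.4) & (env[:, 1] < 0.7)).astype(int)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(env, presence)
print(f"OOB accuracy: {rf.oob_score_:.3f}")   # internal error estimate
print("importances:", rf.feature_importances_.round(3))
```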

  • Research Article
  • Cited by: 30
  • 10.1080/01431161.2017.1372863
Enhanced decision tree ensembles for land-cover mapping from fully polarimetric SAR data
  • Aug 31, 2017
  • International Journal of Remote Sensing
  • Iman Khosravi + 3 more

Fully polarimetric synthetic aperture radar (PolSAR) Earth observations have shown great potential for mapping and monitoring agro-environmental systems. Numerous polarimetric features can be extracted from these complex observations, which may improve the accuracy of land-cover classification and object characterization. This article employed two well-known decision tree ensembles, i.e. bagged tree (BT) and random forest (RF), for land-cover mapping from PolSAR imagery. Moreover, two fast modified decision tree ensembles were proposed, namely balanced filter-based forest (BFF) and cost-sensitive filter-based forest (CFF). These algorithms, designed based on the idea of RF, use fast filter feature selection algorithms and two extended majority-voting schemes, and are able to embed solutions to the imbalanced-data problem into their structures. Three different PolSAR datasets, with imbalanced data, were used to evaluate the efficiency of the proposed algorithms. The results indicated that all the tree ensembles have higher efficiency and reliability than an individual decision tree (DT). Moreover, both proposed tree ensembles obtained higher mean overall accuracy (0.5–14% higher), producer’s accuracy (0.5–10% higher), and user’s accuracy (0.5–9% higher) than the classical tree ensembles, i.e. BT and RF. They were also much faster (e.g. 2–10 times) and more stable than their competitors on these three datasets. In addition, unlike BT and RF, which obtained higher accuracy only in large ensembles (i.e. with a high number of DTs), BFF and CFF can also be efficient and reliable in smaller ensembles. Furthermore, the extended majority-voting techniques outperformed classical majority voting for decision fusion.
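
A hedged sketch of the filter-then-forest idea: rank features with a fast univariate filter (mutual information here; the article's exact filter and its extended voting schemes are not reproduced) and train the forest on the top-k features of an imbalanced synthetic dataset.

```python
# Hedged stand-in: mutual information as the fast filter and class
# weighting as a crude imbalance remedy; both are assumptions, not the
# article's BFF/CFF constructions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=40, n_informative=6,
                           weights=[0.85, 0.15], random_state=0)  # imbalanced
clf = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),    # fast univariate filter
    RandomForestClassifier(n_estimators=100, class_weight="balanced",
                           random_state=0),
).fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```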

  • Research Article
  • Cited by: 34
  • 10.1097/tp.0000000000002923
Seeing the Forest for the Trees: Random Forest Models for Predicting Survival in Kidney Transplant Recipients.
  • May 1, 2020
  • Transplantation
  • Ruth Sapir-Pichhadze + 1 more

Risk prediction plays an important role in clinical transplantation research. Traditionally, most risk models have been based on regression models.1 Although useful to help understand relationships between predictors and outcomes, these statistical methods can typically evaluate only a small number of predictors, which are assumed to affect everyone in the same way, and uniformly throughout the participants' lifespan. These methods have several limitations,2 including the inability to analyze nonlinear relationships, the requirement of setting a level of binary significance, impracticality for analyzing large datasets, and vulnerability to bias secondary to variable selection and/or omission of relevant confounders. With the emergence of P4 (Predictive, Preventive, Personalized, and Participatory) and Precision Medicine, artificial intelligence and machine learning methods have come to attention as methods aimed at solving the challenges in analysis not well addressed by regression approaches. Machine learning methods provide algorithms to understand patterns from large, complex, and heterogeneous data.3 Of the machine learning methods, recursive partitioning, and especially random forests, can deal with large numbers of predictor variables even in the presence of complex interactions.2,4 These methods have been applied successfully in genetics, clinical research, and bioinformatics. In this issue of Transplantation, Scheffner et al report on the development and internal validation of a random forest prediction model for patient survival.5 Random forest models are composed of a collection of decision trees. In the process of building each decision tree, different random subsets of the variables from the training dataset are selected to establish how best to partition the dataset at each node.6 Random forest models are considered less vulnerable to overfitting the training dataset given the large number of trees built, making each tree an independent model. The lower likelihood of bias is a result of bootstrapping several trees over randomly selected subsets of variables and subsamples of data.6 Random forest models require little preprocessing of data; the data need not be normalized; and the approach is resilient to outliers. While missing data will be a challenge when trying to draw clinical inferences from standard statistical models, machine learning methods tend to make fewer assumptions about the underlying data and, thus, are less vulnerable to the challenges associated with violation of those assumptions. Relying on fewer assumptions than regression analysis, machine learning methods have been shown to deliver more robust predictions. Scheffner and colleagues5 split a retrospective cohort of kidney transplant recipients with posttransplantation protocol biopsies into training and validation datasets (Figure 2A and B). Using all pretransplant and 3- and 12-months posttransplant variables, the obtained models showed good performance to predict death (concordance index: 0.77–0.78). Validation showed a concordance index of 0.76 and good discrimination of risks by the models, despite substantial differences in clinical variables and the derivation dataset representing an earlier era (2000–2007) than the validation dataset (2008–2013). 
To contrast with outputs of multivariable regression models using the same datasets, see Tables 2 and 3 and nomograms predicting mortality risk using estimators from multivariable Cox models (Figure 3) in Abeling et al.7 Random survival forests also inform on the importance of descriptive variables.6 Scheffner et al found the potentially modifiable (and highly correlated) graft rejection treatment and urinary tract infection to be important predictors of patient survival, in addition to established factors like age, cardiovascular disease, diabetes, and graft function (Figure 3A and B).5 Many of the predictors retained in multivariable regression models7 were also deemed important in random forest survival analyses.5 To validate selected predictors and model construction, it is important to pursue external validation with independent datasets. Random survival forests may complement regression analyses when handling highly correlated, complex survival data. Opportunities for application (and limitations) of regression and random survival forests for prediction are summarized in Table 1 ("Regression and random survival forests for survival analysis"). Predictive models in transplantation and donation help risk-stratify patients and could improve the quality of healthcare delivery as well as patient outcomes. The increasing interest in these tools warrants a better understanding of their challenges and limitations.8 First, highly predictive variables may not necessarily be causally related to the outcomes of interest. Second, the success of machine learning models depends on the relationship between predictors and outcome being represented in training/validation datasets, the number of observations and features, the selection and parameterization of features, and the algorithm chosen for the model. Careful variable definition (eg, urinary tract infection) is necessary. The presence of highly correlated linear and nonlinear relationships between independent variables may warrant mechanisms for removal of the correlated variables. Model performance may also be compromised when studying rare outcomes.4 Inevitably, the generalizability of machine learning models may be limited when the clinical context, local factors (including patient/physician preferences, health systems, and care standards), and therapeutic strategies vary. To enable assessment of model validity, correct interpretation of model outputs, replication, and future knowledge synthesis, it is vital that the transplantation and donation community promote adherence to guidelines on the dissemination and reporting of machine learning models.8,9 Authors should be encouraged to report all model parameters, transformations applied to raw data, sampling methods, and random number generator seeds. Whenever possible, algorithms and associated code should be released in public software archive domains. There is a need for new models of health data ownership with rights to the individual, highly secure data repositories, government legislation for data sharing, and usage policies to ensure privacy and data security. Moreover, with wide uptake of machine learning and artificial intelligence tools, the scale of iatrogenic risks and liabilities related to their application, in contrast to the implications of a single doctor's mistake for a given patient, also warrants assessment.10 Most practice guidelines are geared toward the "average patient."
Machine learning tools can capture the complexity of individual patients' characteristics and aid transplant clinicians with patient-specific care decisions. As these tools become more prevalent, it is important to develop best practice guidelines and ensure there is regulatory oversight on their development and application.

  • Research Article
  • Cited by: 225
  • 10.1007/s41060-018-0144-8
Interpreting tree ensembles with inTrees
  • Jul 11, 2018
  • International Journal of Data Science and Analytics
  • Houtao Deng

Tree ensembles such as random forests and boosted trees are accurate but difficult to understand, debug, and deploy. In this work, we provide the inTrees (interpretable trees) framework that extracts, measures, prunes, and selects rules from a tree ensemble, and calculates frequent variable interactions. A rule-based learner, referred to as the simplified tree ensemble learner (STEL), can also be formed and used for future prediction. The inTrees framework can be applied to both classification and regression problems, and is applicable to many types of tree ensembles, e.g., random forests, regularized random forests, and boosted trees. We implemented the inTrees algorithms in the "inTrees" R package.
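
inTrees itself is an R package, but the rule-extraction step it starts from is easy to mimic in Python: walk one tree of a forest and print each leaf as a conjunction of split conditions, the raw rules that inTrees then measures, prunes, and selects.

```python
# Minimal Python analogue of rule extraction from one ensemble member;
# not the inTrees API, just the underlying idea.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
tree = forest.estimators_[0].tree_

def rules(node=0, conds=()):
    """Print 'condition AND ... => class' for every leaf of the tree."""
    if tree.children_left[node] == -1:                    # leaf node
        print(" AND ".join(conds) or "TRUE", "=>", tree.value[node].argmax())
        return
    f, thr = tree.feature[node], tree.threshold[node]
    rules(tree.children_left[node],  conds + (f"x{f} <= {thr:.2f}",))
    rules(tree.children_right[node], conds + (f"x{f} > {thr:.2f}",))

rules()
```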

  • Book Chapter
  • Cited by: 42
  • 10.1007/978-3-030-19823-7_45
Random Forest Surrogate Models to Support Design Space Exploration in Aerospace Use-Case
  • Jan 1, 2019
  • Siva Krishna Dasari + 2 more

In engineering, design analyses of complex products rely on computer-simulated experiments. However, high-fidelity simulations can take significant time to compute, so it is impractical to explore the design space through simulations alone. Hence, surrogate modelling is used to approximate the original simulations. Since simulations are expensive to conduct, the sample size is generally limited in aerospace engineering applications. This limited sample size, together with the non-linearity and high dimensionality of the data, makes it difficult to generate accurate and robust surrogate models. The aim of this paper is to explore the applicability of Random Forests (RF) for constructing surrogate models to support design space exploration. RF generates meta-models, or ensembles of decision trees, and is capable of fitting highly non-linear data from quite small samples. To investigate this applicability, the paper presents an approach to constructing surrogate models using RF, including hyperparameter tuning to improve model performance and the extraction of design-parameter importances and if-then rules from the RF models for a better understanding of the design space. To demonstrate the approach, quantitative experiments are conducted on datasets from a turbine rear structure use-case in the aerospace industry, and results are presented.
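
A minimal sketch of the surrogate-modelling loop under stated assumptions: a toy function stands in for the expensive simulation, a small training sample mimics the limited-budget setting, and the fitted forest screens many candidate designs cheaply while its importances hint at which design parameters matter.

```python
# Hedged sketch; expensive_simulation is an invented stand-in for a
# high-fidelity solver, not anything from the chapter.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def expensive_simulation(d):                 # stand-in for a CFD/FEM run
    return np.sin(3 * d[:, 0]) + 0.5 * d[:, 1] ** 2

rng = np.random.default_rng(0)
D_train = rng.uniform(size=(60, 4))          # only 60 affordable runs
y_train = expensive_simulation(D_train)

surrogate = RandomForestRegressor(n_estimators=300, random_state=0)
surrogate.fit(D_train, y_train)

D_candidates = rng.uniform(size=(10000, 4))  # explore design space cheaply
best = D_candidates[surrogate.predict(D_candidates).argmax()]
print("promising design:", best.round(3))
print("parameter importances:", surrogate.feature_importances_.round(3))
```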

  • Research Article
  • 10.36871/2618-9976.2024.11.005
Predicting the default probability of Russian high-yield bond issuers in the CBONDS CBI RU HIGH YIELD index using ensemble machine learning methods
  • Jan 1, 2024
  • SOFT MEASUREMENTS AND COMPUTING
  • Alexandra S Pospelova + 3 more

In this scientific work, a detailed analysis of ensemble machine learning methods is carried out to assess the probability of bankruptcy of enterprises. The research focuses on the use of approaches such as Random Forest, Gradient Boosted Trees and Tree Ensemble, which can significantly improve the accuracy of predicting financial insolvency. The effective application of these methods provides investors with a powerful tool for assessing the stability of companies, improving their ability to make informed investment decisions. The implementation of such integrated technologies contributes to improving the stability of financial markets, providing investors with additional opportunities to minimize risks.

  • Research Article
  • Cited by: 9
  • 10.1134/s1064562421040177
Two-Level Regression Method Using Ensembles of Trees with Optimal Divergence
  • Jul 1, 2021
  • Doklady Mathematics
  • Yu I Zhuravlev + 4 more

The article discusses a new two-level regression analysis method in which a corrective procedure is applied to optimal ensembles of regression trees. Optimization is carried out based on the simultaneous achievement of the divergence of the algorithms in the forecast space and a good approximation of the data by individual algorithms of the ensemble. Simple averaging, random regression forest, and gradient boosting are used as corrective procedures. Experiments are presented comparing the proposed method with the standard decision forest and the standard gradient boosting method for decision trees.
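
A hedged two-level sketch of the scheme described: level one fits several deliberately different tree ensembles (varying `max_features` here as a crude proxy for the divergence optimization, which is not reproduced), and level two applies two of the named corrective procedures, simple averaging and gradient boosting, to their out-of-fold forecasts.

```python
# Hedged sketch of the two-level structure only; the optimal-divergence
# criterion itself is not implemented here.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=800, n_features=10, noise=5.0, random_state=0)

# Level 1: three forests with different settings, a crude source of divergence.
base = [RandomForestRegressor(max_features=k, random_state=k) for k in (2, 5, 10)]
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base])

# Level 2: corrective procedures over the level-1 forecasts.
print("averaging MSE:", np.mean((Z.mean(axis=1) - y) ** 2).round(2))
gb = GradientBoostingRegressor(random_state=0).fit(Z, y)
print("boosting MSE :", np.mean((gb.predict(Z) - y) ** 2).round(2))  # in-sample
```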

  • Addendum
  • Cited by: 38
  • 10.1016/j.matpr.2021.01.788
WITHDRAWN: Random forest algorithms for the classification of tree-based ensemble
  • Feb 1, 2021
  • Materials Today: Proceedings
  • R Madana Mohana + 3 more


  • Research Article
  • Cited by: 3
  • 10.1200/jco.2023.41.16_suppl.e13577
Explainable AI and machine learning algorithms to predict treatment failures for patients with cancer.
  • Jun 1, 2023
  • Journal of Clinical Oncology
  • Muddassar Farooq + 1 more

e13577 Background: Cancer patients may undergo lengthy and painful chemotherapy treatments, comprising several successive regimens or plans. Treatment inefficacy and other adverse events can lead to discontinuation (or failure) of these plans, or to prematurely changing them, which results in a significant amount of physical, financial, and emotional toxicity to the patients and their families. In this research work, we build AI-driven treatment failure models that utilize the real-world evidence gathered from patients’ profiles available in an oncology EMR/EHR system, with the goal of predicting the likelihood of a plan being discontinued at the time of its prescription. The selected AI models achieve a prediction accuracy of more than 80% and also provide reasons for their inference. Methods: Inclusion and exclusion criteria: Deidentified and anonymized electronic health records of patients, with their prescribed chemotherapies, for the five primary cancer diagnoses (ICD10 codes C18, C34, C50, C61, and C90) with the highest plan discontinuation rates between the years 2015 and 2022 are analyzed; all patients of other cancer types are excluded. AI models: Unique features that influence treatment failure for each cancer type are engineered using the therapeutic classification of drugs, diagnosis codes, comorbidity scores, and tumor and biomarker information extracted from notes and lab tests. We only use features that are available at the time of selecting a treatment plan. Several machine learning classifiers are investigated, and three tree ensembles - random forests, Xgboost, and boosted forests - are further evaluated on the validation set to fine-tune learning parameters, with the objective of reducing the complexity of the decision trees to provide better interpretability without significantly compromising accuracy. Results: Our pilot studies reveal that boosted forests comprising 5 random forests, each with 5 trees of depth 10, offer the best compromise between performance and interpretability. Once trained, the models are evaluated on unseen datasets, and four performance measures are reported. On average, 15 rules are autonomously generated for a treatment failure inference for each cancer type, and generally 6 of them have a significant support of 30 samples or greater. Conclusions: Machine learning algorithms for predicting the treatment efficacy of chemotherapy regimens by deriving inference from patients’ EMR/EHR data form an emerging yet challenging research domain. Our studies demonstrate that AI models like boosted forests provide the optimal models for the treatment failure use case. In future work, we want to validate the system in controlled clinical trials with the help of oncologists.
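
The reported configuration (5 random forests of 5 trees, depth 10) can be sketched as follows; since the abstract does not specify how the forests are combined into a "boosted forest", this stand-in simply soft-votes their probabilities over synthetic data.

```python
# Hedged sketch: soft voting is an assumption, not the abstract's
# boosting scheme; data are synthetic stand-ins for EMR/EHR features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

forests = [(f"rf{i}", RandomForestClassifier(n_estimators=5, max_depth=10,
                                             random_state=i))
           for i in range(5)]                 # 5 forests x 5 trees, depth 10
model = VotingClassifier(forests, voting="soft").fit(X, y)
print(model.predict_proba(X[:2]).round(2))    # likelihood of discontinuation
```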

  • Research Article
  • Cited by: 2
  • 10.1057/s41599-022-01123-6
How to improve SME performance using iterative random forest in the empirical analysis of institutional complementarity
  • Apr 5, 2022
  • Humanities and Social Sciences Communications
  • Atsushi Sannabe

Empirically investigating the workings of institutional complementarity in organisations has long been a challenge in the social sciences. This paper examines data from the World Management Survey (WMS) using a new machine learning method termed iterative random forest (iRF), originally developed in the field of biostatistics. An empirical study of complementarity was conducted on small and medium-sized enterprises using WMS data. The effects of 18 management quality indicators on profitability, growth, and viability were examined using machine learning methods (i.e. random forest [RF] and iRF). The analysis revealed the relative importance of whether high performers are properly rewarded, whether poor performers are reassigned and retrained, and whether the criteria for high and low performance are well established. Furthermore, the results revealed that the ability to set short-term goals based on a long-term perspective is complementary to many other indicators. These findings are consistent with those of a survey study that examined many empirical studies on the workings of institutional complementarity, indicating that iRF is a credible and promising method for empirical research on institutional complementarity.
