Provable Boolean interaction recovery from tree ensemble obtained via random forests

Abstract

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
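
To make "a linear combination of piecewise constant Boolean interaction terms" concrete, here is a minimal sketch of such a regression function. It is an illustration only, not the paper's code: the function names, the (coefficient, signed-feature list) encoding, the uniform features, and the Gaussian noise are all assumptions made for this example.

```python
import random


def lss_regression_function(x, intercept, interactions):
    """Evaluate a linear combination of Boolean interaction terms at point x.

    interactions is a list of (coefficient, signed_features) pairs, where
    signed_features is a list of (feature_index, threshold, sign) triples.
    sign = +1 requires x[k] > threshold; sign = -1 requires x[k] <= threshold.
    (This encoding is a hypothetical choice for the sketch.)
    """
    value = intercept
    for coefficient, signed_features in interactions:
        # A Boolean interaction is "on" only if every thresholded feature agrees.
        active = all(
            (x[k] > t) if sign > 0 else (x[k] <= t)
            for k, t, sign in signed_features
        )
        if active:
            value += coefficient
    return value


def simulate_lss(n, p, intercept, interactions, noise_sd=0.1, seed=0):
    """Draw n samples with p uniform(0, 1) features and additive Gaussian noise."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [
        lss_regression_function(x, intercept, interactions) + rng.gauss(0.0, noise_sd)
        for x in X
    ]
    return X, y


# Two interactions: (x0 high AND x1 high) contributes 2.0; (x2 low) contributes 1.0.
example_interactions = [
    (2.0, [(0, 0.5, +1), (1, 0.5, +1)]),
    (1.0, [(2, 0.5, -1)]),
]
```

The response jumps by a fixed amount whenever all thresholds of an interaction are crossed, which is the discontinuous, thresholding behavior the abstract refers to.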

Similar Papers
  • Research Article
  • Citations: 4
  • 10.1371/journal.pone.0298906
Learning epistatic polygenic phenotypes with Boolean interactions
  • Apr 16, 2024
  • PLOS ONE
  • Merle Behr + 11 more

Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilistically quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow-up experiments.

  • Research Article
  • Citations: 45
  • 10.1016/j.csbj.2022.06.037
Evaluating the performance of random forest and iterative random forest based methods when applied to gene expression data
  • Jan 1, 2022
  • Computational and Structural Biotechnology Journal
  • Angelica M Walker + 7 more


  • Single Report
  • Citations: 1
  • 10.2172/2472741
FORESTR: Finding, Organizing, Representing, Explaining, Summarizing, and Thinning Random forests
  • Sep 1, 2024
  • Katherine Goode + 1 more

Random forests have become popular models used for data driven predictions. As a result, random forests are currently used or being considered for high-consequence mission applications in national security, such as the prediction of yield from optical signals and malware detection. While random forests may provide accurate predictions, the complexity of the algorithm causes a lack of interpretability. Random forests are an ensemble of regression or decision trees. Individual regression and decision trees are interpretable, but ensembles are inherently difficult to interpret due to the compilation of many models. We aim to increase the interpretability of random forests by finding patterns in the ensemble of trees that can be used to “thin” (or remove) trees. As a starting point, in this report, we develop a new distance metric for quantifying the similarity between trees based on their topologies (i.e., shapes). We base the metric on a novel distance metric for graphs that is a proper mathematical distance, is invariant to transformations, has registration between graphs, and computes topological evolutions between graphs. We use the tree distance metric to compute tree statistics such as a “mean tree” and to identify clusters of trees. We apply the developed methodology to a toy dataset and a mission relevant product inspection dataset to demonstrate how the metric can provide insight into random forests. Furthermore, we discuss the limitations of the approach and ideas for future research into how the metric could be used as a thinning tool to develop less complex models.

  • Research Article
  • Citations: 21
  • 10.1371/journal.pone.0190826
A Kolmogorov-Smirnov test for the molecular clock based on Bayesian ensembles of phylogenies.
  • Jan 4, 2018
  • PLOS ONE
  • Fernando Antoneli + 3 more

Divergence date estimates are central to understanding evolutionary processes and depend, in the case of molecular phylogenies, on tests of molecular clocks. Here we propose two non-parametric tests of strict and relaxed molecular clocks built upon a framework that uses the empirical cumulative distribution (ECD) of branch lengths obtained from an ensemble of Bayesian trees and the well-known non-parametric (one-sample and two-sample) Kolmogorov-Smirnov (KS) goodness-of-fit test. In the strict clock case, the method consists of using the one-sample KS test to directly test whether the phylogeny is clock-like, in other words, whether it follows a Poisson law. The ECD is computed from the discretized branch lengths and the parameter λ of the expected Poisson distribution is calculated as the average branch length over the ensemble of trees. To compensate for the auto-correlation in the ensemble of trees and pseudo-replication, we take advantage of thinning and effective sample size, two features provided by Bayesian inference MCMC samplers. Finally, it is observed that tree topologies with very long or very short branches lead to Poisson mixtures, and in this case we propose the use of the two-sample KS test with samples from two continuous branch length distributions, one obtained from an ensemble of clock-constrained trees and the other from an ensemble of unconstrained trees. Moreover, in this second form the test can also be applied to test for relaxed clock models. The use of a statistically equivalent ensemble of phylogenies to obtain the branch lengths ECD, instead of one consensus tree, yields a considerable reduction of the effects of small sample size and provides a gain of power.

  • Research Article
  • Citations: 336
  • 10.1073/pnas.1711236115
Iterative random forests to discover predictive and stable high-order interactions
  • Jan 19, 2018
  • Proceedings of the National Academy of Sciences of the United States of America
  • Sumanta Basu + 3 more

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as the RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified interesting fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.
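
The stability criterion mentioned above (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be sketched as follows. This is a simplified illustration, not the iRF implementation: the function names, the list-of-lists input format, and the placeholder feature names "A", "B", "C" are assumptions for this example.

```python
from collections import Counter


def stability_scores(replicate_interactions):
    """Return the fraction of bootstrap replicates that recovered each interaction.

    replicate_interactions holds one list per bootstrap replicate; each entry
    is an iterable of feature names forming one candidate interaction.
    """
    n_replicates = len(replicate_interactions)
    counts = Counter()
    for interactions in replicate_interactions:
        # Feature order is irrelevant, and each interaction counts once per replicate.
        for interaction in {frozenset(i) for i in interactions}:
            counts[interaction] += 1
    return {interaction: c / n_replicates for interaction, c in counts.items()}


def stable_interactions(replicate_interactions, threshold=0.5):
    """Keep interactions returned in strictly more than `threshold` of replicates."""
    scores = stability_scores(replicate_interactions)
    return {interaction for interaction, s in scores.items() if s > threshold}


# Hypothetical output of four bootstrap replicates (placeholder feature names).
replicates = [
    [("A", "B")],
    [("B", "A"), ("C", "B")],
    [("A", "B")],
    [("C", "B")],
]
```

In this toy input the pair {A, B} appears in three of four replicates and passes the threshold, while {B, C} appears in exactly half and does not.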

  • Research Article
  • Citations: 116
  • 10.1109/tsg.2010.2052935
Catastrophe Predictors From Ensemble Decision-Tree Learning of Wide-Area Severity Indices
  • Sep 1, 2010
  • IEEE Transactions on Smart Grid
  • Innocent Kamwa + 2 more

Catastrophe precursors are essential prerequisites for response-based remedial action schemes, at both the protective and the operator levels. In this paper, wide-area-severity indices (WASI) derived from PMU measurements serve as the basis for building fast catastrophe predictors using random-forest (RF) learning. Given the randomness in the ensemble of decision trees (DTs) stacked in the RF model, it can provide at the recall stage not only an early assessment of the stable/unstable status of an ongoing contingency but also a probability outcome which quantifies the confidence level of the decision. This methodology, which to the best of our knowledge is new to the dynamic security assessment (DSA) of power systems, is also very effective in evaluating the importance of and interaction among the various WASI input features. Our research unexpectedly showed that the ensemble of trees in the RF is very robust in the presence of small changes in the training data and generalizes across widely different network dynamics. Thus, the same RF performed very well on a large database with more than 60 000 instances from a test system (10%) and an actual system (90%) combined. One such general RF (with 210 trees) boosted the reliability of a 9-cycle catastrophe predictor to 99.9%, compared to only 70% when a single conventionally trained DT is used.

  • Research Article
  • Citations: 12
  • 10.5846/stxb201306031292
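Basic principles of the random forest algorithm and its application in ecology: a case study of modelling the distribution of Yunnan pine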
  • Jan 1, 2014
  • Acta Ecologica Sinica
  • 张雷 Zhang Lei + 5 more

Ecological data are often complex. The explanatory and the response variables may be categorical or numerical. The ecological relationships that need to be defined are often nonlinear and involve high-order interactions between explanatory variables. Missing values for both response and predictor variables are very common, and outliers almost always exist. Random forest (RF), a novel machine learning technique, is ideally suited for the analysis of complex ecological data. RF predictors are an ensemble-learning approach based on regression or classification trees. Instead of building one classification tree (classifier), the RF algorithm builds multiple classifiers using randomly selected subsets of the observations and random subsets of the predictor variables. The predictions from the ensemble of trees are then averaged in the case of regression trees, or tallied using a voting system for classification trees. RF efficiently supports flexible modelling strategies. RF is capable of detecting and making use of more complex relationships among the variables. RF is unexcelled in accuracy among current algorithms and does not overfit. It also generates an internal unbiased estimate of the generalization error as the forest building progresses. Potential applications of RF to ecology include classification and regression analysis, survival analysis, variable importance estimation, and data proximities. Proximities can be used for clustering, detecting outliers, multi-dimensional scaling, and unsupervised classification. RF can impute missing values and maintain high accuracy even when a large proportion of the data are missing. RF can handle thousands of input variables without variable exclusion. It runs efficiently on large databases. RF can also handle a spectrum of response types, including categorical, numeric, ratings, and survival data.
Another advantage of RF is that it requires only two user-defined parameters (the number of trees and the number of randomly selected predictor variables used to split the nodes). These two parameters should be optimized in order to improve predictive accuracy. In recent years, RF has been widely used by ecologists to model complex ecological relationships because it is easy to implement and easy to interpret. To understand and use RF, further information about how it is computed is useful. Here, we summarize the basic principle of RF and show how RF handles complex data by modelling the geographical distribution of Yunnan pine (Pinus yunnanensis) in China. RF is a robust and widely used technique in the field of species distribution modelling (SDM), since it meets the basic needs of SDM: simulating species distribution and identifying the main drivers of species distribution. In this work, RF showed a high predictive performance in simulating the distribution of Yunnan pine, which was consistent with the multi-dimensional scaling plot that showed it was possible to separate the presences from the absences. We also estimated the relative importance of predictor variables and produced partial dependence plots for selected predictor variables for random forest predictions of the presence of Yunnan pine. The main aim of the article is to familiarize the reader with the general concepts, terminology, and basic principle behind RF. We believe RF will see further application and development in ecology.
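
The mechanics described in this abstract (random subsets of observations, random subsets of predictors, then averaging or voting) can be sketched in a few lines. This is a rough illustration under stated assumptions, not any library's implementation: the helper names and the parameter name `mtry` are chosen for this example, and tree fitting itself is omitted.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Sample row indices with replacement: the observations one tree is trained on."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, mtry, rng):
    """Pick mtry distinct candidate predictors to consider at one split."""
    return sorted(rng.sample(range(n_features), mtry))


def aggregate_regression(tree_predictions):
    """Average the per-tree predictions (regression forests)."""
    return sum(tree_predictions) / len(tree_predictions)


def aggregate_classification(tree_predictions):
    """Return the most frequent per-tree label (majority vote)."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

The two user-defined parameters named in the abstract map onto this sketch as the number of times `bootstrap_indices` is called (one per tree) and the value of `mtry`.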

  • Research Article
  • Citations: 30
  • 10.1080/01431161.2017.1372863
Enhanced decision tree ensembles for land-cover mapping from fully polarimetric SAR data
  • Aug 31, 2017
  • International Journal of Remote Sensing
  • Iman Khosravi + 3 more

Fully polarimetric synthetic aperture radar (PolSAR) Earth observations have shown great potential for mapping and monitoring agro-environmental systems. Numerous polarimetric features can be extracted from these complex observations, which may lead to improved accuracy of land-cover classification and object characterization. This article employed two well-known decision tree ensembles, i.e. bagged tree (BT) and random forest (RF), for land-cover mapping from PolSAR imagery. Moreover, two fast modified decision tree ensembles were proposed in this article, namely balanced filter-based forest (BFF) and cost-sensitive filter-based forest (CFF). These algorithms, designed based on the idea of RF, use a fast filter feature selection algorithm and two extended majority voting schemes. They are also able to embed some solutions to the imbalanced data problem into their structures. Three different PolSAR datasets, with imbalanced data, were used for evaluating the efficiency of the proposed algorithms. The results indicated that all the tree ensembles have higher efficiency and reliability than the individual DT. Moreover, both proposed tree ensembles obtained higher mean overall accuracy (0.5–14% higher), producer’s accuracy (0.5–10% higher), and user’s accuracy (0.5–9% higher) than the classical tree ensembles, i.e. BT and RF. They were also much faster (e.g. 2–10 times) and more stable than their competitors for classification of these three datasets. In addition, unlike BT and RF, which obtained higher accuracy in large ensembles (i.e. a high number of DTs), BFF and CFF can also be more efficient and reliable in smaller ensembles. Furthermore, the extended majority voting techniques could outperform the classical majority voting for decision fusion.

  • Research Article
  • Citations: 34
  • 10.1097/tp.0000000000002923
Seeing the Forest for the Trees: Random Forest Models for Predicting Survival in Kidney Transplant Recipients.
  • May 1, 2020
  • Transplantation
  • Ruth Sapir-Pichhadze + 1 more

Risk prediction plays an important role in clinical transplantation research. Traditionally, most risk models have been based on regression models.1 Although useful to help understand relationships between predictors and outcomes, these statistical methods can typically evaluate only a small number of predictors, which are assumed to affect everyone in the same way, and uniformly throughout the participants' lifespan. These methods have several limitations,2 including the inability to analyze nonlinear relationships, the requirement of setting a level of binary significance, impracticality for analyzing large datasets, and vulnerability to bias secondary to variable selection and/or omission of relevant confounders. With the emergence of P4 (Predictive, Preventive, Personalized, and Participatory) and Precision Medicine, artificial intelligence and machine learning methods have come to attention as methods aimed at solving the challenges in analysis not well addressed by regression approaches. Machine learning methods provide algorithms to understand patterns from large, complex, and heterogeneous data.3 Of the machine learning methods, recursive partitioning, and especially random forests, can deal with large numbers of predictor variables even in the presence of complex interactions.2,4 These methods have been applied successfully in genetics, clinical research, and bioinformatics. In this issue of Transplantation, Scheffner et al report on the development and internal validation of a random forest prediction model for patient survival.5 Random forest models are composed of a collection of decision trees. In the process of building each decision tree, different random subsets of the variables from the training dataset are selected to establish how best to partition the dataset at each node.6 Random forest models are considered less vulnerable to overfitting the training dataset given the large number of trees built, making each tree an independent model. 
The lower likelihood of bias is a result of bootstrapping several trees over randomly selected subsets of variables and subsamples of data.6 Random forest models require little preprocessing of data; the data need not be normalized; and the approach is resilient to outliers. While missing data will be a challenge when trying to draw clinical inferences from standard statistical models, machine learning methods tend to make fewer assumptions about the underlying data and, thus, are less vulnerable to the challenges associated with violation of those assumptions. Relying on fewer assumptions than regression analysis, machine learning methods have been shown to deliver more robust predictions. Scheffner and colleagues5 split a retrospective cohort of kidney transplant recipients with posttransplantation protocol biopsies into training and validation datasets (Figure 2A and B). Using all pretransplant and 3- and 12-months posttransplant variables, the obtained models showed good performance to predict death (concordance index: 0.77–0.78). Validation showed a concordance index of 0.76 and good discrimination of risks by the models, despite substantial differences in clinical variables and the derivation dataset representing an earlier era (2000–2007) than the validation dataset (2008–2013). 
To contrast with outputs of multivariable regression models using the same datasets, see Tables 2 and 3 and nomograms predicting mortality risk using estimators from multivariable Cox models (Figure 3) in Abeling et al.7 Random survival forests also inform on the importance of descriptive variables.6 Scheffner et al found the potentially modifiable (and highly correlated) graft rejection treatment and urinary tract infection to be important predictors of patient survival in addition to established factors like age, cardiovascular disease, diabetes, and graft function (Figure 3A and B).5 Many of the predictors retained in multivariable regression models7 were also deemed important in random forest survival analyses.5 To validate selected predictors and model construction, it is important to pursue external validation with independent datasets. Random survival forests may complement regression analyses when handling highly correlated complex survival data. Opportunities for application (and limitations) of each of the regression and random survival forests for prediction are summarized in Table 1 (Regression and random survival forests for survival analysis). Predictive models in transplantation and donation help risk stratify patients and could improve quality of healthcare delivery as well as patient outcomes. The increasing interest in these tools warrants a better understanding of their challenges and limitations.8 First, highly predictive variables may not necessarily be causally related to the outcomes of interest. Second, the success of machine learning models depends on the relationship between predictors and outcome being represented in training/validation datasets, the number of observations and features, selection and parameterization of features, and the algorithm chosen for the model. Careful variable definition (eg, urinary tract infection) is necessary. 
Presence of highly correlated linear and nonlinear relationships between independent variables may warrant mechanisms for removal of the correlated variables. Model performance may also be compromised when studying rare outcomes.4 Inevitably, generalizability of machine learning models may be limited when the clinical context, local factors (including patient/physician preferences, health systems, and care standards), and therapeutic strategies vary. To enable assessment of model validity, correct interpretation of model outputs, replication, and future knowledge synthesis, it is vital that the transplantation and donation community promote adherence to guidelines on the dissemination and reporting of machine learning models.8,9 Authors should be encouraged to report all model parameters, transformations applied to raw data, sampling methods, and random number generator seeds. Whenever possible, algorithms and associated code should be released in public software archive domains. There is a need for new models of health data ownership with rights to the individual, highly secure data repositories, government legislation for data sharing, and usage policies to ensure privacy and data security. Moreover, with wide uptake of machine learning and artificial intelligence tools, the scale of iatrogenic risks and liabilities related to their application, in contrast to the implications of a single doctor's mistake for a given patient, also warrant assessment.10 Most practice guidelines are geared toward the "average patient." Machine learning tools can capture the complexity of individual patients' characteristics and aid transplant clinicians with patient-specific care decisions. As these tools become more prevalent, it is important to develop best practice guidelines and ensure there is regulatory oversight on their development and application.

  • Research Article
  • Citations: 227
  • 10.1007/s41060-018-0144-8
Interpreting tree ensembles with inTrees
  • Jul 11, 2018
  • International Journal of Data Science and Analytics
  • Houtao Deng

Tree ensembles such as random forests and boosted trees are accurate but difficult to understand, debug and deploy. In this work, we provide the inTrees (interpretable trees) framework that extracts, measures, prunes and selects rules from a tree ensemble, and calculates frequent variable interactions. A rule-based learner, referred to as the simplified tree ensemble learner (STEL), can also be formed and used for future prediction. The inTrees framework can be applied to both classification and regression problems, and is applicable to many types of tree ensembles, e.g., random forests, regularized random forests, and boosted trees. We implemented the inTrees algorithms in the "inTrees" R package.
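
As a rough illustration of what "extracting rules" and "calculating frequent variable interactions" can mean for a single tree, the sketch below walks a toy tree and counts co-occurring features. It does not reproduce the inTrees R package or its API: the nested-dict tree format, the function names, and the toy tree are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def extract_rules(node, conditions=()):
    """Return one (conditions, prediction) rule per leaf of a nested-dict tree.

    Internal nodes have keys "feature", "threshold", "left", "right";
    leaves have a single key "prediction". (Hypothetical format.)
    """
    if "prediction" in node:
        return [(list(conditions), node["prediction"])]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + ((feature, "<=", threshold),))
    right = extract_rules(node["right"], conditions + ((feature, ">", threshold),))
    return left + right


def pair_frequencies(rules):
    """Count how often each pair of features occurs together in a rule."""
    counts = Counter()
    for conditions, _prediction in rules:
        features = sorted({feature for feature, _op, _thr in conditions})
        counts.update(combinations(features, 2))
    return counts


# Toy tree: split on x1 first, then on x2 in the right branch.
toy_tree = {
    "feature": "x1",
    "threshold": 0.5,
    "left": {"prediction": 0},
    "right": {
        "feature": "x2",
        "threshold": 1.0,
        "left": {"prediction": 0},
        "right": {"prediction": 1},
    },
}
```

Applied to every tree of an ensemble, the rule lists can be concatenated before counting, so that pairs appearing on many paths stand out as candidate interactions.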

  • Book Chapter
  • Citations: 43
  • 10.1007/978-3-030-19823-7_45
Random Forest Surrogate Models to Support Design Space Exploration in Aerospace Use-Case
  • Jan 1, 2019
  • Siva Krishna Dasari + 2 more

In engineering, design analyses of complex products rely on computer simulated experiments. However, high-fidelity simulations can take significant time to compute. It is impractical to explore design space by only conducting simulations because of time constraints. Hence, surrogate modelling is used to approximate the original simulations. Since simulations are expensive to conduct, generally, the sample size is limited in aerospace engineering applications. This limited sample size, and also non-linearity and high dimensionality of data make it difficult to generate accurate and robust surrogate models. The aim of this paper is to explore the applicability of Random Forests (RF) to construct surrogate models to support design space exploration. RF generates meta-models or ensembles of decision trees, and it is capable of fitting highly non-linear data given quite small samples. To investigate the applicability of RF, this paper presents an approach to construct surrogate models using RF. This approach includes hyperparameter tuning to improve the performance of the RF’s model, to extract design parameters’ importance and if-then rules from the RF’s models for better understanding of design space. To demonstrate the approach using RF, quantitative experiments are conducted with datasets of Turbine Rear Structure use-case from an aerospace industry and results are presented.

  • Research Article
  • 10.36871/2618-9976.2024.11.005
Forecasting the probability of default of Russian companies issuing high-yield bonds from the Cbonds CBI RU High Yield index using ensemble machine learning methods
  • Jan 1, 2024
  • SOFT MEASUREMENTS AND COMPUTING
  • Alexandra S Pospelova + 3 more

In this scientific work, a detailed analysis of ensemble machine learning methods is carried out to assess the probability of bankruptcy of enterprises. The research focuses on the use of approaches such as Random Forest, Gradient Boosted Trees and Tree Ensemble, which can significantly improve the accuracy of predicting financial insolvency. The effective application of these methods provides investors with a powerful tool for assessing the stability of companies, improving their ability to make informed investment decisions. The implementation of such integrated technologies contributes to improving the stability of financial markets, providing investors with additional opportunities to minimize risks.

  • Research Article
  • Citations: 10
  • 10.1134/s1064562421040177
Two-Level Regression Method Using Ensembles of Trees with Optimal Divergence
  • Jul 1, 2021
  • Doklady Mathematics
  • Yu I Zhuravlev + 4 more

The article discusses a new two-level regression analysis method in which a corrective procedure is applied to optimal ensembles of regression trees. Optimization is carried out based on the simultaneous achievement of the divergence of the algorithms in the forecast space and a good approximation of the data by individual algorithms of the ensemble. Simple averaging, random regression forest, and gradient boosting are used as corrective procedures. Experiments are presented comparing the proposed method with the standard decision forest and the standard gradient boosting method for decision trees.

  • Addendum
  • Citations: 38
  • 10.1016/j.matpr.2021.01.788
WITHDRAWN: Random forest algorithms for the classification of tree-based ensemble
  • Feb 1, 2021
  • Materials Today: Proceedings
  • R Madana Mohana + 3 more


  • Research Article
  • Citations: 3
  • 10.1200/jco.2023.41.16_suppl.e13577
Explainable AI and machine learning algorithms to predict treatment failures for patients with cancer.
  • Jun 1, 2023
  • Journal of Clinical Oncology
  • Muddassar Farooq + 1 more

e13577 Background: Cancer patients may undergo lengthy and painful chemotherapy treatments, comprising several successive regimens or plans. Treatment inefficacy and other adverse events can lead to discontinuation (or failure) of these plans, or to changing them prematurely, which results in a significant amount of physical, financial, and emotional toxicity to the patients and their families. In this research work, we build AI-driven treatment failure models that utilize the real-world evidence gathered from patients’ profiles available in an oncology EMR/EHR system, with the goal of predicting the likelihood of a plan being discontinued at the time of its prescription. The selected AI models achieve a prediction accuracy of more than 80% and also provide reasons for their inference. Methods: Inclusion and Exclusion Criteria: Deidentified and anonymized electronic health records of patients, with their prescribed chemotherapies, for five different primary cancer diagnoses - ICD10 codes C18, C34, C50, C61 and C90 - that have the highest plan discontinuation rates between the years 2015 and 2022 are analyzed. All patients with other cancer types are excluded. AI Models: Unique features that influence treatment failure are engineered for each cancer type by using therapeutic classification of drugs, diagnosis codes, comorbidity scores, and tumor and biomarker information that is extracted from the notes and lab tests. We only use features that are available at the time of selecting a treatment plan. Several machine learning classifiers are investigated, and three tree ensembles - random forests, XGBoost and boosted forests - are further evaluated on the validation set to fine-tune learning parameters with the objective of reducing the complexity of decision trees, providing better interpretability without significantly compromising accuracy. 
Results: Our pilot studies reveal that boosted forests comprising 5 random forests, each with 5 trees of depth 10, offer the best compromise between performance and interpretability. The models, once trained, are evaluated on unseen datasets and four performance measures of AI models are reported. On average, 15 rules are autonomously generated for a treatment failure inference for each cancer type, and generally 6 of them have a significant support of 30 samples or greater. Conclusions: Using machine learning algorithms to predict the treatment efficacy of chemotherapy regimens by deriving inference from patients’ EMR/EHR data is an emerging yet challenging research domain. Our studies demonstrate that AI models like boosted forests provide the optimal models for the treatment failure use case. In the future, we want to validate the system in controlled clinical trials with the help of oncologists.
