Cancer is the second leading cause of disease-related death worldwide, and machine learning-based identification of novel biomarkers is crucial for improving early detection and treatment of various cancers. A key challenge in applying machine learning to high-dimensional data is deriving important features in an interpretable manner to provide meaningful insights into the underlying biological mechanisms.

We developed a class-based directional feature importance (CLIFI) metric for decision tree methods and demonstrated its use on proteomics data from The Cancer Genome Atlas. The CLIFI metric was incorporated into four algorithms: Random Forest (RF), LAtent VAriable Stochastic Ensemble of Trees (LAVASET), Gradient Boosted Decision Trees (GBDTs), and LAVABOOST, a new extension that incorporates the LAVA step into GBDTs. Both LAVA methods incorporate topological information from protein interactions into the decision function.

In classifying 28 cancers, the models achieved F1-scores of 92.6% (RF), 92.0% (LAVASET), 89.3% (LAVABOOST) and 85.7% (GBDT), with no method outperforming all others for individual cancer type prediction. The CLIFI metric enables visualisation of the model's decision-making functions. The resulting CLIFI value distributions indicated heterogeneity in the expression of several proteins (MYH11, ERα, BCL2) across different cancer types (including brain glioma, breast, kidney, thyroid and prostate cancer), consistent with the original raw expression data.

In conclusion, we have developed an integrated, directional feature importance metric for multi-class decision tree-based classification models that facilitates interpretable feature importance assessment. The CLIFI metric can be combined with the incorporation of topological information into the models' decision functions, which introduces inductive bias and enhances interpretability.