Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of features (p) compared to the size of samples (n). Though the predictive accuracy using RF is often high, there are some problems when selecting important genes using RF. The important genes selected by RF are usually scattered on the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF) for identifying highly connected important features by involving the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs and achieve equivalent classification accuracy to RF. To evaluate the capability of our proposed method, we conducted simulation experiments and applied the method to two real datasets-non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas, and human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures.
Read full abstract