Abstract

When prediction models are built from many features and relatively few samples, feature selection methods often overfit the training data and select irrelevant features. One way to potentially mitigate overfitting is to incorporate domain knowledge during feature selection. Here, a feature ranking algorithm called Family Rank is presented, in which features are ranked based on a combination of graphical domain knowledge and feature scores computed from empirical data. A simulated dataset is used to demonstrate a scenario in which Family Rank outperforms other state-of-the-art graph-based ranking algorithms, decreasing the sample size needed to detect true predictors by 2- to 3-fold. An example from oncology is then used to explore a real-world application of Family Rank. An implementation of Family Rank is freely available at https://cran.r-project.org/package=FamilyRank. Supplementary data are available at Bioinformatics online.

