Abstract
Motivation: This work uses the Random Forest (RF) classification algorithm to predict if a gene is over-expressed, under-expressed or has no change in expression with age in the brain. RFs have high predictive power, and RF models can be interpreted using a feature (variable) importance measure. However, current feature importance measures evaluate a feature as a whole (all feature values). We show that, for a popular type of biological data (Gene Ontology-based), usually only one value of a feature is particularly important for classification and the interpretation of the RF model. Hence, we propose a new algorithm for identifying the most important and most informative feature values in an RF model.
Results: The new feature importance measure identified highly relevant Gene Ontology terms for the aforementioned gene classification task, producing a feature ranking that is much more informative to biologists than an alternative, state-of-the-art feature importance measure.
Availability and implementation: The dataset and source codes used in this paper are available as ‘Supplementary Material’ and the description of the data can be found at: https://fabiofabris.github.io/bioinfo2018/web/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
In this work, we focus on predicting genes with altered expression with age in the brain
For a popular type of biological data (Gene Ontology-based), usually only one value of a feature is important for classification and the interpretation of the Random Forest (RF) model
Existing measures of feature importance for RFs do not differentiate between positive and negative feature values
Summary
We focus on predicting genes with altered expression with age in the brain. The RF algorithm is very popular in machine learning and bioinformatics (Touw et al., 2013) due to its high predictive accuracy and its support for variable importance measures (VIMs). These measures allow us to identify the most important variables for classification in the model (a set of partly random decision trees) built by the RF algorithm. During its training phase, the RF algorithm builds nTree (a parameter) Random Trees (RTs). This involves randomizing the training set in two ways for each RT. First, the training set is re-sampled with replacement, maintaining the original size of the dataset. Second, in the standard RF algorithm, only a random subset of the features is considered as split candidates at each tree node. Each split partitions the instances into subsets, and the algorithm recurses in each instance subset until a stopping criterion is met.
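The per-tree randomization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `m_try` parameter and the toy instances are all hypothetical. The summary spells out only the bootstrap step, so the per-node feature subset shown here is an assumption based on the standard RF algorithm.

```python
import random


def bootstrap_sample(instances, rng):
    """Re-sample the training set with replacement, keeping its original size."""
    n = len(instances)
    return [instances[rng.randrange(n)] for _ in range(n)]


def candidate_features(n_features, m_try, rng):
    """Pick m_try distinct feature indices as split candidates for one node.

    Assumption: this is the standard RF per-node feature subsampling; the
    summary text itself only spells out the bootstrap step.
    """
    return sorted(rng.sample(range(n_features), m_try))


def draw_training_sets(instances, n_tree, seed=0):
    """Draw one bootstrap sample per Random Tree (n_tree samples in total)."""
    rng = random.Random(seed)
    return [bootstrap_sample(instances, rng) for _ in range(n_tree)]


# Invented toy data: (binary GO-term feature vector, expression class).
toy = [
    ((1, 0, 0), "over"),
    ((0, 1, 0), "under"),
    ((0, 0, 1), "none"),
    ((1, 1, 0), "over"),
]

samples = draw_training_sets(toy, n_tree=3)
node_features = candidate_features(n_features=3, m_try=2, rng=random.Random(1))
```

Each element of `samples` has as many instances as `toy`, typically with some instances repeated and others left out, which is what makes the trees differ from one another.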