A new approach for interpreting Random Forest models and its application to the biology of ageing.

Fabio Fabris, Daniel Palmer, Alex A Freitas, João Pedro De Magalhães, Aoife Doherty

doi:10.1093/bioinformatics/bty087

Open Access

https://doi.org/10.1093/bioinformatics/bty087

Journal: Bioinformatics	Publication Date: Feb 16, 2018
Citations: 49	License type: CC BY 4.0

Affiliation: University of Kent, University of Liverpool

Abstract

Motivation: This work uses the Random Forest (RF) classification algorithm to predict if a gene is over-expressed, under-expressed or has no change in expression with age in the brain. RFs have high predictive power, and RF models can be interpreted using a feature (variable) importance measure. However, current feature importance measures evaluate a feature as a whole (all feature values). We show that, for a popular type of biological data (Gene Ontology-based), usually only one value of a feature is particularly important for classification and the interpretation of the RF model. Hence, we propose a new algorithm for identifying the most important and most informative feature values in an RF model.

Results: The new feature importance measure identified highly relevant Gene Ontology terms for the aforementioned gene classification task, producing a feature ranking that is much more informative to biologists than an alternative, state-of-the-art feature importance measure.

Availability and implementation: The dataset and source codes used in this paper are available as ‘Supplementary Material’ and the description of the data can be found at: https://fabiofabris.github.io/bioinfo2018/web/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

In this work, we focus on predicting genes with altered expression with age in the brain
For a popular type of biological data (Gene Ontology-based), usually only one value of a feature is important for classification and the interpretation of the Random Forest (RF) model
Existing measures of feature importance for RFs do not differentiate between positive and negative feature values
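
To make the last two highlights concrete, the following is a minimal sketch on toy data, not the algorithm proposed in the paper. It assumes a single binary feature (1 = gene annotated with a hypothetical Gene Ontology term, 0 = not annotated) and an invented helper, class_distribution_by_value, that reports the class distribution separately for each feature value.

```python
from collections import Counter


def class_distribution_by_value(feature_values, labels):
    """Return {feature value: {class label: relative frequency}} for one feature."""
    counts = {}
    for value, label in zip(feature_values, labels):
        counts.setdefault(value, Counter())[label] += 1
    return {
        value: {label: n / sum(counter.values()) for label, n in counter.items()}
        for value, counter in counts.items()
    }


# Toy data (invented for illustration): annotation with a hypothetical GO term
# and the expression class of each gene.
go_term = [1, 1, 1, 0, 0, 0, 0, 0, 0]
expression = ["over", "over", "over", "over", "under", "none", "under", "none", "over"]

dist = class_distribution_by_value(go_term, expression)
for value in sorted(dist, reverse=True):
    print(value, dist[value])
```

In this toy example every gene with the positive value (1) is over-expressed, while genes with the negative value (0) are spread evenly over the three classes. A measure that scores the feature as a whole would not show that only the positive value is informative.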

Summary

Introduction

We focus on predicting genes with altered expression with age in the brain. The RF algorithm is very popular in machine learning and bioinformatics (Touw et al., 2013) due to its high predictive accuracy and its variable importance measures (VIMs). These measures allow us to identify the most important variables for classification in the model (a set of partly random decision trees) built by the RF algorithm. During its training phase, the RF algorithm builds nTree (a parameter) Random Trees (RTs). The training set is randomized in two ways for each RT. First, the training set is re-sampled with replacement, maintaining the original size of the dataset. Second, at each tree node only a randomly chosen subset of the features is considered as candidates for splitting the instances. The algorithm then recurses on each resulting instance subset until a stopping criterion is met.
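
The re-sampling step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names, the toy training set and the seed are assumptions, and growing the trees themselves is omitted.

```python
import random


def bootstrap_sample(instances, rng):
    """Re-sample the training set with replacement, keeping its original size."""
    n = len(instances)
    return [instances[rng.randrange(n)] for _ in range(n)]


def draw_bootstrap_samples(instances, n_tree, seed=0):
    """Draw one bootstrap sample per Random Tree (n_tree plays the role of nTree)."""
    rng = random.Random(seed)
    return [bootstrap_sample(instances, rng) for _ in range(n_tree)]


# Toy training set (invented): (gene identifier, expression class) pairs.
training_set = [
    ("gene_a", "over"),
    ("gene_b", "under"),
    ("gene_c", "none"),
    ("gene_d", "over"),
]

samples = draw_bootstrap_samples(training_set, n_tree=5, seed=1)
for i, sample in enumerate(samples):
    print(i, [gene for gene, _ in sample])
```

Each sample has the same size as the original training set, but because instances are drawn with replacement, some genes typically appear more than once and others not at all. This is what makes each tree see a slightly different dataset.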
