Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Maria Del Mar Muñiz Moreno, Yann Herault, Claire Gavériaux-Ruff

doi:10.1186/s12859-022-05111-0

Abstract

Background: In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, responses to treatment, or sexual dimorphism. However, the data often suffer from a low number of samples, a high number of variables, or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test, so correlations should be assessed and a more complex statistical framework is needed for the analysis. Packages that provide the relevant analysis tools already exist, but they are not available together, which makes choosing and implementing a method difficult for non-statisticians.

Results: We present Gdaphen, a fast joint pipeline that identifies the most important qualitative and quantitative predictor variables for discriminating between genotypes, treatments, or sex. Gdaphen takes behavioral or clinical data as input and uses a Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or to anonymize genotype-based recordings. As optimized input, Gdaphen uses the non-correlated variables that show a correlation of 30% or higher on the MFA-Principal Component Analysis (PCA), which increases the discriminative power and the efficiency of the classifier's predictive model. Gdaphen can determine the strongest variables that predict gene dosage effects using General Linear Model (GLM)-based classifiers, or determine the most discriminative non-linearly distributed variables using a Random Forest (RF) implementation. Moreover, Gdaphen reports the efficacy of each classifier and offers several visualization options that support the results as easily readable plots ready to be included in publications. We demonstrate Gdaphen's capabilities on several datasets and provide easy-to-follow vignettes.

Conclusions: Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers by providing an integrated framework to perform: (1) pre-processing steps such as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state-of-the-art visualizations ready for publication to support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen, together with vignettes, documentation for the functions, and examples to guide users in their own implementation.
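The variable pre-selection described in the abstract can be illustrated with a small sketch. This is not Gdaphen's code or API (Gdaphen is an R package); it is a standard-library Python illustration under stated assumptions. The function names, the toy data, and the 0.75 pairwise cutoff are hypothetical. The second filter reads the 30% criterion as a minimum absolute correlation with a per-individual component score, which is an interpretation of the abstract, and the score is supplied by hand rather than computed by an MFA.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation signal
    return sxy / sqrt(sxx * syy)


def drop_correlated(data, max_abs_r=0.75):
    """Greedy filter: keep a variable only if its |r| with every
    already-kept variable is below max_abs_r (hypothetical cutoff)."""
    kept = []
    for name in data:
        if all(abs(pearson(data[name], data[k])) < max_abs_r for k in kept):
            kept.append(name)
    return kept


def keep_informative(data, score, min_abs_r=0.30):
    """Keep variables whose |r| with a component score is at least
    min_abs_r (assumed reading of the 30% criterion)."""
    return [n for n in data if abs(pearson(data[n], score)) >= min_abs_r]


# Toy data: three hypothetical behavioral variables for six individuals.
data = {
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "speed": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],  # almost proportional to distance
    "rearing": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
# Hand-written stand-in for coordinates on one MFA/PCA dimension.
dim1 = [-2.4, -1.6, -0.4, 0.6, 1.4, 2.4]

non_redundant = drop_correlated(data)
informative = keep_informative(data, dim1)
print(non_redundant)
print(informative)
```

In this toy case, the first filter drops `speed` because it nearly duplicates `distance`, and the second drops `rearing` because it barely tracks the supplied component score. A real analysis would take the scores from an actual MFA and would pass the retained variables to the classifiers.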

Journal: BMC Bioinformatics	Publication Date: Jan 26, 2023
Citations: 2	License type: open-access
