Abstract
Background: Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. This is essentially because high correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal information into gene selection by moving from a single-input single-output to a multiple-input multiple-output setting.
Results: We show in a synthetic case study that a better prioritization of causal variables can be obtained by considering a relevance score which incorporates a causal term. In addition, we show, in a meta-analysis of six publicly available breast cancer microarray datasets, that the improvement also occurs in terms of accuracy. The biological interpretation of the results confirms the potential of a causal approach to gene selection.
Conclusions: Integrating causal information into gene selection algorithms is effective both in terms of prediction accuracy and biological interpretation.
Highlights
Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations
The approach is assessed in two settings: the first is based on a number of synthetic datasets generated by simulating a causal Bayesian network, while the second relies on public microarray breast cancer datasets to assess the approach on real data
Note that this causal structure aims to represent, in a very simplified manner, a stochastic dependency characterized by a number of indirect and direct causes, a latent non-measurable variable, one observable primary target, two secondary targets, a set of additional effects and a number of independent and irrelevant variables
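To make the kind of structure described above concrete, here is a minimal sketch of a toy simulation. It is not the network used in the study: the linear-Gaussian form, every coefficient, the variable names and the choice to make the secondary targets children of the latent variable are all assumptions made purely for illustration. The sketch ranks the observed variables by absolute Pearson correlation with the primary target, a relevance-only score, to show how an effect of the target can outrank its direct causes.

```python
import random
from math import sqrt


def simulate(n=2000, seed=0):
    """Draw n samples from a hypothetical linear-Gaussian causal structure."""
    rng = random.Random(seed)
    names = ("indirect_cause", "direct_cause_1", "direct_cause_2", "target",
             "secondary_1", "secondary_2", "effect", "irrelevant")
    data = {name: [] for name in names}
    for _ in range(n):
        indirect = rng.gauss(0, 1)
        direct_1 = indirect + rng.gauss(0, 1)   # indirect cause -> direct cause
        direct_2 = rng.gauss(0, 1)
        latent = rng.gauss(0, 1)                # latent variable, never recorded
        target = direct_1 + direct_2 + latent + rng.gauss(0, 0.5)
        secondary_1 = latent + rng.gauss(0, 1)  # assumed: driven by the latent
        secondary_2 = latent + rng.gauss(0, 1)  # assumed: driven by the latent
        effect = target + rng.gauss(0, 0.3)     # downstream effect of the target
        irrelevant = rng.gauss(0, 1)            # independent of everything else
        row = (indirect, direct_1, direct_2, target,
               secondary_1, secondary_2, effect, irrelevant)
        for name, value in zip(names, row):
            data[name].append(value)
    return data


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


data = simulate()
target = data.pop("target")

# Relevance-only score: absolute correlation of each observed variable with the target.
scores = {name: abs(pearson(column, target)) for name, column in data.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name:16s} {scores[name]:.3f}")
```

In this toy setting the effect variable is almost a noisy copy of the target, so a correlation-based ranking places it ahead of both direct causes. This is the behaviour that motivates adding a causal term to the relevance score.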
Summary
Traditional strategies for selecting variables in high dimensional classification problems aim to find sets of maximally relevant variables able to explain the target variations. Although these techniques may be effective in terms of generalization accuracy, they often do not reveal direct causes. A drawback of ranking-based selection is that it relies on univariate terms and therefore cannot take into consideration higher-order interaction terms or redundancy between features [2]. Another limitation is that ranking techniques are not able to distinguish between causes and mechanisms associated with disease and appropriate therapeutic targets.