Abstract

Background: Discovering relevant features (biomarkers) that discriminate etiologies of a disease provides biomedical researchers with candidate targets for further laboratory experimentation while saving costs. Dependencies among biomarkers may suggest additional valuable information, for example, to characterize complex epistatic relationships from genetic data. The use of classifiers to guide the search for biomarkers (the so-called wrapper approach) has been widely studied. However, searching simultaneously for relevancy and for dependencies among markers has been explored far less.

Results: We propose a new wrapper method that builds upon the discrimination power of a weighted kernel classifier to guide the search for a probabilistic model of simultaneous marginal and interacting effects. The feasibility of the method was evaluated in three empirical studies. The first assessed its ability to discover complex epistatic effects on a large-scale testbed of generated human genetic problems; the method succeeded in 4 out of 5 of these problems while providing more accurate and expressive results than a baseline technique that also considers dependencies. The second evaluated the performance of the method in benchmark classification tasks; on average, the prediction accuracy was comparable to that of two other baseline techniques, whilst the method found smaller subsets of relevant features. The last study aimed at discovering relevancy and dependency in a hepatitis dataset; evidence recently reported in the medical literature corroborated our findings. As a byproduct, the method was implemented and made freely available as a toolbox of software components deployed within an existing visual data-mining workbench.

Conclusions: The mining advantages exhibited by the method come at the expense of a higher computational complexity, posing interesting algorithmic challenges regarding its applicability to large-scale datasets. Extending the probabilistic assumptions of the method to continuous distributions and higher-degree interactions is also appealing. As a final remark, we advocate broadening the use of visual graphical software tools, as they enable biodata researchers to focus on experiment design, visualisation and data analysis rather than on refining their scripting skills.

Highlights

  • Discovering relevant features that discriminate etiologies of a disease is useful to provide biomedical researchers with candidate targets for further laboratory experimentation while saving costs; dependencies among biomarkers may suggest additional valuable information, for example, to characterize complex epistatic relationships from genetic data

  • One of the main challenges in processing such huge amounts of data is to discover, among the many observed variables, those that are most relevant to explaining significant patterns (or markers) of hidden concepts. This task is known as feature selection in the data-mining community and as biomarker discovery in the biomedical field

  • Here we describe a novel method that models relevancy and dependency by coupling a weighted kernel machine for pattern classification [12] with a probabilistic genetic algorithm [13] for dependency estimation


Summary

Introduction

Discovering relevant features (biomarkers) that discriminate etiologies of a disease provides biomedical researchers with candidate targets for further laboratory experimentation while saving costs. Dependencies among biomarkers may suggest additional valuable information, for example, to characterize complex epistatic relationships from genetic data. One of the main challenges in processing such huge amounts of data is to discover, among the many observed variables (known as features), those that are most relevant to explaining significant patterns (or markers) of hidden concepts. This task is known as feature selection in the data-mining community and as biomarker discovery in the biomedical field. The selected features may become targets of more detailed studies that require expensive experimentation or human expertise, saving the cost and time that would otherwise be spent on the discarded variables. Selecting the relevant variables can be regarded as a search over the space of all possible variable subsets, an NP-hard problem [1]; finding the underlying structure of a graph that represents dependencies between those variables is also combinatorial [2]. Approximate, iterative methods are therefore an alternative for finding suitable solutions.
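
To make the idea of a classifier-guided (wrapper) search over feature subsets concrete, the following is a minimal sketch under strong simplifying assumptions and is not the method proposed in the paper. A nearest-centroid classifier stands in for the weighted kernel classifier, and a vector of independent per-feature selection probabilities stands in for the probabilistic model of dependencies. The data are synthetic, and all function names, parameters and the size penalty are hypothetical choices made for illustration.

```python
import random


def make_data(n=300, n_features=8, relevant=(0, 3), shift=1.5, seed=0):
    """Synthetic two-class data in which only `relevant` features carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are balanced
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in relevant:
            row[j] += shift if label else -shift
        X.append(row)
        y.append(label)
    return X, y


def centroid_accuracy(X_tr, y_tr, X_te, y_te, mask):
    """Hold-out accuracy of a nearest-centroid classifier on the selected features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_tr, y_tr) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X_te, y_te):
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(cols, mu))
                for c, mu in centroids.items()}
        if min(dist, key=dist.get) == t:
            correct += 1
    return correct / len(y_te)


def umda_feature_search(X, y, generations=15, pop_size=30, elite=10,
                        penalty=0.01, seed=1):
    """Wrapper search: sample feature masks, score them with the classifier,
    and re-estimate per-feature selection probabilities from the best masks."""
    rng = random.Random(seed)
    n_features = len(X[0])
    split = 2 * len(X) // 3
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    probs = [0.5] * n_features  # initial model: every feature equally likely
    best_mask, best_fit = None, float("-inf")
    for _ in range(generations):
        scored = []
        for _ in range(pop_size):
            mask = [int(rng.random() < p) for p in probs]
            # Fitness rewards accuracy and mildly penalises subset size.
            fit = centroid_accuracy(X_tr, y_tr, X_te, y_te, mask) - penalty * sum(mask)
            scored.append((fit, mask))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best_mask = scored[0]
        top = [mask for _, mask in scored[:elite]]
        # Clamp probabilities so that no feature is ruled in or out permanently.
        probs = [min(0.95, max(0.05, sum(m[j] for m in top) / elite))
                 for j in range(n_features)]
    return best_mask, best_fit, probs


if __name__ == "__main__":
    X, y = make_data()
    mask, fitness, probs = umda_feature_search(X, y)
    print("selected features:", [j for j, bit in enumerate(mask) if bit])
    print("penalised fitness:", round(fitness, 3))
    print("selection probabilities:", [round(p, 2) for p in probs])
```

The sketch replaces exhaustive enumeration of all subsets with repeated sampling from a probability model that is updated toward well-scoring subsets, which is the sense in which such methods are approximate and iterative. Because each feature is modelled independently here, the sketch cannot represent dependencies among features; capturing those requires a richer model than a vector of marginals.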

