Abstract

Disease association studies aim to find the genetic variations underlying complex human diseases in order to better understand the etiology of those diseases and to provide better diagnosis, treatment, and even prevention. Non-linear interactions among multiple genetic factors play an important role in finding those genetic variations, but have not always been fully taken into account. This is because searching over combinations of interacting genetic factors becomes prohibitive, as the search space grows exponentially with the size of the data. The problem is especially challenging for genome-wide association studies (GWAS), where typically more than a million single-nucleotide polymorphisms (SNPs) are under consideration. Dimensionality reduction is thus needed so that only the subset of genetic attributes most likely to have interaction effects is investigated. In this article, we conduct a comprehensive study of six feature selection methods widely used in machine learning, examining how well they filter interacting SNPs rather than SNPs with strong individual main effects. The six methods are chi-square, logistic regression, odds ratio, and three Relief-based algorithms. By applying all six feature selection methods to both a simulated and a real GWAS dataset, we report that Relief-based methods perform best at filtering SNPs that are associated with a disease through strong interaction effects.
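The contrast between main-effect filters and Relief-based filters can be illustrated with a minimal sketch. It is not the implementation used in this study. It assumes a hypothetical toy dataset with three binary-coded SNPs, in which disease status is the exclusive-or of the first two SNPs and the third is noise. The function names `chi_square` and `relief_weights` are illustrative, and the Relief variant shown is simplified: it averages over tied nearest neighbours instead of sampling instances.

```python
from itertools import product


def chi_square(xs, ys):
    """Pearson chi-square statistic for a binary feature against a binary label."""
    n = len(xs)
    stat = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            observed = sum(1 for x, y in zip(xs, ys) if x == xv and y == yv)
            expected = xs.count(xv) * ys.count(yv) / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def relief_weights(X, y):
    """Simplified Relief-style weights (assumption: Hamming distance,
    every instance visited once, ties among nearest neighbours averaged)."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Nearest hits lower a feature's weight, nearest misses raise it.
        for same_class, sign in ((True, -1.0), (False, 1.0)):
            candidates = [
                xj for j, (xj, yj) in enumerate(zip(X, y))
                if j != i and (yj == yi) == same_class
            ]
            dists = [sum(a != b for a, b in zip(xi, xj)) for xj in candidates]
            d_min = min(dists)
            nearest = [xj for xj, d in zip(candidates, dists) if d == d_min]
            for f in range(n_features):
                diff = sum(xi[f] != xj[f] for xj in nearest) / len(nearest)
                weights[f] += sign * diff / len(X)
    return weights


# Hypothetical toy data: all 8 combinations of three binary SNPs.
X = [list(g) for g in product((0, 1), repeat=3)]
# Pure interaction: status depends on SNP 0 XOR SNP 1; SNP 2 is noise.
y = [a ^ b for a, b, _ in X]

chi = [chi_square([row[f] for row in X], y) for f in range(3)]
rel = relief_weights(X, y)
print("chi-square per SNP:", chi)
print("Relief-style weight per SNP:", rel)
```

On this toy dataset the chi-square statistic is zero for all three SNPs, because neither interacting SNP has a marginal effect, while the Relief-style weights rank the two interacting SNPs above the noise SNP. Real GWAS data have three genotype classes, noise, and far more SNPs, so this only illustrates the principle.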
