Abstract

BackgroundIt is hypothesized that common, complex diseases may be due to complex interactions between genetic and environmental factors, which are difficult to detect in high-dimensional data using traditional statistical approaches. Multifactor Dimensionality Reduction (MDR) is the most commonly used data-mining method to detect epistatic interactions. In all data-mining methods, it is important to consider internal validation procedures to obtain prediction estimates to prevent model over-fitting and reduce potential false positive findings. Currently, MDR utilizes cross-validation for internal validation. In this study, we incorporate the use of a three-way split (3WS) of the data in combination with a post-hoc pruning procedure as an alternative to cross-validation for internal model validation to reduce computation time without impairing performance. We compare the power to detect true disease causing loci using MDR with both 5- and 10-fold cross-validation to MDR with 3WS for a range of single-locus and epistatic disease models. Additionally, we analyze a dataset in HIV immunogenetics to demonstrate the results of the two strategies on real data.ResultsMDR with 3WS is computationally approximately five times faster than 5-fold cross-validation. The power to find the exact true disease loci without detecting false positive loci is higher with 5-fold cross-validation than with 3WS before pruning. However, the power to find the true disease causing loci in addition to false positive loci is equivalent to the 3WS. With the incorporation of a pruning procedure after the 3WS, the power of the 3WS approach to detect only the exact disease loci is equivalent to that of MDR with cross-validation. In the real data application, the cross-validation and 3WS analyses indicate the same two-locus model.ConclusionsOur results reveal that the performance of the two internal validation methods is equivalent with the use of pruning procedures. The specific pruning procedure should be chosen understanding the trade-off between identifying all relevant genetic effects but including false positives and missing important genetic factors. This implies 3WS may be a powerful and computationally efficient approach to screen for epistatic effects, and could be used to identify candidate interactions in large-scale genetic studies.

Highlights

  • It is hypothesized that common, complex diseases may be due to complex interactions between genetic and environmental factors, which are difficult to detect in high-dimensional data using traditional statistical approaches

  • It has been previously shown that increasing the number of nuisance loci does not affect the power of Multifactor Dimensionality Reduction (MDR) to detect the true disease causing loci [17], but we compare the results of CV-5, CV-10, and 3WS for both the XOR and ZZ models using data generated with 100 total loci in addition to 25 loci

  • For both the XOR and ZZ models, results for CV10, CV-5, and 3WS are similar when the number of total loci was increased from 25 to 100, which is consistent with the assumption that the addition of nuisance loci does not affect the power of MDR

Read more

Summary

Introduction

It is hypothesized that common, complex diseases may be due to complex interactions between genetic and environmental factors, which are difficult to detect in high-dimensional data using traditional statistical approaches. Due to the large-scale nature of genetic data (in terms of the number of markers evaluated), we need methods to simultaneously build disease models, perform variable selection, and control for possible false positive findings, which all become inherently more difficult when high-order genetic interactions are considered [2,5,6]. Much work has been done on MDR and many extensions have been developed [8,9,10,11,12,13,14,15] It is arguably one of the most commonly used data-mining methods in genetic epidemiology [7,16], and has been highly successful in a wide range of simulations [6,17,18,19,20] and real data applications, including investigations of multiple sclerosis [21,22], schizophrenia [23], and endophenotypes in breast cancer [24]

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call