Abstract

Machine learning approaches are an attractive option for analyzing large-scale data to detect genetic variants that contribute to variation of a quantitative trait, without requiring specific distributional assumptions. We evaluate two machine learning methods, random forests and logic regression, and compare them to standard simple univariate linear regression, using the Genetic Analysis Workshop 17 mini-exome data. We also apply these methods after collapsing multiple rare variants within genes and within gene pathways. Linear regression and the random forest method performed better when rare variants were collapsed based on genes or gene pathways than when each variant was analyzed separately. Logic regression performed better when rare variants were collapsed based on genes rather than on pathways.
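
As a rough illustration of what collapsing rare variants within genes can look like, the sketch below builds one carrier indicator per gene. The abstract does not specify the collapsing rule, so the indicator coding, the minor allele frequency cutoff, and the names `collapse_rare_variants`, `genotypes`, `variant_gene`, and `maf_threshold` are all assumptions made for illustration, not the authors' procedure.

```python
from typing import Dict, List


def collapse_rare_variants(
    genotypes: Dict[str, List[int]],
    variant_gene: Dict[str, str],
    maf_threshold: float = 0.01,  # assumed cutoff; not taken from the paper
) -> Dict[str, List[int]]:
    """Collapse rare variants into one 0/1 carrier indicator per gene.

    genotypes maps a variant ID to minor-allele counts (0, 1 or 2) per subject.
    variant_gene maps a variant ID to the gene it belongs to.
    """
    collapsed: Dict[str, List[int]] = {}
    for variant, counts in genotypes.items():
        n_subjects = len(counts)
        # Minor allele frequency: observed minor alleles over 2 alleles per subject.
        maf = sum(counts) / (2 * n_subjects)
        # Skip monomorphic variants and variants that are not rare.
        if maf == 0 or maf >= maf_threshold:
            continue
        gene = variant_gene[variant]
        indicator = collapsed.setdefault(gene, [0] * n_subjects)
        for i, count in enumerate(counts):
            if count > 0:
                indicator[i] = 1  # subject carries at least one rare allele
    return collapsed
```

The same idea extends to pathways by mapping each variant to a pathway label instead of a gene; the collapsed indicators can then be used as predictors in place of the individual variants.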

Highlights

  • Although the common disease/common variant hypothesis has been successful at detecting some small to moderate genetic effects for complex traits in genome-wide association studies, a substantial proportion of the heritability remains unexplained

  • We evaluate the performance of the random forest (RF) method, logic regression (LR), and simple univariate linear regression (ULR) using the Genetic Analysis Workshop 17 (GAW17) mini-exome sequence data

  • Simple univariate linear regression: Additional table 1 shows that, for the ULR on uncollapsed data, only three causal variants (CVs) were significantly associated with Q2 at a Bonferroni-corrected significance level of 0.05, in one replicate each (PoR = 0.5%)
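
The Bonferroni correction mentioned above can be sketched in a few lines: with m tests and a family-wise level of 0.05, a variant is declared significant only if its p-value falls below 0.05 / m. The function name `bonferroni_significant` and the example p-values are hypothetical and are not taken from the paper's results.

```python
from typing import List


def bonferroni_significant(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag tests whose p-value is below the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)  # m = number of tests performed
    return [p < threshold for p in p_values]


# Hypothetical p-values for three variants; the threshold is 0.05 / 3.
flags = bonferroni_significant([0.001, 0.02, 0.04])
```

Only the first of the three hypothetical p-values passes here, which shows how quickly the per-test threshold tightens as the number of variants tested grows.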


Summary

Introduction

Although the common disease/common variant hypothesis has been successful at detecting some small to moderate genetic effects for complex traits in genome-wide association studies, a substantial proportion of the heritability remains unexplained. The paradigm of common disease/rare variant contributions to the remaining genetic variation is therefore of interest. New sequencing technologies have made it feasible to determine DNA sequence variations in large numbers of subjects. Machine learning approaches are attractive because they can handle large-scale data without requiring specific distributional assumptions and are useful for detecting interaction effects of multiple predictors on a trait. The random forest (RF) method and logic regression (LR) are two such machine learning methods [1]. The RF method [2] has been used in genome-wide association studies to reduce the number of genetic variants that will be used

