Comparing machine learning and logistic regression methods for predicting hypertension using a combination of gene expression and next-generation sequencing data.

Elizabeth Held,Nathan Tintle,Joshua Cape

doi:10.1186/s12919-016-0020-2


Open Access



Abstract

Machine learning methods continue to show promise in the analysis of data from genetic association studies because of the high number of variables relative to the number of observations. However, few best practices exist for the application of these methods. We extend a recently proposed supervised machine learning approach for predicting disease risk from genotypes so that it can incorporate gene expression data and rare variants. We then apply 2 versions of the approach (radial and linear support vector machines) to simulated data from Genetic Analysis Workshop 19 and compare their performance to that of logistic regression. Performance was not radically different across the 3 methods, although the linear support vector machine tended to show small gains in predictive ability relative to the radial support vector machine and logistic regression. Importantly, as the number of genes in the models increased, even when those genes contained causal rare variants, predictive ability showed a statistically significant decrease for both the radial support vector machine and logistic regression. The linear support vector machine was more robust to the inclusion of additional genes. Further work is needed to evaluate machine learning approaches on larger samples and to evaluate the relative improvement in model prediction from the incorporation of gene expression data.

Highlights

Breakthroughs in genome-wide sequencing continue to motivate the development of novel methods to identify risk factors for complex diseases
Machine learning methods (MLMs) lend themselves to the genetic analysis of diseases with multiple and complex risk factors because of the high-dimensional nature of the data
Linear support vector machines (SVMs) tended to outperform both other methods by a slight margin overall, differences which were statistically significant (p

Summary

Introduction

Breakthroughs in genome-wide sequencing continue to motivate the development of novel methods to identify risk factors for complex diseases. Machine learning methods (MLMs) are statistical algorithms that allow a computer to learn from one data set (the selection set) and make inferences about other data of the same nature. MLMs lend themselves to the genetic analysis of diseases with multiple and complex risk factors because of the high-dimensional nature of the data. We extend a recently proposed supervised machine learning approach [3] to further understand the behavior and performance of MLMs on sequence data. We incorporate a recent statistical model proposed for the joint analysis of gene expression data and genotype data in evaluating disease risk [4], along with explicit consideration of rare variants using a collapsing (burden) approach [5].
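To make the collapsing (burden) idea concrete, the following is a minimal sketch in Python of one common form of it: for each gene, the minor-allele counts of its rare variants are summed into a single score per individual, which could then serve as one predictor alongside gene expression. This is an illustration under stated assumptions, not necessarily the scheme used in [5] or in this study. The function names, the gene labels, the toy genotype matrix, and the minor allele frequency (MAF) threshold are all hypothetical.

```python
from typing import Dict, List, Sequence


def minor_allele_frequencies(genotypes: Sequence[Sequence[int]]) -> List[float]:
    """Per-variant MAF. Rows are individuals, columns are variants,
    and entries are minor-allele counts (0, 1, or 2)."""
    n_individuals = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_variants)
    ]


def burden_scores(
    genotypes: Sequence[Sequence[int]],
    gene_to_variants: Dict[str, List[int]],
    maf_threshold: float = 0.05,  # hypothetical cutoff for calling a variant "rare"
) -> Dict[str, List[int]]:
    """Collapse the rare variants of each gene into one count per individual."""
    mafs = minor_allele_frequencies(genotypes)
    scores = {}
    for gene, columns in gene_to_variants.items():
        # Keep only variants that are observed and fall below the rarity cutoff.
        rare = [j for j in columns if 0 < mafs[j] < maf_threshold]
        scores[gene] = [sum(row[j] for j in rare) for row in genotypes]
    return scores


# Toy data (hypothetical): 4 individuals, 3 variants, 2 genes.
genotypes = [
    [0, 1, 1],
    [1, 1, 0],
    [0, 2, 0],
    [0, 0, 0],
]
gene_to_variants = {"GENE_A": [0, 1], "GENE_B": [2]}

# With only 4 individuals the smallest nonzero MAF is 0.125,
# so a looser cutoff is used here purely for illustration.
scores = burden_scores(genotypes, gene_to_variants, maf_threshold=0.2)
print(scores)
```

In this toy example the second variant of GENE_A is common (MAF 0.5) and is excluded from the burden score, so only the first variant contributes. Each per-gene score reduces many sparse rare-variant columns to a single feature, which is one way to keep the number of predictors manageable relative to the number of observations.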

Methods

Results

Conclusion