Abstract

Most studies of phenotype differences, including some diseases, are based on specific positions in the genome called Single Nucleotide Polymorphisms (SNPs). Some SNPs act alone and others act by interacting with other SNPs, and both kinds can play an important role in a phenotype or a specific disease. Various models, including regression models, have been designed and implemented to predict these diseases. In this paper, three penalized logistic regression models, Ridge, Lasso and Elastic Net (EN), are used to predict the risk of a specific disease while overcoming the limitations of classic logistic regression on high-dimensional SNP datasets. The models are applied to 10000 samples from the SNP datasets of the OWKIN-Inserm Institute, which contain 18124 SNPs. Among the three, the Lasso model with the minimizing value of lambda achieves the highest accuracy (73.73%) and AUC (83.54%). This model is also the least complex, since it eliminates as many weakly related features as possible and keeps only the most informative ones. Additionally, the better results obtained with Lasso suggest that multicollinearity among the variables is either absent or low enough to be neglected.
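
As an illustration only, the comparison of the three penalties can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the genotype matrix is synthetic (random 0/1/2 minor-allele counts), the number of samples and SNPs is far smaller than in the study, the library (scikit-learn) and all variable names are our own choices, and the regularization strength is fixed rather than tuned. In scikit-learn the parameter `C` is the inverse of the penalty weight lambda.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a SNP matrix (assumption: NOT the OWKIN-Inserm data).
# Each entry is a minor-allele count in {0, 1, 2}.
rng = np.random.default_rng(0)
n_samples, n_snps = 400, 200
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)

# Assumption: only the first 10 SNPs carry signal; the rest are noise.
true_coef = np.zeros(n_snps)
true_coef[:10] = rng.normal(0.0, 1.0, size=10)
logits = (X - X.mean(axis=0)) @ true_coef
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# C is the inverse regularization strength (C = 1 / lambda); fixed here,
# whereas in practice it would be chosen by cross-validation.
common = dict(C=0.1, solver="saga", max_iter=5000)
models = {
    "ridge": LogisticRegression(penalty="l2", **common),
    "lasso": LogisticRegression(penalty="l1", **common),
    "elastic_net": LogisticRegression(penalty="elasticnet", l1_ratio=0.5, **common),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc": roc_auc_score(y_test, prob),
        # Number of SNPs the model keeps (non-zero coefficients).
        "n_selected": int(np.count_nonzero(model.coef_)),
    }

for name, metrics in results.items():
    print(name, metrics)
```

The `n_selected` count makes the sparsity argument concrete: the L1 penalty can set coefficients exactly to zero, while the L2 penalty only shrinks them. The numbers printed on synthetic data say nothing about the results reported above.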
