Many complex diseases are thought to be caused by multiple genetic variants. Recent advances in genotyping technology have allowed investigators of a complex disease to obtain data on a massive number of candidate genetic variants. Typically, each candidate variant is tested individually for an association with the disease. We instead approach the problem as one of model selection for high-dimensional data. We propose a two-stage method: penalised maximum likelihood estimation first provides a reasonably sized set of variants for inclusion in our model, and stepwise regression on this set then yields the final model. Penalised maximum likelihood estimation is performed with both the lasso and a more recently developed method known as the hyperlasso, with smoothing parameters chosen by cross-validation. Compared with the lasso, the hyperlasso has a penalty function that favours sparser solutions while shrinking the retained variables less; however, this comes at extra computational cost. We apply the method to a large genomic data set from a previously published mouse obesity study and use resample model averaging to assess model performance.
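The two-stage procedure described in the abstract (penalised screening followed by stepwise regression) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the data are synthetic, the trait is treated as continuous with a squared-error loss standing in for the likelihood, only the lasso is shown (not the hyperlasso), BIC is used as the stepwise criterion, and the function names `soft_threshold`, `fit_cd`, `cv_error` and `forward_stepwise` are hypothetical.

```python
import math
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def fit_cd(X, y, cols, lam, sweeps=100):
    """Coordinate descent for (1/2n)*RSS + lam*sum|b_j| over columns `cols`.
    With lam=0 this reduces to least squares on those columns.
    Assumes centred data, so no intercept is fitted."""
    n = len(y)
    beta = {j: 0.0 for j in cols}
    resid = list(y)  # residuals when all coefficients are zero
    norm = {j: sum(X[i][j] ** 2 for i in range(n)) / n for j in cols}
    for _ in range(sweeps):
        for j in cols:
            # covariance of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + norm[j] * beta[j]
            new = soft_threshold(rho, lam) / norm[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta, resid

def cv_error(X, y, lam, folds=5):
    """Mean squared prediction error of the lasso under k-fold cross-validation."""
    n, p = len(y), len(X[0])
    err = 0.0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        beta, _ = fit_cd([X[i] for i in train], [y[i] for i in train], range(p), lam)
        for i in test:
            pred = sum(b * X[i][j] for j, b in beta.items())
            err += (y[i] - pred) ** 2
    return err / n

def forward_stepwise(X, y, candidates):
    """Forward selection over `candidates`, adding variables while BIC improves."""
    n = len(y)
    def bic(cols):
        _, resid = fit_cd(X, y, cols, lam=0.0)
        rss = sum(r * r for r in resid)
        return n * math.log(rss / n) + len(cols) * math.log(n)
    chosen, best = [], bic([])
    remaining = list(candidates)
    while remaining:
        score, j = min((bic(chosen + [k]), k) for k in remaining)
        if score >= best:
            break
        chosen.append(j)
        remaining.remove(j)
        best = score
    return sorted(chosen)

# Synthetic stand-in for genotype data: 20 variants, two of them causal.
random.seed(0)
n, p = 60, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[3] + random.gauss(0, 0.5) for row in X]

# Stage 1: lasso screening, penalty chosen by cross-validation.
lams = [0.8, 0.4, 0.2, 0.1, 0.05]
best_lam = min(lams, key=lambda lam: cv_error(X, y, lam))
beta, _ = fit_cd(X, y, range(p), best_lam)
screened = [j for j, b in beta.items() if b != 0.0]

# Stage 2: stepwise regression restricted to the screened variants.
model = forward_stepwise(X, y, screened)
print("lambda:", best_lam, "screened:", screened, "final model:", model)
```

The stepwise stage searches only the screened variants, so its cost depends on the size of the lasso's output rather than on the full number of candidate variants. For a binary disease outcome the squared-error loss would need to be replaced by a logistic likelihood.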