Abstract

Background: Increasingly, molecular measurements from multiple studies are pooled to identify risk scores, with only partial overlap of measurements available from different studies. In genome-wide association studies, univariate analyses of such markers are routinely performed in this setting using meta-analysis techniques to identify genetic risk scores. In contrast, multivariable techniques such as regularized regression, which might be more powerful, are hampered by the only partial overlap of available markers, even when pooling individual level data is feasible. This cannot easily be addressed at a preprocessing level, as quality criteria in the different studies may result in differential availability of markers, even after imputation.

Methods: Motivated by data from the InterLymph Consortium on risk factors for non-Hodgkin lymphoma, which exhibits these challenges, we adapted a regularized regression approach, componentwise boosting, to deal with partial overlap in SNPs. This synthesis regression approach is combined with resampling to determine stable sets of single nucleotide polymorphisms, which could feed into a genetic risk score. The proposed approach is contrasted with univariate analyses, an application of the lasso, and an analysis that discards the studies causing the partial overlap. The question of statistical significance is addressed with an approach called stability selection.

Results: Using an excerpt of the data from the InterLymph Consortium on two specific subtypes of non-Hodgkin lymphoma, it is shown that componentwise boosting can take into account all applicable information from different SNPs, irrespective of whether they are covered by all investigated studies and for all individuals within the single studies. The results indicate increased power, even when the studies that would be discarded in a complete case analysis comprise only a small proportion of individuals.

Conclusions: Given the observed gains in power, the proposed approach can be recommended more generally whenever there is only partial overlap of molecular measurements obtained from pooled studies and/or missing data in single studies. A corresponding software implementation is available upon request.

Trial registration: All involved studies have provided signed GWAS data submission certifications to the U.S. National Institutes of Health and have been retrospectively registered.

Highlights

  • Molecular measurements from multiple studies are pooled to identify risk scores, with only partial overlap of measurements available from different studies

  • Combining case-control studies with measurements of single nucleotide polymorphisms (SNPs) into large genome-wide association studies (GWAS) has allowed investigations into even very rare risk variants for some diseases [1]. Some of these consortia, such as the InterLymph Consortium on non-Hodgkin lymphoma (NHL) [2,3,4,5,6,7,8,9], not only allow for combining aggregate per-SNP statistics from each participating study, but also provide individual level data from all studies for joint analysis. This opens the way for more sophisticated analyses, but any approach must contend with only partial overlap of the SNPs available from different studies due to differences in genotyping platform, quality control, and imputation approaches.

  • We briefly describe the data from the InterLymph Consortium and propose the adaptation of componentwise boosting for synthesis regression in the Methods section

Introduction

Molecular measurements from multiple studies are pooled to identify risk scores, with only partial overlap of measurements available from different studies. Combining case-control studies with measurements of single nucleotide polymorphisms (SNPs) into large genome-wide association studies (GWAS) has allowed investigations into even very rare risk variants for some diseases [1]. Some of these consortia, such as the InterLymph Consortium on non-Hodgkin lymphoma (NHL) [2,3,4,5,6,7,8,9], not only allow for combining aggregate per-SNP statistics from each participating study, but also provide individual level data from all studies for joint analysis. [10] suggested an approach based on the group lasso, and [11] considered a hybrid approach combining linear mixed models and sparse regression models, a so-called Bayesian sparse linear mixed model.
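To make the idea of componentwise boosting under partial overlap concrete, the following is a minimal sketch and not the authors' implementation. It assumes squared-error loss for simplicity (case-control data would call for a likelihood-based variant), treats a missing entry as contributing nothing to the linear predictor, and scores each candidate covariate using only the individuals for whom it is observed. The function name `componentwise_boost`, the parameters `n_steps` and `nu`, and the simulated data are all hypothetical.

```python
import numpy as np


def componentwise_boost(X, y, n_steps=100, nu=0.1):
    """Componentwise boosting with squared-error loss on data with NaN gaps.

    X: (n, p) covariates (e.g. SNP allele counts), NaN where a covariate is
       unavailable for an individual or for a whole study.
    y: (n,) response.
    Returns the intercept and coefficients for the column-centered covariates.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(X)
    # Center each column on its observed values; assumes every column is
    # observed for at least one individual.
    Xc = X - np.nanmean(X, axis=0)
    # Assumption: a missing entry adds nothing to the linear predictor.
    X0 = np.where(observed, Xc, 0.0)

    intercept = y.mean()
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        resid = y - intercept - X0 @ beta
        best_j, best_gain, best_b = None, 0.0, 0.0
        for j in range(X.shape[1]):
            rows = observed[:, j]          # individuals with covariate j
            xj = Xc[rows, j]
            denom = xj @ xj
            if denom == 0.0:
                continue
            b = (xj @ resid[rows]) / denom  # least-squares fit on those rows
            gain = b * b * denom            # drop in residual sum of squares
            if gain > best_gain:
                best_j, best_gain, best_b = j, gain, b
        if best_j is None:
            break
        beta[best_j] += nu * best_b         # update only the best covariate
    return intercept, beta


# Simulated example: the last 100 individuals (a second "study") lack
# columns 3 and 4, yet column 3 carries a true effect.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(400, 6)).astype(float)
y = X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=400)
X[300:, [3, 4]] = np.nan
_, beta = componentwise_boost(X, y)
print(np.round(beta, 2))
```

In this sketch a covariate observed in more individuals tends to achieve a larger raw gain, so the selection criterion implicitly weights by the amount of available data; whether and how to standardize that criterion is a design choice the sketch does not settle. Repeating the fit on resampled subsets and recording how often each covariate is selected would give the kind of inclusion frequencies that a resampling-based stability assessment builds on.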

