Real world scenarios in rare variant association analysis: the impact of imbalance and sample size on the power in silico

Xinyuan Zhang, Marylyn D. Ritchie, Sarah A. Pendergrass, Anna O. Basile

doi:10.1186/s12859-018-2591-6

Open Access

https://doi.org/10.1186/s12859-018-2591-6

Journal: BMC Bioinformatics
Publication Date: Jan 22, 2019
Citations: 21
License type: open-access

Affiliation: University of Pennsylvania, Columbia University

Abstract

Background: The development of sequencing techniques and statistical methods provides great opportunities for identifying the impact of rare genetic variation on complex traits. However, there is a lack of knowledge on the impact of sample size, case numbers, and the balance of cases vs. controls for both burden and dispersion based rare variant association methods. For example, Phenome-Wide Association Studies may have a wide range of case and control sample sizes across hundreds of diagnoses and traits, and with the application of statistical methods to rare variants, it is important to understand the strengths and limitations of the analyses.

Results: We conducted a large-scale simulation of randomly selected low-frequency protein-coding regions using twelve different balanced samples with an equal number of cases and controls as well as twenty-one unbalanced sample scenarios. We further explored statistical performance of different minor allele frequency thresholds and a range of genetic effect sizes. Our simulation results demonstrate that an unbalanced study design has an overall higher type I error rate for both burden and dispersion tests compared with a balanced study design. Regression has an overall higher type I error with balanced cases and controls, while SKAT has higher type I error for unbalanced case-control scenarios. We also found that both type I error and power were driven by the number of cases in addition to the case to control ratio under large control group scenarios. Based on our power simulations, we observed that a SKAT analysis with case numbers larger than 200 for unbalanced case-control models yielded over 90% power with relatively well controlled type I error. To achieve similar power in regression, over 500 cases are needed. Moreover, SKAT showed higher power to detect associations in unbalanced case-control scenarios than regression.

Conclusions: Our results provide important insights into rare variant association study designs by providing a landscape of type I error and statistical power for a wide range of sample sizes. These results can serve as a benchmark for making decisions about study design for rare variant analyses.

Highlights

The development of sequencing techniques and statistical methods provides great opportunities for identifying the impact of rare genetic variation on complex traits
Regression had an overall higher type I error rate compared with sequence kernel association test (SKAT) for balanced samples
We investigated whether the type I error rate was driven by the ratio of the cases to controls or by the number of cases when having a large control sample

Summary

Introduction

The development of sequencing techniques and statistical methods provides great opportunities for identifying the impact of rare genetic variation on complex traits. There is a lack of knowledge on the impact of sample size, case numbers, and the balance of cases vs. controls for both burden and dispersion based rare variant association methods. Phenome-Wide Association Studies may have a wide range of case and control sample sizes across hundreds of diagnoses and traits, and with the application of statistical methods to rare variants, it is important to understand the strengths and limitations of the analyses. Rare variants, with moderately large genetic effect sizes, may explain more of the phenotypic variance of complex disease [4]. The development of sequencing technologies has increased access to rare variation data for large sample sizes. It is crucial to better understand the statistical power and analytic limitations of rare variant association approaches.
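To make the idea of estimating type I error and power in silico concrete, the following is a minimal sketch, not the authors' pipeline. It collapses a region into a per-person carrier indicator (carries at least one rare allele or not) and compares carrier frequencies between cases and controls with a pooled two-proportion z-test. That test is a simple stand-in chosen for illustration; it is neither the regression-based burden test nor SKAT evaluated in the paper. All function names, carrier probabilities, sample sizes, and the replicate count are hypothetical.

```python
import math
import random


def simulate_carriers(n, carrier_prob, rng):
    """Count individuals (out of n) carrying at least one rare allele in the region."""
    return sum(1 for _ in range(n) if rng.random() < carrier_prob)


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (illustrative stand-in)."""
    pooled = (x1 + x2) / (n1 + n2)
    var = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if var == 0:
        return 1.0  # nobody (or everybody) is a carrier: no evidence either way
    z = (x1 / n1 - x2 / n2) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(n_cases, n_controls, p_case, p_control,
                   alpha=0.05, n_reps=500, seed=1):
    """Fraction of simulated datasets in which the test rejects at level alpha.

    With p_case == p_control this estimates the type I error rate;
    with p_case != p_control it estimates power.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        x_case = simulate_carriers(n_cases, p_case, rng)
        x_ctrl = simulate_carriers(n_controls, p_control, rng)
        if two_proportion_p_value(x_case, n_cases, x_ctrl, n_controls) < alpha:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    # Hypothetical scenarios: one balanced, one unbalanced, same carrier rate (null).
    print("balanced type I error:  ", rejection_rate(500, 500, 0.05, 0.05))
    print("unbalanced type I error:", rejection_rate(100, 2000, 0.05, 0.05))
    # Hypothetical effect: carriers are enriched among cases.
    print("balanced power:         ", rejection_rate(500, 500, 0.10, 0.05))
```

Running the same function over a grid of case counts and case-to-control ratios gives a rough picture of how rejection rates shift with imbalance, which is the kind of landscape the paper reports for its own tests.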

Methods

Results

Conclusion