rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study

Lilin Yin, Haohao Zhang, Zhenshuang Tang, Jingya Xu, Dong Yin, Zhiwu Zhang, Xiaohui Yuan, Mengjin Zhu, Shuhong Zhao, Xinyun Li, Xiaolei Liu

doi:10.1016/j.gpb.2020.10.007

Abstract

Along with the development of high-throughput sequencing technologies, both sample size and SNP number are increasing rapidly in genome-wide association studies (GWAS), and the associated computation is more challenging than ever. Here, we present a memory-efficient, visualization-enhanced, and parallel-accelerated R package called “rMVP” to address the need for improved GWAS computation. rMVP can 1) effectively process large GWAS data, 2) rapidly evaluate population structure, 3) efficiently estimate variance components by Efficient Mixed-Model Association eXpedited (EMMAX), Factored Spectrally Transformed Linear Mixed Models (FaST-LMM), and Haseman-Elston (HE) regression algorithms, 4) implement parallel-accelerated association tests of markers using general linear model (GLM), mixed linear model (MLM), and fixed and random model circulating probability unification (FarmCPU) methods, 5) run quickly thanks to a globally efficient design of the GWAS workflow, and 6) generate various visualizations of GWAS-related information. Accelerated by a block matrix multiplication strategy and multithreading, the association test methods embedded in rMVP are significantly faster than PLINK, GEMMA, and FarmCPU_pkg. rMVP is freely available at https://github.com/xiaolei-lab/rMVP.

Highlights

The computational burden of genome-wide association studies (GWAS) is driven in part by the increasing sample sizes and marker densities used in these studies
To address these computational demands, we developed rMVP, a Memory-efficient, Visualization-enhanced, and Parallel-accelerated package in R
Genotype matrices are the largest datasets in GWAS

Summary

Introduction

The computational burden of GWAS is driven in part by the increasing sample sizes and marker densities used in these studies, and analysing such large datasets efficiently is a major challenge. GWAS have been widely used to detect candidate genes that control human diseases and economically important agricultural traits, where the accuracy of the results has significant implications. Achieving higher statistical power at a reasonable type I error rate is another challenge[1]. Detecting more candidate genes efficiently while keeping false positive rates low is the current goal for GWAS algorithms and tools[2, 3]. Introducing the concept of population structure into GWAS has dramatically improved detection accuracy. Incorporating the fractions of individuals belonging to subpopulations, known as the Q matrix, reduces both false positive and false negative signals[4].
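To make the role of the Q matrix concrete, the following is a minimal sketch in pure Python, not the rMVP API. It fits a single-marker linear model with and without a subpopulation-membership covariate. The helper names (`solve`, `ols`, `marker_effect`) and the toy data are hypothetical and chosen only for illustration. The data are constructed so that the marker is more frequent in the subpopulation with the higher trait mean but has no effect within either subpopulation.

```python
# Toy illustration only: pure-Python least squares, not the rMVP API.

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(design, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def marker_effect(genotype, phenotype, q=None):
    """Estimated marker effect, optionally adjusted for Q covariates."""
    design = []
    for k, g in enumerate(genotype):
        row = [1.0, float(g)]  # intercept and marker genotype (0/1/2)
        if q is not None:
            row.extend(q[k])   # subpopulation membership fractions
        design.append(row)
    return ols(design, phenotype)[1]


# Hypothetical data: 6 individuals from subpopulation A, then 6 from B.
genotype = [2, 2, 1, 2, 1, 2, 0, 0, 1, 0, 1, 0]
noise = [0.5, -0.5, 0.3, 0.0, -0.3, 0.0]
phenotype = [10 + e for e in noise] + [5 + e for e in noise]

# With K = 2 subpopulations and an intercept, only K - 1 = 1 column of Q
# is included, because the membership fractions sum to 1.
q = [[1.0]] * 6 + [[0.0]] * 6

naive = marker_effect(genotype, phenotype)
adjusted = marker_effect(genotype, phenotype, q)
print(f"marker effect without Q: {naive:.3f}")
print(f"marker effect with Q:    {adjusted:.3f}")
```

On this toy data the unadjusted model assigns the marker an effect of 2.5, which comes entirely from the difference in trait means between the two subpopulations. Once the Q covariate is included, the estimated effect drops to approximately zero. Real analyses would also need standard errors and p-values, which this sketch omits.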

Methods

Results

Conclusion