Abstract

Genome-wide association studies (GWAS) have been a primary method over the past decade for identifying associations between genetic variants and traits or diseases. Many software packages, based on different statistical models, have been developed for GWAS analysis. One key factor influencing the statistical reliability of GWAS is the amount of input data used. Few studies have investigated this effect by comparing the performance of GWAS programs on varied amounts of experimental data, especially in the context of plants and plant genomes. In this paper, we investigate how input data quantity influences the output of four widely used GWAS programs: PLINK, TASSEL, GAPIT, and FaST-LMM. Both synthetic and real data are used. Standard GWAS output consists of single nucleotide polymorphisms (SNPs) and their p-values. To evaluate the programs, we use the p-values and q-values of SNPs and the Kendall rank correlation between output SNP lists. Results show that, with the same GWAS program, different Arabidopsis thaliana datasets exhibit similar trends in rank correlation as input quantity varies, but differ in the number of SNPs passing a given p- or q-value threshold. In practice, samples in experimental datasets may have varied numbers of biological replicates. We show that this variation in replicates influences the p-values of SNPs but does not strongly affect the rank correlation. When comparing synthetic and real data, the output SNPs from synthetic data show similar rank correlation trends across all four GWAS programs, whereas the same measure on real data differs across the programs. In addition, real data yield a roughly linear increase in the number of significant SNPs as more input data are added, but synthetic data do not follow this trend. This study provides guidance on selecting GWAS programs when varied experimental data are present and on selecting significant SNPs for subsequent study. It contributes to understanding how much input data is necessary to yield satisfactory GWAS results.
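
The two evaluation measures named above can be illustrated with a minimal sketch. It is not the code used in this study: the function names `kendall_tau` and `bh_q_values` and the example p-values are hypothetical, the rank correlation is the simple tau-a variant without tie correction, and the q-values are assumed to be Benjamini-Hochberg adjusted p-values, which may differ from the q-value method used in the paper.

```python
from itertools import combinations


def kendall_tau(p_a, p_b):
    """Kendall tau-a between two p-value lists covering the same SNPs.

    Tied pairs count as neither concordant nor discordant (no tie correction).
    """
    if len(p_a) != len(p_b) or len(p_a) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(p_a)), 2):
        sign = (p_a[i] - p_a[j]) * (p_b[i] - p_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(p_a) * (len(p_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values (an assumed q-value definition)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q


# Hypothetical p-values for the same five SNPs from two runs of one program,
# e.g. one run on the full dataset and one on a subset.
p_full = [1e-8, 3e-5, 0.002, 0.04, 0.5]
p_subset = [2e-6, 0.01, 0.0009, 0.2, 0.6]

tau = kendall_tau(p_full, p_subset)
n_significant = sum(q < 0.01 for q in bh_q_values(p_full))
print(f"tau = {tau:.2f}, SNPs with q < 0.01: {n_significant}")
```

In this toy case only one pair of SNPs swaps order between the two runs, so the rank correlation stays high even though the individual p-values shift considerably, which mirrors the distinction the abstract draws between rank correlation and threshold counts.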
