Evaluation of the Minimum Sampling Design for Population Genomic and Microsatellite Studies: An Analysis Based on Wild Maize.

Jonás A Aguirre-Liguori, Jaime Gasca-Pineda, Luis E Eguiarte, Javier A Luna-Sánchez

doi:10.3389/fgene.2020.00870

Abstract

Massive parallel sequencing (MPS) is revolutionizing the field of molecular ecology by allowing us to better understand the evolutionary history of populations and species, and to detect genomic regions that could be under selection. However, the economic and computational resources it requires create a tradeoff between the number of loci that can be obtained and the number of populations or individuals that can be sequenced. In this work, we analyzed and compared two simulated genomic datasets fitting a hierarchical structure, two extensive empirical genomic datasets, and a dataset comprising microsatellite information. For all datasets, we generated different subsampling designs by changing the number of loci, individuals, populations, and individuals per population to test for deviations in classic population genetics parameters (HS, FIS, FST). For the empirical datasets, we also analyzed the effect of sampling design on landscape genetic tests (isolation by distance and environment, central abundance hypothesis). We also tested the effect of sampling a different number of populations on the detection of outlier SNPs. We found that summary statistics obtained from the microsatellite dataset are very sensitive to the number of individuals sampled. FIS was particularly sensitive to a low sampling of individuals in the simulated, genomic, and microsatellite datasets. For the empirical and simulated genomic datasets, we found that as long as many populations are sampled, few individuals and loci are needed. For the empirical datasets, we found that increasing the number of populations sampled was important for obtaining precise landscape genetic estimates. Finally, we corroborated that outlier tests are sensitive to the number of populations sampled. We conclude by proposing different sampling designs depending on the objectives.
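The subsampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a single population of biallelic SNPs coded as allele counts (0, 1, 2), uses a simple expected heterozygosity with a 2n/(2n-1) small-sample correction, and defines FIS as 1 - HO/HS and relative error as (estimate - reference)/reference. All function names (`heterozygosities`, `fis`, `subsample_fis`, `relative_error`) are hypothetical.

```python
import random
from statistics import mean


def heterozygosities(pop):
    """Return (mean H_O, mean H_S) across loci for one population.

    `pop` is a list of individuals; each individual is a list of genotypes
    coded as the number of copies (0, 1 or 2) of one allele at a biallelic SNP.
    """
    n = len(pop)
    n_loci = len(pop[0])
    ho, hs = [], []
    for j in range(n_loci):
        g = [ind[j] for ind in pop]
        p = sum(g) / (2 * n)                        # allele frequency
        ho.append(sum(1 for x in g if x == 1) / n)  # observed heterozygosity
        # Expected heterozygosity with a 2n/(2n-1) correction (an assumption;
        # the paper's exact estimator is not stated in this summary).
        hs.append(2 * n / (2 * n - 1) * 2 * p * (1 - p))
    return mean(ho), mean(hs)


def fis(pop):
    """F_IS = 1 - H_O / H_S; returns NaN if the sample is monomorphic."""
    ho, hs = heterozygosities(pop)
    return float("nan") if hs == 0 else 1 - ho / hs


def subsample_fis(pop, n_ind, n_rep=1000, seed=0):
    """Estimate F_IS on `n_rep` random subsamples of `n_ind` individuals,
    drawn without replacement."""
    rng = random.Random(seed)
    return [fis(rng.sample(pop, n_ind)) for _ in range(n_rep)]


def relative_error(estimate, reference):
    """Signed deviation of a subsample estimate from the full-data value."""
    return (estimate - reference) / reference
```

A typical use would compute `fis(pop)` on the complete dataset, call `subsample_fis(pop, n_ind)` for several values of `n_ind`, and summarize `relative_error` across replicates, skipping any NaN replicates. The same loop could be repeated over loci or populations to mimic the other subsampling designs.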

Highlights

Massive parallel sequencing (MPS) has revolutionized the fields of molecular ecology, population genetics, and landscape genetics (Metzker, 2010; Stapley et al., 2010; Ekblom and Galindo, 2011).
For FIS estimations, we found that when fewer individuals were sampled, the mean value across the 1,000 replicates was lower than that of the complete dataset.
Table 2. Summary statistics estimated for the DTS, 50K, and microsatellite datasets of Mexican wild maize.
For the two simulated datasets, we found that sampling fewer loci slightly increased the relative error and the variance in the estimation of the summary statistics across replicates (Supplementary Figure S2).

Summary

Introduction

Massive parallel sequencing (MPS) has revolutionized the fields of molecular ecology, population genetics, and landscape genetics (Metzker, 2010; Stapley et al., 2010; Ekblom and Galindo, 2011). MPS is powerful for detecting patterns of local adaptation and for understanding how the environment structures genetic diversity, but its potential depends on sampling a large geographic area and on encompassing an adequate environmental and genomic representation of the species (Schoville et al., 2012; De Mita et al., 2013; Tiffin and Ross-Ibarra, 2014). It is therefore crucial to determine the potential biases associated with sampling (number of individuals, loci, and populations) and to define the tradeoff between the sampling effort and the number of polymorphic regions obtained with MPS that are needed to obtain robust estimates (Pruett and Winker, 2008; Willing et al., 2012; De Mita et al., 2013; Fumagalli, 2013).

Objectives

Results

Discussion

Conclusion