Abstract

Several clustering algorithms are available to depict population genetic structure (PGS) from genomic data; however, there is no consensus on which methods perform best. We conducted a simulation study of three PGS scenarios with k = 2, 5 and 10 subpopulations, recreating several maize genomes as a model, to: (1) compare three well-known clustering methods: UPGMA, k-means, and a Bayesian method (BM); (2) assess four internal validation indices, CH (Calinski-Harabasz), Connectivity, Dunn and Silhouette, for determining a reliable number of groups defining a PGS; and (3) estimate the misclassification rate for each validation index. In addition, a publicly available maize dataset was used to illustrate the outcomes of our simulation. BM was the best method for classifying individuals in all tested scenarios, with no assignment errors. Conversely, UPGMA had the highest misclassification rate. In the scenarios with 5 and 10 subpopulations, the CH and Connectivity indices underestimated the number of groups most severely for all clustering algorithms. The Dunn and Silhouette indices performed best with BM. Nevertheless, since Silhouette measures the degree of confidence in cluster assignment, and BM measures the probability of cluster membership, these results should be considered with caution. In this study, BM efficiently depicted the PGS in both simulated and real maize datasets. This study offers a robust alternative for unveiling the existing PGS, thereby facilitating population studies and breeding strategies in maize programs. The present findings may also have implications for other crop species.

Introduction

The genetic diversity of a group of individuals can be exhaustively characterized in different species (Becerra and Paredes 2000) using new technologies that evaluate thousands of genomic variants simultaneously (González-Recio et al 2014; Baloch et al 2017). Single Nucleotide Polymorphism (SNP) markers have gained importance because they explain a large proportion of the variance among individuals, and they are the markers most widely used to identify genetic similarity patterns because they are very abundant in the genome (Baloch et al 2017). Variability among individuals of a single population, which generates internal groups or subgroups, may arise from very diverse causes, including gene flow, dispersion, introgression or mutation (Dutheil 2020). Exploring the number of genetic groups within a set of individual genotypes and assigning individuals to those groups has become an essential task in population genetics studies (Beugin et al 2018), as well as in other areas such as plant breeding, in which phenotypic information is complemented with genotypic data (Thorwarth et al 2017; Haile et al 2018; Yuan et al 2020).
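
To make the two tasks above concrete (choosing the number of groups and assigning individuals to them), the following is a minimal sketch in plain Python. It is not the simulation or the software used in the study: it covers only k-means and the Silhouette index, the toy genotype simulator draws independent allele frequencies per subpopulation as a simplifying assumption, and every function name and parameter value is hypothetical.

```python
import random

def simulate_genotypes(n_pops=3, n_per_pop=20, n_snps=300, seed=1):
    """Toy simulation (assumption): each subpopulation has its own
    allele frequency at every SNP, drawn independently."""
    rng = random.Random(seed)
    freqs = [[rng.uniform(0.05, 0.95) for _ in range(n_snps)]
             for _ in range(n_pops)]
    genotypes, truth = [], []
    for pop in range(n_pops):
        for _ in range(n_per_pop):
            # Genotype = number of alternate alleles (0, 1 or 2) per SNP.
            genotypes.append([(rng.random() < p) + (rng.random() < p)
                              for p in freqs[pop]])
            truth.append(pop)
    return genotypes, truth

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centre if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def mean_silhouette(dist, labels):
    """Average silhouette width from a precomputed distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) == 1:
            continue  # such points contribute a silhouette of 0
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in idx) / len(idx)
                for other, idx in clusters.items() if other != lab)
        total += (b - a) / max(a, b)
    return total / len(labels)

def misclassification_rate(truth, labels):
    """Share of individuals outside the majority true group of their cluster."""
    errors = 0
    for lab in set(labels):
        members = [t for t, l in zip(truth, labels) if l == lab]
        errors += len(members) - max(members.count(t) for t in set(members))
    return errors / len(truth)

genotypes, truth = simulate_genotypes()
dist = [[sq_dist(a, b) ** 0.5 for b in genotypes] for a in genotypes]
scores = {k: mean_silhouette(dist, kmeans(genotypes, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
error_rate = misclassification_rate(truth, kmeans(genotypes, best_k))
print(best_k, round(error_rate, 3))
```

The sketch selects the k that maximises the mean silhouette width and then scores the resulting partition against the simulated truth. Real SNP panels are far larger and subpopulations are usually less sharply differentiated than in this toy setting, so an established clustering library would be the practical choice.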
