Abstract

Analysis of population genetic variation and structure is a common practice in genome-wide studies, including association mapping, ecology, and evolution studies in several crop species. In this study, two machine learning (ML) clustering methods, K-means (KM) and hierarchical clustering (HC), were combined with non-linear and linear dimensionality reduction techniques, deep autoencoder (DeepAE) and principal component analysis (PCA), respectively, to infer population structure and individual assignment of maize inbred lines, i.e., dent field corn (n = 97) and popcorn (n = 86). The results revealed that the HC method in combination with DeepAE-based data preprocessing (DeepAE-HC) was the most effective method for assigning individuals to clusters (96% correct individual assignments), whereas DeepAE-KM, PCA-HC, and PCA-KM correctly assigned 92, 89, and 81% of the lines, respectively. These findings were consistent with both the Silhouette Coefficient (SC) and Davies–Bouldin validation indexes. Notably, DeepAE-HC was also more accurate than the Bayesian clustering method implemented in InStruct. The results of this study showed that deep learning (DL)-based dimensionality reduction combined with ML clustering methods is a useful tool for determining genetically differentiated groups and assigning individuals to subpopulations in genome-wide studies without relying on prior genetic assumptions.

Highlights

  • Analysis of population structure and genetic variation is a common practice in genome-wide studies and is an important guideline to understand and infer the evolutionary processes and the demographic history in ecological and evolutionary studies (Stift et al, 2019)

  • The results of the K-means (KM) and hierarchical clustering (HC) methods showed that the label encoder (LE) was less accurate than the deep autoencoder (DeepAE) and principal component analysis (PCA) according to the Silhouette Coefficient (SC) and Davies–Bouldin index (DBI) measures (Table 1)

  • The high values of SC obtained for PCA and DeepAE in combination with both clustering methods indicate that an inbred line is well matched to its own genetic cluster and poorly matched to the neighboring group or subpopulation
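To make the two validation indexes concrete, the following is a minimal pure-Python sketch of how SC and DBI can be computed from a set of points and their cluster labels. It is an illustration under assumptions, not the implementation used in the study: the function names and the toy 2-D points are hypothetical, and Euclidean distance is assumed. Higher SC (closer to 1) and lower DBI both indicate compact, well-separated clusters.

```python
from math import dist
from statistics import mean


def _group(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = _group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is included in the sum only)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(mean(dist(p, q) for q in members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


def davies_bouldin_index(points, labels):
    """Mean, over clusters, of the worst scatter-to-separation ratio."""
    clusters = _group(points, labels)
    centroids = {lab: tuple(mean(c) for c in zip(*m))
                 for lab, m in clusters.items()}
    # scatter: mean distance of a cluster's members to its centroid
    scatter = {lab: mean(dist(p, centroids[lab]) for p in m)
               for lab, m in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                 for j in clusters if j != i)
             for i in clusters]
    return mean(worst)


# Hypothetical toy data: two tight groups far apart in a 2-D reduced space.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
toy_labels = [0, 0, 1, 1]
print(silhouette_coefficient(toy_points, toy_labels))
print(davies_bouldin_index(toy_points, toy_labels))
```

In practice, a library routine would normally be preferred; scikit-learn, for example, provides `silhouette_score` and `davies_bouldin_score` in `sklearn.metrics`. The text here does not say which software the authors used.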


Summary

Introduction

Analysis of population structure and genetic variation is a common practice in genome-wide studies and is an important guideline to understand and infer the evolutionary processes and the demographic history in ecological and evolutionary studies (Stift et al., 2019). Population structure analysis is a major area of interest within the field of genetics and bioinformatics (Alhusain and Hafez, 2018). In this sense, several bioinformatics methods have been developed to examine the population structure in genetically diverse plant germplasm based on high-throughput genomic data. Gao et al. (2007) proposed an extension to the STRUCTURE algorithm called InStruct, which eliminates the assumption of Hardy–Weinberg equilibrium within populations and takes inbreeding or selfing into account. This method applies Bayesian inference to simultaneously assign individuals to subpopulations but can be very time-consuming. The HC method forms mutually exclusive genetic groups: each cluster is distinct from the others, and the members of each cluster are similar with respect to the input information (Ward, 1963). Stift et al. (2019) found that ADMIXTURE and KM were computationally faster than STRUCTURE; however, ADMIXTURE had less power to detect structure than STRUCTURE and KM clustering.
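As an illustration of how HC produces mutually exclusive groups, the sketch below implements agglomerative clustering with Ward's minimum-variance criterion in pure Python. It is a simplified example under assumptions, not the study's pipeline: the function name `ward_clusters` and the toy 2-D points (standing in for dimension-reduced marker data) are hypothetical, and the naive pair search is only suitable for small inputs.

```python
from math import dist


def ward_clusters(points, n_clusters):
    """Agglomerative clustering using Ward's criterion.

    Starts with every point in its own cluster and repeatedly merges the
    pair whose merge increases the within-cluster sum of squares the least.
    Returns one integer label per point (each point gets exactly one label).
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], tuple(float(x) for x in p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                # Ward cost: increase in within-cluster sum of squares
                cost = na * nb / (na + nb) * dist(ca, cb) ** 2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        # Centroid of the merged cluster is the size-weighted mean
        merged = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters[a] = (ma + mb, merged)
        del clusters[b]  # b > a, so index a is unaffected
    labels = [0] * len(points)
    for lab, (members, _) in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


# Hypothetical toy data: two groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(ward_clusters(toy_points, n_clusters=2))
```

Because every point ends up with exactly one label, the resulting groups are mutually exclusive by construction. For real data sets, an optimized library implementation (for example, scikit-learn's `AgglomerativeClustering` with `linkage="ward"`) would be a more practical choice than this quadratic-per-merge search.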

