Abstract

Germplasm banks are growing in importance, in number of accessions and in the amount of characterization data they hold, with a strong emphasis on molecular genetic markers. In this work, we offer an integrated view of accessions and marker data within an information theory framework. The basis of this development is the mutual information between accessions and allele frequencies at molecular marker loci, which can be decomposed into allele specificities, as well as into the rarity and divergence of accessions. Formulas are provided to calculate the specificity of each marker allele with reference to its distribution across accessions; accession rarity, defined as the weighted average of the specificity of its alleles; and accession divergence, defined by the Kullback-Leibler formula. Although rarity and divergence are different measures, we demonstrate that their averages are equal for any collection. These parameters can contribute to knowledge of the structure of a germplasm collection and to decisions about the preservation of rare variants. The concepts developed here served as the basis for a core subset selection strategy called HCore, implemented in a publicly available R script. As a proof of concept, the mathematical view and tools developed in this research were applied to a large collection of Mexican wheat accessions, extensively characterized with SNP markers. The most specific alleles were found to be private to a single accession, and the distribution of specificity had its highest frequencies at low values. Accession rarity and divergence had largely symmetrical distributions and a positive, although not strictly linear, relationship. In a comparison with three state-of-the-art methods for core subset selection, HCore was superior in average divergence and rarity, mean genetic distance and diversity.
The proposed approach can be used for knowledge extraction and decision making in germplasm collections of diploid, inbred or outbred species.
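
The decomposition described in the abstract can be illustrated with a minimal sketch for a single marker locus. The abstract does not give the exact formulas, so the expressions below are assumptions chosen to match its verbal definitions: divergence is the Kullback-Leibler divergence of an accession's allele frequencies from the collection average, allele specificity is a weighted information term across accessions, and rarity is the frequency-weighted average specificity of an accession's alleles. The function name `information_measures`, the uniform default weights and the toy frequencies are hypothetical, and this is not the HCore R script.

```python
import numpy as np

def information_measures(freqs, weights=None):
    """Assumed information measures for one marker locus.

    freqs: array (n_accessions, n_alleles); each row holds the allele
    frequencies of one accession and sums to 1.
    weights: accession weights summing to 1 (assumed uniform if omitted).
    """
    p = np.asarray(freqs, dtype=float)
    n_acc = p.shape[0]
    if weights is None:
        w = np.full(n_acc, 1.0 / n_acc)
    else:
        w = np.asarray(weights, dtype=float)

    # Average allele frequencies over the whole collection.
    p_bar = w @ p

    # Ratio of each accession's frequency to the collection average;
    # zero frequencies contribute zero to every sum below.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_bar > 0, p / p_bar, 0.0)
        log_ratio = np.where(ratio > 0, np.log2(ratio), 0.0)

    # Allele specificity (assumed form): one value per allele, in bits.
    specificity = (w[:, None] * ratio * log_ratio).sum(axis=0)
    # Accession divergence: Kullback-Leibler divergence from the average.
    divergence = (p * log_ratio).sum(axis=1)
    # Accession rarity: frequency-weighted average specificity of its alleles.
    rarity = p @ specificity
    return specificity, rarity, divergence, w

# Toy collection: four accessions, one biallelic SNP (hypothetical data).
toy_freqs = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
specificity, rarity, divergence, w = information_measures(toy_freqs)

# Both weighted averages recover the mutual information of the locus.
mean_rarity = float(w @ rarity)
mean_divergence = float(w @ divergence)
```

Under these assumed formulas, `mean_rarity` and `mean_divergence` coincide for any input, which mirrors the equality stated in the abstract, while the per-accession values can differ when an accession carries more than one allele. An allele private to a single accession reaches the maximum specificity, log2 of the number of accessions under uniform weights.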

Highlights

  • Germplasm banks worldwide contain collections, mainly of cultivated plants and their relatives, to preserve and make available to plant breeders, researchers and general users, their reservoirs of genetic diversity

  • We provide a novel view of a collection in a germplasm bank tied to marker data

  • It can be considered a digital view, in the sense that genomes are coded in binary form through molecular markers, providing the elements for an informational landscape in which allele specificities are defined in terms of their information about the identities of the accessions, and accession rarities are defined by the average specificity of their alleles


Summary

Introduction

Germplasm banks worldwide contain collections, mainly of cultivated plants and their relatives, to preserve their reservoirs of genetic diversity and make them available to plant breeders, researchers and general users. Rare variants may be present or even fixed in certain populations due to genetic drift or their relationship with fitness in specific environments. Such uniqueness makes them prone to be absent from whole collections and their subsets, despite their potential importance as a source of valuable traits for plant breeding. To the best of our knowledge, only one definition of the rarity of an accession based on marker data has been proposed, in the context of an application to maize SSR data [9]. It is basically the squared Euclidean distance between the array of allele frequencies in a given accession and the average frequencies for the whole collection. We use an information theory approach to define the specificity of alleles and the rarity and divergence of accessions, based on information from polymorphic DNA markers.
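
The earlier distance-based definition of rarity mentioned above can be sketched in a few lines. This is only an illustration of a squared Euclidean distance to the collection average; the unweighted mean across accessions, the function name `euclidean_rarity` and the toy frequencies are assumptions, and the exact formulation in [9] may differ.

```python
import numpy as np

def euclidean_rarity(freqs):
    """Squared Euclidean distance of each accession to the collection mean.

    freqs: array (n_accessions, n_alleles) of allele frequencies, with the
    alleles of all loci laid out side by side in the columns.
    """
    p = np.asarray(freqs, dtype=float)
    # Average frequency of each allele over all accessions (unweighted, assumed).
    mean_freqs = p.mean(axis=0)
    # Sum of squared deviations from the average, one value per accession.
    return ((p - mean_freqs) ** 2).sum(axis=1)

# Hypothetical example: three inbred accessions, one biallelic marker.
distances = euclidean_rarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy case the third accession, the only carrier of the second allele, lies farthest from the collection average.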

