Abstract

Background

In crop production systems, genetic markers are increasingly used to distinguish individuals within a larger population based on their genetic make-up. Supervised approaches cannot be applied directly to genotyping data because such data are neither continuous, nor nominal, nor ordinal, but only partially ordered. Therefore, a strategy is needed to encode the polymorphism between samples such that known supervised approaches can be applied. Moreover, finding a minimal set of molecular markers that have optimal ability to discriminate, for example, between given groups of varieties, is important because the genotyping process can be costly in terms of laboratory consumables, labor, and time. This feature selection problem also needs special care due to the specific nature of the data used.

Results

An approach encoding simple sequence repeat (SSR) polymorphisms in a positive definite kernel is presented, which then allows the use of any kernel-based supervised method. The polymorphism between the samples is encoded through the Nei-Li genetic distance, which is shown to define a positive definite kernel between the genotyped samples. Additionally, a greedy feature selection algorithm for selecting SSR marker kits is presented to build economical and efficient prediction models for discrimination. The algorithm is a filter method and outperforms other filter methods adapted to this setting. When combined with kernel linear discriminant analysis or kernel principal component analysis followed by linear discriminant analysis, the approach leads to very satisfactory prediction models.

Conclusions

The main advantage of the approach is that it offers a flexible way to encode polymorphisms in a kernel. Combined with a feature selection algorithm that yields a few specific markers, it leads to accurate and economical identification models based on SSR genotyping.
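To make the kernel idea concrete, the following is a minimal sketch of how a Nei-Li similarity matrix could be computed from genotypes represented as sets of alleles. It assumes the standard Nei-Li (Dice) band-sharing coefficient, 2|A∩B| / (|A| + |B|), and uses the similarity (one minus the distance) as the kernel value; whether the paper builds its kernel in exactly this way is not stated in this summary. The function names and the toy samples are hypothetical and chosen only for illustration.

```python
import numpy as np

def nei_li_similarity(alleles_a, alleles_b):
    """Nei-Li (Dice) similarity between two sets of observed alleles."""
    a, b = set(alleles_a), set(alleles_b)
    if not a and not b:
        # Assumption: two empty profiles are treated as identical.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def nei_li_kernel(genotypes):
    """Symmetric matrix of pairwise Nei-Li similarities."""
    n = len(genotypes)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = nei_li_similarity(genotypes[i], genotypes[j])
    return K

# Hypothetical toy data: each sample is a set of (marker, allele size) pairs.
samples = [
    {("m1", 150), ("m1", 154), ("m2", 200)},
    {("m1", 150), ("m2", 200), ("m2", 204)},
    {("m1", 158), ("m2", 204)},
]

K = nei_li_kernel(samples)   # similarity (kernel) matrix
D = 1.0 - K                  # corresponding Nei-Li distances

# Empirical sanity check: a positive definite kernel yields a Gram matrix
# with no negative eigenvalues (up to numerical tolerance).
min_eigenvalue = np.linalg.eigvalsh(K).min()
```

A matrix such as `K` can then be passed to any kernel method that accepts a precomputed Gram matrix. Restricting each sample's allele set to a subset of markers before calling `nei_li_kernel` is one simple way to evaluate a candidate marker kit.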

Highlights

  • In crop production systems, genetic markers are increasingly used to distinguish individuals within a larger population based on their genetic make-up

  • Four datasets were developed: a) tobType: a set of 91 varieties was genotyped on 186 simple sequence repeat (SSR) markers without replicates, which led to 91 observations

  • The results shown in the tables are the best 10-fold cross-validated results over the parameters of each method; the classification error rates for the different kit sizes, when combined with kernel linear discriminant analysis (KLDA) or kernel principal component analysis followed by linear discriminant analysis (KPCLDA), are shown in Table 2 and Table 3


Summary

Introduction

Genetic markers are increasingly used to distinguish individuals within a larger population based on their genetic make-up. Finding a minimal set of molecular markers that have optimal ability to discriminate, for example, between given groups of varieties, is important because the genotyping process can be costly in terms of laboratory consumables, labor, and time. This feature selection problem needs special care due to the specific nature of the data used. Genetic markers are target sites in the genome that differ between individuals of a population. These differences can occur in DNA that codes for specific genes or, more usually, in the vast areas of intergenic DNA. Such differences in the genetic content at a specific site in the genome are often referred to as polymorphisms (literally “multiple forms”). The set of alleles that has been collected for a given individual (often representing a single sample in the study) is referred to as the “genotype” of that individual.
