Abstract

Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and to reduce computation time and resources. In genomics, FS makes it possible to identify relevant markers and to design low-density SNP chips for evaluating selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the combination of feature selector, SNP subset size, and learner leading to the most accurate and stable (i.e., least sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods, either univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods, namely elastic net and least absolute shrinkage and selection operator (LASSO) regression; and (iii) combinations of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. The data comprised 5,708 individual records of residual feed intake to be predicted from the animal’s own genotype. Accuracy was measured as the median, and stability of results as the interquartile range, of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best predictions in terms of accuracy and stability were obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000–1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with random selection. With 50–250 SNPs, the FS method had a strong impact on prediction quality: it was very poor when tree-based filters were combined with any learner, but good, and similar to what was obtained with larger SNP subsets, when spearcor or mrmr were implemented with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
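To make the evaluation protocol concrete, the sketch below (a minimal illustration assuming a Python/scikit-learn environment, not the authors’ implementation) pairs a Spearman-correlation filter such as spearcor with an SVM learner and scores each of the 10 cross-validation folds by the Spearman correlation between observed and predicted phenotypes; the median of the fold-wise correlations measures accuracy and their interquartile range measures stability. The SNP subset size and SVM hyper-parameters shown are illustrative assumptions.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.model_selection import KFold
    from sklearn.svm import SVR

    def spearcor_filter(X, y, n_snps):
        # Rank SNPs by absolute Spearman correlation with the phenotype (a spearcor-style filter)
        scores = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(X.shape[1])])
        return np.argsort(scores)[::-1][:n_snps]

    def cv_spearman(X, y, n_snps=1000, n_folds=10, seed=1):
        # Filter and learner are refitted on each training fold to avoid information leakage
        fold_cor = []
        for train, test in KFold(n_splits=n_folds, shuffle=True, random_state=seed).split(X):
            snps = spearcor_filter(X[train], y[train], n_snps)
            model = SVR(kernel="rbf", C=1.0).fit(X[train][:, snps], y[train])
            pred = model.predict(X[test][:, snps])
            fold_cor.append(spearmanr(y[test], pred)[0])
        q1, med, q3 = np.percentile(fold_cor, [25, 50, 75])
        return med, q3 - q1  # accuracy (median) and stability (interquartile range)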

Highlights

  • Statistical models and methods used for predicting phenotypes or breeding values of selection candidates have an impact on the efficiency of genomic selection (GS)

  • The objective of our research was to explore the influence of various combinations of FS methods and learners on prediction quality and stability of models for predicting residual feed intake (RFI) from SNP genotypes, in order to find the best strategy for genetic evaluation of growing pigs at reduced genotyping cost

  • For each FS method and learner, the 10-fold cross-validation yields 10 subsets of selected features and 10 prediction performances obtained with those subsets, which allows measuring both the stability of the FS method and the dispersion of prediction accuracy (see the sketch following this list)
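One simple way to compute these two quantities is sketched below; the Jaccard-based stability measure is one possible choice for comparing the 10 selected subsets and is not necessarily the one used in the study.

    import numpy as np
    from itertools import combinations

    def fs_stability(selected_subsets):
        # Mean pairwise Jaccard similarity of the SNP subsets selected in the 10 folds
        sims = [len(set(a) & set(b)) / len(set(a) | set(b))
                for a, b in combinations(selected_subsets, 2)]
        return float(np.mean(sims))

    def accuracy_dispersion(fold_accuracies):
        # Median and interquartile range of the 10 fold-wise prediction accuracies
        q1, med, q3 = np.percentile(fold_accuracies, [25, 50, 75])
        return med, q3 - q1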


Introduction

Statistical models and methods used for predicting phenotypes or breeding values of selection candidates have an impact on the efficiency of genomic selection (GS). Machine learning (ML) methods are appealing for genomic prediction; they encompass a wide variety of techniques and models to predict outputs or to identify patterns in large datasets, and they do not require assumptions about the genetic determinism underlying the trait. When features have a high level of redundancy, different training samples can produce different feature ranks (and different models when a subset of features is selected) with the same prediction accuracy. The aim is therefore to achieve good prediction performance on independent data sets together with a stable set of predictors, stability being understood as low sensitivity to changes in the training set.
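A toy illustration of this instability, using hypothetical simulated data: two nearly identical markers carry the same signal, so their ranking by a univariate filter can flip between resampled training sets even though either marker predicts the trait equally well.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    n = 500
    signal = rng.normal(size=n)
    X = np.column_stack([signal + 0.01 * rng.normal(size=n),  # marker 0, redundant with marker 1
                         signal + 0.01 * rng.normal(size=n),  # marker 1, redundant with marker 0
                         rng.normal(size=n)])                  # marker 2, irrelevant
    y = signal + 0.5 * rng.normal(size=n)

    for _ in range(5):
        idx = rng.choice(n, size=n, replace=True)  # a different bootstrap training sample each time
        scores = [abs(spearmanr(X[idx, j], y[idx])[0]) for j in range(X.shape[1])]
        # The order of the two redundant markers may change from sample to sample,
        # while their correlations with the trait stay essentially the same.
        print(np.argsort(scores)[::-1], np.round(scores, 3))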
