Abstract

Background

It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set.

Results

We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules for how large and diverse datasets need to be to provide generalizable performance estimates.

Conclusion

It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. Here we identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions, and are thus less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark datasets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-241) contains supplementary material, which is available to authorized users.
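The Background above contrasts two ways of estimating prediction performance: cross-validation, which iteratively splits one dataset into training and testing folds, and evaluation on a blind set collected after the model was built. The sketch below is a minimal illustration of that contrast, not the authors' pipeline; the one-hot peptide encoding, the random-forest regressor, and the 500 nM binder cutoff (a commonly used MHC class I threshold) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two evaluation schemes:
# k-fold cross-validation on a training set vs. evaluation on a later,
# independently collected blind set. Encoding, model, and threshold are
# illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

BINDER_IC50_NM = 500.0  # commonly used binder cutoff for MHC class I
AA = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptides):
    """Toy one-hot encoding of 9-mer peptides; real predictors use richer features."""
    out = np.zeros((len(peptides), 9 * len(AA)))
    for r, pep in enumerate(peptides):
        for p, a in enumerate(pep):
            out[r, p * len(AA) + AA.index(a)] = 1.0
    return out

def cross_validated_auc(peptides, ic50, n_splits=5):
    """Average AROC over k folds; binders are defined by the IC50 cutoff."""
    X, y_cont = encode(peptides), np.log10(ic50)
    y_bin = (np.asarray(ic50) <= BINDER_IC50_NM).astype(int)
    aucs = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[train], y_cont[train])
        scores = -model.predict(X[test])  # lower predicted IC50 = stronger binder
        aucs.append(roc_auc_score(y_bin[test], scores))  # assumes both classes per fold
    return float(np.mean(aucs))

def blind_set_auc(train_peptides, train_ic50, blind_peptides, blind_ic50):
    """Train once on the older data, then score the newer blind set."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(encode(train_peptides), np.log10(train_ic50))
    scores = -model.predict(encode(blind_peptides))
    y_bin = (np.asarray(blind_ic50) <= BINDER_IC50_NM).astype(int)
    return roc_auc_score(y_bin, scores)
```

The study's central observation is that the first of these two numbers tends to exceed the second, particularly when either dataset is small or low in sequence/affinity diversity.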

Highlights

  • It is important to accurately determine the performance of peptide:Major Histocompatibility Complex (MHC) binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate

  • The binding data came from the Immune Epitope Database (IEDB) [8], as well as some data from submissions currently in process from the Buus and Sette labs

  • BD2009 and BD2013 are data sets prepared in years 2009 and 2013, respectively, for re-training of the predictive tools hosted on the Immune Epitope Database Analysis Resource (IEDB-AR) [11]


Summary

Introduction

It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. We have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set. Accompanying the growth of the binding data, many MHC class I peptide binding predictors have been reported to date. To compare their predictive performances, a number of large-scale benchmarking studies have been carried out. In the case of MHC-I predictors, high predictive performances with average Areas under Receiver Operating Characteristic curves (AROCs) of ~0.9 from cross-validations have been reported [9,10], suggesting that the predictive methods have matured.
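As a concrete reading of the AROC metric cited above, the toy example below labels peptides as binders at an assumed 500 nM measured IC50 cutoff and asks how well predicted affinities rank binders above non-binders; the numbers are invented purely for illustration.

```python
# Toy AROC calculation: binders defined at an assumed 500 nM IC50 cutoff;
# stronger predicted binders must rank higher for a good score.
import numpy as np
from sklearn.metrics import roc_auc_score

measured_ic50 = np.array([12.0, 85.0, 450.0, 900.0, 5200.0, 30000.0])   # nM, invented
predicted_ic50 = np.array([20.0, 60.0, 700.0, 400.0, 8000.0, 15000.0])  # nM, invented

is_binder = (measured_ic50 <= 500.0).astype(int)            # 1 = binder, 0 = non-binder
aroc = roc_auc_score(is_binder, -np.log10(predicted_ic50))  # higher score = stronger binder
print(f"AROC = {aroc:.2f}")  # 1.0 = perfect ranking, 0.5 = random; here ~0.89
```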

