Improving peptide-MHC class I binding prediction for unbalanced datasets

Ana Paula Sales, Georgia D Tomaras, Thomas B Kepler

doi:10.1186/1471-2105-9-385

Open Access

https://doi.org/10.1186/1471-2105-9-385

Journal: BMC Bioinformatics
Publication Date: Sep 19, 2008
Citations: 34
License type: cc-by

Affiliation: Duke University

Abstract

Background

Establishment of peptide binding to Major Histocompatibility Complex class I (MHCI) is a crucial step in the development of subunit vaccines, and prediction of such binding could greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides. Many methods have been applied to the prediction of peptide-MHCI binding, with some achieving outstanding performance. Because of the experimental methods used to measure binding or affinity between peptides and MHCI molecules, however, available datasets are enriched for nonbinders, and thus highly unbalanced. Although there is no consensus on the ideal class distribution for training sets, extremely unbalanced datasets can be detrimental to the performance of prediction algorithms.

Results

We have developed a decision-theoretic framework to construct cost-sensitive trees to predict peptide-MHCI binding and have used them to 1) assess the impact of the training data's class distribution on classifier accuracy, and 2) compare resampling and cost-sensitive methods as approaches to compensate for training data imbalance. Our results confirm that highly unbalanced training sets can reduce the accuracy of classifier predictions and show that, in the peptide-MHCI binding context, resampling methods do not improve the classifier performance. In contrast, cost-sensitive methods significantly improve accuracy of decision trees. Finally, we propose the use of a training scheme that, when the training set is enriched for nonbinders, consistently improves the overall classifier accuracy compared to cost-insensitive classifiers and, in particular, increases the sensitivity of the classifiers. This method minimizes the expected classification cost for large datasets.

Conclusion

Our method consistently improves the performance of decision trees in predicting peptide-MHC class I binding by using cost-balancing techniques to compensate for the imbalance in the training dataset.

Highlights

Establishment of peptide binding to Major Histocompatibility Complex class I (MHCI) is a crucial step in the development of subunit vaccines, and prediction of such binding could greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides.
The ability to predict the binding between peptides and MHCI molecules would greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides, which can be used in the development of vaccines and therapies against neoplastic, infectious, and autoimmune diseases.
Our results suggest that, for a fixed training set size, decision trees perform best when trained with datasets of nearly balanced class distribution.

Summary

Introduction

Establishment of peptide binding to Major Histocompatibility Complex class I (MHCI) is a crucial step in the development of subunit vaccines. The ability to predict the binding between peptides and MHCI molecules would greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides, which can be used in the development of vaccines and therapies against neoplastic, infectious, and autoimmune diseases. To decide which and how many peptides will be tested in the laboratory, the investigator has to weigh the benefit of identifying binders against the cost of experimentally testing nonbinders. This type of concern is best addressed by decision-theoretic approaches. We formalize such an approach to training decision trees to differentiate binders from nonbinders and show how costs that reflect this experimental tradeoff can be incorporated into the training of classifiers to increase their utility.
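The tradeoff between finding binders and paying to test nonbinders can be illustrated with a generic minimum-expected-cost decision rule. The sketch below is not the training scheme of the paper; it only shows how unequal misclassification costs shift the label assigned to a group of peptides, such as a decision-tree leaf. The names `Costs`, `expected_costs`, `decide`, and `binder_threshold`, as well as all cost values, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical misclassification costs, in arbitrary units."""
    false_positive: float  # a nonbinder is predicted as a binder and tested in the lab
    false_negative: float  # a true binder is predicted as a nonbinder and missed


def expected_costs(p_binder: float, costs: Costs) -> tuple[float, float]:
    """Return (expected cost of predicting binder, expected cost of predicting nonbinder)."""
    cost_if_binder = (1.0 - p_binder) * costs.false_positive
    cost_if_nonbinder = p_binder * costs.false_negative
    return cost_if_binder, cost_if_nonbinder


def decide(p_binder: float, costs: Costs) -> str:
    """Pick the label with the lower expected cost (ties go to nonbinder)."""
    cost_if_binder, cost_if_nonbinder = expected_costs(p_binder, costs)
    return "binder" if cost_if_binder < cost_if_nonbinder else "nonbinder"


def binder_threshold(costs: Costs) -> float:
    """Probability above which predicting 'binder' has the lower expected cost."""
    return costs.false_positive / (costs.false_positive + costs.false_negative)


# Assumed example: a leaf holding 2 binders among 10 training peptides.
p = 2 / 10
equal = Costs(false_positive=1.0, false_negative=1.0)
skewed = Costs(false_positive=1.0, false_negative=5.0)  # missing a binder costs 5x more

print(decide(p, equal))    # majority vote: nonbinder
print(decide(p, skewed))   # cost-sensitive: binder
```

With equal costs the rule reduces to a majority vote, so a leaf dominated by nonbinders is always labeled nonbinder. Raising the cost of a missed binder lowers the threshold from 1/2 to 1/6 in this example, which is one way a classifier trained on nonbinder-enriched data can gain sensitivity.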
