Abstract

Background: The incomplete ground truth of training data for B-cell epitopes is a pressing issue in computational epitope prediction. Only a small fraction of the surface residues of an antigen are confirmed as antigenic residues (positive training data); the remaining residues are unlabeled. Because some of these unlabeled residues may group together to form novel but currently unknown epitopes, it is misguided to uniformly classify all unlabeled residues as negative training data, as the traditional supervised learning scheme does.

Results: We propose a positive-unlabeled learning algorithm to address this problem. The key idea is to distinguish between epitope-likely residues and reliable negative residues in the unlabeled data. The method has two steps: (1) identify reliable negative residues using a weighted SVM with a high recall; and (2) construct a classification model on the positive residues and the reliable negative residues. Complex-based 10-fold cross-validation showed that this method outperforms the commonly used predictors DiscoTope 2.0, ElliPro and SEPPA 2.0 in every aspect. We conducted four case studies, in which the approach was tested on antigens of West Nile virus, dihydrofolate reductase, beta-lactamase, and two Ebola antigens whose epitopes are currently unknown. All results were assessed on a newly established data set of antigen structures not bound by antibodies, rather than on antibody-bound antigen structures. Bound structures may carry binding information, such as bound-state B-factors and protrusion index, that could unfairly inflate epitope prediction performance. Source code is available on request.
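The two-step scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear SVM trained by batch subgradient descent on a class-weighted hinge loss, two made-up numeric features per residue, and hypothetical names and settings (`train_weighted_svm`, `two_step_pu`, `pos_weight`, `threshold`). Weighting the positive class more heavily in step 1 is one simple way to favor high recall on the known epitope residues.

```python
def score(w, b, x):
    """Signed decision value of a linear model for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_weighted_svm(X, y, pos_weight=1.0, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on a class-weighted hinge loss (linear SVM)."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * score(w, b, xi) < 1:  # margin violated: hinge term is active
                c = pos_weight if yi > 0 else 1.0
                for j in range(dim):
                    grad_w[j] -= c * yi * xi[j] / n
                grad_b -= c * yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def two_step_pu(positives, unlabeled, pos_weight=5.0, threshold=0.0):
    """Two-step positive-unlabeled learning (illustrative assumptions only)."""
    # Step 1: provisionally label all unlabeled residues as negative, but
    # up-weight the positives so that known positives are rarely missed.
    X = positives + unlabeled
    y = [1] * len(positives) + [-1] * len(unlabeled)
    w1, b1 = train_weighted_svm(X, y, pos_weight=pos_weight)
    # Unlabeled residues that still score below the threshold are kept
    # as reliable negatives; the rest are treated as epitope-likely.
    reliable_idx = [i for i, x in enumerate(unlabeled)
                    if score(w1, b1, x) < threshold]
    # Step 2: train the final model on positives vs. reliable negatives only.
    reliable_neg = [unlabeled[i] for i in reliable_idx]
    X2 = positives + reliable_neg
    y2 = [1] * len(positives) + [-1] * len(reliable_neg)
    return reliable_idx, train_weighted_svm(X2, y2)


# Toy data (made up): the first two unlabeled points resemble the positives.
positives = [[2.0, 2.0], [2.5, 1.5], [1.5, 2.5], [2.2, 2.4]]
unlabeled = [[2.1, 1.9], [1.8, 2.3],
             [-2.0, -2.0], [-2.5, -1.5], [-1.5, -2.5],
             [-2.2, -1.8], [-1.9, -2.4], [-2.6, -2.1]]

reliable_idx, (w, b) = two_step_pu(positives, unlabeled)
print("reliable negative indices:", reliable_idx)
print("score of unlabeled[0]:", score(w, b, unlabeled[0]))
```

In this toy run the two positive-like unlabeled points are kept out of the reliable negative set, so the final model is free to score them as epitope-likely instead of being trained to reject them.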

Highlights

  • We show that the PUPre method performs better than commonly used conformational B-cell epitope predictors, such as DiscoTope 2.0, ElliPro and SEPPA 2.0

  • Compared with the three structure-based epitope predictors DiscoTope 2.0, ElliPro and SEPPA 2.0, the PUPre classifier gives better prediction results in every aspect

Summary

Introduction

The incomplete ground truth of training data of B-cell epitopes is a demanding issue in computational epitope prediction. Previous methods explored essential characteristics of epitopes and identified useful individual features, including hydrophobicity [4,5], flexibility [6], secondary structure [7], protrusion index (PI) [8], accessible surface area (ASA), relative accessible surface area (RSA) and B-factor [9,10]. However, none of these single characteristics is sufficient to locate B-cell epitopes accurately. Many epitope predictors have therefore used machine learning techniques, such as Naive Bayesian learning [15] and random forest classification [10,16].
