Abstract

Next Generation Sequencing is dramatically increasing the number of known protein sequences, while experimentally determined protein structures lag behind. Structural bioinformatics attempts to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, and most of these methods rely heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but for exactly those proteins significantly more homologs are available than for less studied sequences randomly extracted from UniProt. Structural bioinformatics methods developed this way are thus likely to have overestimated performance; we demonstrate this for two contact prediction methods, whose performance drops by up to 60% when a more realistic amount of evolutionary information is taken into account. We provide a bias-free dataset for the validation of contact prediction methods, called NOUMENON.

Highlights

  • All bioinformatics tools developed to address protein structure-related tasks share one crucial characteristic: they need a validation procedure based on experimentally determined data to evaluate their performance

  • Many structural bioinformatics methods that predict structural characteristics from protein sequence validate their performance on known protein structures and use evolutionary information to boost prediction performance

  • This represents an observational selection bias that inflates prediction performance, because more homologs are available for exactly those proteins that constitute the validation sets: the evolutionary information available for the validation proteins differs from that available in the real-case applications for which bioinformatics methods are intended to provide useful annotations

Summary

Introduction

Our findings question the de facto applicability of structural bioinformatics tools that fit the two criteria (validation on known protein structures and reliance on evolutionary information) to real cases, i.e. structurally undetermined proteins with a representative set of homologs, and call for a more careful evaluation of their performance. This is essential to understand the reliability of the results and to avoid long-term negative effects on structural bioinformatics research: the need to boost the performance of a tool in order to achieve a publication could lead to a positive selection of methods that take advantage of information that is not available in real-case applications.

