Abstract

Background: Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein folding, and therefore advancing the annotation of protein functions.

Results: In this paper, we propose a novel ensemble of genetic algorithm classifiers (GaCs) to address the long-range contact prediction problem. Our method is based on a key idea called sequence profile centers (SPCs). Each SPC is the average sequence profile of the residue pairs belonging to the same class, either contact or non-contact. The GaCs are trained on multiple, different pairs of long-range contact data (positive data) and long-range non-contact data (negative data). The negative data sets, which have roughly the same sizes as the positive ones, are constructed by random sampling from the original imbalanced negative data. As a result, about 21.5% of long-range contacts are correctly predicted. We also found that the ensemble of GaCs improves accuracy by around 5.6% over a single GaC.

Conclusions: Classifiers that use sequence profile centers may advance long-range contact prediction. In line with this approach, key structural features in proteins would be determined with high efficiency and accuracy.
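The two data-handling ideas in the abstract, averaging profiles into a class center and drawing balanced negative subsets, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `profile_center` and `balanced_training_sets` are hypothetical, each residue pair is assumed to be already encoded as a fixed-length numeric profile vector, and the full positive set is assumed to be reused in every training pair. The genetic algorithm classifier itself is not shown.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def profile_center(profiles: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized profile vectors (one class center)."""
    if not profiles:
        raise ValueError("need at least one profile")
    dim = len(profiles[0])
    if any(len(p) != dim for p in profiles):
        raise ValueError("all profiles must have the same length")
    n = len(profiles)
    return [sum(p[i] for p in profiles) / n for i in range(dim)]


def balanced_training_sets(
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
    n_sets: int,
    seed: int = 0,
) -> List[Tuple[List[Vector], List[Vector]]]:
    """Build n_sets (positive, negative) pairs with equally sized classes.

    Assumption: every pair reuses all positives and draws a fresh random
    subset of the (much larger) negative pool without replacement.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return [
        (list(positives), rng.sample(list(negatives), k))
        for _ in range(n_sets)
    ]
```

Under these assumptions, each ensemble member would be trained on one of the returned pairs, and a center could be computed per class with `profile_center`.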

Highlights

  • Prediction of long-range inter-residue contacts is an important topic in bioinformatics research

  • We found that the ensemble of genetic algorithm classifiers (GaCs) improves accuracy by around 5.6% over a single GaC

  • Performance comparison based on CASP7 evaluation: the CASP7 evaluation procedure focuses on inter-residue contact predictions with linear sequence separation ≥ 12 and ≥ 24, respectively [24,25], whereas this work focuses only on long-range contact prediction with linear sequence separation ≥ 24, assessing the top L/5 predicted contacts, where L is the protein sequence length
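The top-L/5 assessment mentioned in the last highlight can be sketched as follows. This is an illustrative reading of the criterion, not the official CASP7 scoring code: the function name `top_l5_accuracy` is hypothetical, residue pairs are assumed to be stored as `(i, j)` index tuples with `i < j`, and accuracy is taken as the fraction of the selected predictions that are true contacts.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # residue indices (i, j), assumed i < j


def top_l5_accuracy(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    seq_len: int,
    min_sep: int = 24,
) -> float:
    """Fraction of true contacts among the top L/5 scored long-range pairs."""
    # Keep only pairs whose linear sequence separation is at least min_sep.
    candidates = [
        (pair, score)
        for pair, score in scores.items()
        if abs(pair[1] - pair[0]) >= min_sep
    ]
    # Highest-scoring predictions first.
    candidates.sort(key=lambda item: item[1], reverse=True)
    k = max(1, seq_len // 5)
    top = candidates[:k]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if pair in true_contacts)
    return hits / len(top)
```

If fewer than L/5 eligible pairs are scored, this sketch divides by the number actually selected; a stricter evaluation might divide by L/5 instead.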


Summary

Introduction

Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein folding, and advancing the annotation of protein functions. Resolving protein structures by experimental techniques, such as X-ray crystallography and nuclear magnetic resonance (NMR), is often costly and slow. This is why more than ten million proteins have been sequenced, while only 62,000 protein structures are stored in the PDB. Previous results indicate that 50% correctly predicted contacts ought to suffice for structure reconstruction [5], at least for proteins with fewer than 150 amino acids and with an 8 Å distance cutoff.
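To make the contact definition concrete, the sketch below labels long-range contacts from three-dimensional coordinates using the 8 Å cutoff and the sequence separation of at least 24 mentioned on this page. Several details are assumptions rather than statements from the source: one representative atom per residue (for example Cα or Cβ), a strict "less than" comparison against the cutoff, and the hypothetical function name `long_range_contacts`.

```python
import math
from typing import Sequence, Set, Tuple

Coord = Tuple[float, float, float]


def long_range_contacts(
    coords: Sequence[Coord],
    cutoff: float = 8.0,
    min_sep: int = 24,
) -> Set[Tuple[int, int]]:
    """Return pairs (i, j), i < j, that are close in space but far in sequence.

    coords holds one representative atom position per residue, in angstroms
    (which atom to use is an assumption, not taken from the source).
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        # Only consider partners at least min_sep positions along the chain.
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts
```

The quadratic loop is adequate for proteins of a few hundred residues; a spatial index would be a reasonable choice for much longer chains.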
