CRYSTALP2: sequence-based protein crystallization propensity prediction

Lukasz Kurgan, Marcin Mizianty, Ali A Razib, Samad Jahandideh, Sara Aghakhani, Scott Dick

doi:10.1186/1472-6807-9-50

Open Access

Abstract

Background

Current protocols yield crystals for <30% of known proteins, indicating that automatically identifying crystallizable proteins may improve high-throughput structural genomics efforts. We introduce CRYSTALP2, a kernel-based method that predicts the propensity of a given protein sequence to produce diffraction-quality crystals. This method utilizes the composition and collocation of amino acids, isoelectric point, and hydrophobicity, as estimated from the primary sequence, to generate predictions. CRYSTALP2 extends its predecessor, CRYSTALP, by enabling predictions for sequences of unrestricted size and provides improved prediction quality.

Results

A significant majority of the collocations used by CRYSTALP2 include residues with high conformational entropy, or low entropy and high potential to mediate crystal contacts; notably, such residues are utilized by surface entropy reduction methods. We show that the collocations provide complementary information to the hydrophobicity and isoelectric point. Tests on four datasets show that CRYSTALP2 outperforms several existing sequence-based predictors (CRYSTALP, OB-score, and SECRET). CRYSTALP2's accuracy, MCC, and AROC range between 69.3 and 77.5%, 0.39 and 0.55, and 0.72 and 0.79, respectively. Our predictions are similar in quality and are complementary to the predictions of the most recent ParCrys and XtalPred methods. Our results also suggest that, as work in protein crystallization continues (thereby enlarging the population of proteins with known crystallization propensities), the prediction quality of the CRYSTALP2 method should increase. The prediction model and the datasets used in this contribution can be downloaded from .

Conclusion

CRYSTALP2 provides relatively accurate crystallization propensity predictions for a given protein chain that either outperform or complement the existing approaches. The proposed method can be used to support current efforts towards improving the success rate in obtaining diffraction-quality crystals.

Highlights

Current protocols yield crystals for <30% of known proteins.
Comparison with competing methods. The CRYSTALP2 method was compared with the SECRET, CRYSTALP, OB-Score, ParCrys, and XtalPred methods using two tests: a cross-validation test on the D418 dataset, and a test in which the model was trained on the FEAT dataset and tested on the TEST, TEST-RL, and TEST-NEW datasets.
The ROC curve represents the relationship between the true positive (TP) rate and the false positive (FP) rate; it is generated by applying a threshold to the confidence scores produced by a predictor and varying the threshold value.
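
The threshold sweep behind a ROC curve, and the area under it (the AROC figure reported in the abstract), can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names `roc_points` and `area_under_curve` are hypothetical, labels are assumed to be 1 for crystallizable and 0 for non-crystallizable chains, and a chain is assumed to be predicted positive when its confidence score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs obtained by sweeping a threshold.

    Assumption: labels are 1 (positive) or 0 (negative), and a sample is
    predicted positive when its score is >= the current threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Start at (0, 0): a threshold above every score predicts no positives.
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For example, `area_under_curve(roc_points(scores, labels))` gives a single summary value for a list of confidence scores and their true labels; a value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.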