Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets

Gerard Jp Van Westen, Herman Wt Van Vlijmen, Remco F Swier, Isidro Cortes-Ciriano, John P Overington, Jörg K Wegner, Andreas Bender, Adriaan P Ijzerman

doi:10.1186/1758-2946-5-42

Abstract

Background: While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants); in addition, we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks, which comprise Angiotensin Converting Enzyme (ACE) dipeptidic inhibitor data and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants.

Results: The amino acid descriptor sets compared here show similar performance (<0.1 log units RMSE difference and <0.1 difference in MCC), while errors for individual proteins were in some cases found to be larger than those resulting from descriptor set differences (>0.3 log units RMSE difference and >0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than utilizing individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), or Z-scales (3) combined with an average Z-scale value for each target, while ProtFP (PCA8), ST-scales, and ProtFP (Feature) rank last.

Conclusions: While amino acid descriptor sets capture different aspects of amino acids, their ability to be used for bioactivity modeling is still, on average, surprisingly similar. Still, combining sets describing complementary information leads to a small but consistent improvement in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). Finally, performance differences exist between the targets compared, thereby underlining that choosing an appropriate descriptor set is of fundamental importance for bioactivity modeling, both from the ligand side and from the protein side.

Highlights

While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking.
70–30 validation on G protein-coupled receptor (GPCR) ligands: In a similar spirit to the validation on Angiotensin Converting Enzyme (ACE) inhibitors, a 70–30 validation was performed on the GPCR set.
In this case a classification model was employed and performance was expressed as mean sensitivity and mean Matthews correlation coefficient (MCC) for all descriptor sets in the study [32].
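The metrics named in the highlight above can be illustrated with a minimal sketch. It assumes binary active/inactive labels encoded as 1/0 and a simple random 70–30 split; the function names (`split_70_30`, `confusion_counts`, `sensitivity`, `mcc`) and the toy labels are hypothetical and are not taken from the paper, which may have split and averaged its data differently.

```python
import math
import random


def split_70_30(items, seed=0):
    """Shuffle a copy of `items` and return (training 70%, test 30%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 1/0 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of true actives that were predicted active."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels for illustration only (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), mcc(tp, tn, fp, fn))
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside sensitivity when the active and inactive classes are imbalanced.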

Summary

Introduction

While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Proteochemometric modeling (PCM) is similar to Quantitative Structure-Activity Relationship (QSAR) modeling but expands on its ligand-only nature in that it takes both ligand and target space into account when generating bioactivity models. This enables PCM to explain bioactivity based on chemical properties (features of the ligand) in combination with particular protein properties (features of the target). PCM models are able to extrapolate in both the chemical (ligand) and the biological (target) domain (within the limitations of the data and the models constructed), as shown in previous work [5,6,7]. For a further rationale of the current work, the reader is referred to the companion paper [21].
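The idea of describing each ligand-target pair by both ligand features and protein features can be sketched as follows. This is only an illustration under stated assumptions: the three-component descriptor table holds placeholder numbers (not published Z-scale or ProtFP values), the residue selection and the ligand descriptor vector are invented, and simple concatenation is one common way to build a PCM input rather than necessarily the exact procedure used in the paper.

```python
# Hypothetical 3-component amino acid descriptors.
# Placeholder values for illustration; NOT published Z-scales.
AA_DESCRIPTORS = {
    "A": (0.1, -1.0, 0.2),
    "D": (2.0, 0.5, -0.3),
    "G": (1.0, -2.0, 0.4),
    "L": (-2.5, 0.3, -0.1),
}


def describe_residues(residues, table):
    """Concatenate the per-residue descriptors of selected target residues."""
    features = []
    for aa in residues:
        features.extend(table[aa])
    return features


def pcm_feature_vector(ligand_descriptors, residues, table):
    """Join ligand features and target features into one model input row."""
    return list(ligand_descriptors) + describe_residues(residues, table)


# One ligand (two made-up descriptor values) paired with two targets that
# differ at a single selected residue position.
ligand = [0.5, 1.5]
row_wild_type = pcm_feature_vector(ligand, "ADL", AA_DESCRIPTORS)
row_mutant = pcm_feature_vector(ligand, "AGL", AA_DESCRIPTORS)
print(len(row_wild_type), row_wild_type != row_mutant)
```

Because the same ligand yields different rows for different targets, a single model trained on such rows can, within the limits of its data, relate activity differences to changes on the protein side as well as on the ligand side.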



Journal: Journal of Cheminformatics
Publication Date: Sep 24, 2013
Citations: 113
License type: CC BY 2.0
