Improved Chemical Structure-Activity Modeling Through Data Augmentation.

Isidro Cortes-Ciriano, Andreas Bender

doi:10.1021/acs.jcim.5b00570

Abstract

Extending the original training data with simulated unobserved data points has proven a powerful way to increase both the generalization ability of predictive models and their robustness against changes in the structure of the data (e.g., systematic drifts in the response variable) in diverse areas such as the analysis of spectroscopic data or the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replications were perturbed with Gaussian noise (μ = 0, σ = σnoise) on either (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) none of them. The effect of data augmentation was evaluated across three different algorithms (RF, GBM, and SVM radial) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation consistently increased predictive power on the test set by 10-15%. Injecting noise on (i) the compound descriptors or on (ii) both the compound descriptors and the pIC50 values led to the largest drop in RMSEtest values (from 0.67-0.72 to 0.60-0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation is reached when the training data are replicated once. Therefore, extending the original training data with one perturbed repetition thereof represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely an increase in (i) model complexity, due to the need to optimize σnoise, and (ii) the number of training examples.
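The replication-and-perturbation scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name augment, its parameters, and the use of a single sigma_noise for both descriptors and pIC50 values are assumptions made for illustration, and the abstract does not say how noise was scaled for each descriptor type.

```python
import random


def augment(X, y, n_copies, sigma_noise, noise_on="both", seed=0):
    """Return the original training set followed by n_copies perturbed replicas.

    X is a list of descriptor vectors (lists of floats) and y a list of
    pIC50 values. noise_on selects what receives zero-mean Gaussian noise:
    "response", "descriptors", "both", or "none".
    All names here are hypothetical; they are not taken from the paper.
    """
    if noise_on not in {"response", "descriptors", "both", "none"}:
        raise ValueError("unknown noise_on value: %r" % noise_on)

    rng = random.Random(seed)  # seeded so the sketch is reproducible

    # Start from an unmodified copy of the original training data.
    X_aug = [list(row) for row in X]
    y_aug = list(y)

    for _ in range(n_copies):
        for row, target in zip(X, y):
            if noise_on in ("descriptors", "both"):
                new_row = [v + rng.gauss(0.0, sigma_noise) for v in row]
            else:
                new_row = list(row)
            if noise_on in ("response", "both"):
                new_target = target + rng.gauss(0.0, sigma_noise)
            else:
                new_target = target
            X_aug.append(new_row)
            y_aug.append(new_target)

    return X_aug, y_aug


if __name__ == "__main__":
    # Toy data: two compounds with two descriptors each (made-up values).
    X_train = [[1.0, 2.0], [3.0, 4.0]]
    y_train = [5.0, 6.0]
    X_new, y_new = augment(X_train, y_train, n_copies=1, sigma_noise=0.1)
    print(len(X_new), len(y_new))
```

With n_copies=1 the training set doubles in size, which corresponds to the single perturbed repetition the abstract identifies as a reasonable trade-off. In practice sigma_noise would be treated as a hyperparameter to tune, as the abstract notes.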

Journal: Journal of Chemical Information and Modeling
Publication Date: Dec 11, 2015
Citations: 34

Similar Papers

QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction
Isidro Cortés-Ciriano ... Andreas Bender
Journal of Cheminformatics | VOL. 12
05 Jun 2020

Classification of High‐Activity Tiagabine Analogs by Binary QSAR Modeling
Andreas Jurik ... Gerhard F Ecker
Molecular Informatics | VOL. 32
15 May 2013

Machine tool chattering monitoring by Chen-Lee chaotic system-based deep convolutional generative adversarial nets
Ping-Huan Kuo ... Her-Terng Yau
Structural Health Monitoring | VOL. 22
20 Mar 2023

Application of multilayer perceptron with data augmentation in nuclear physics
Hüseyin Bahtiyar ... Esra Yüksel
Applied Soft Computing | VOL. 128
11 Aug 2022
