Abstract

Variable selection k Nearest Neighbor (kNN) QSAR is a popular nonlinear methodology for building correlation models between chemical descriptors of compounds and biological activities. The models are built by finding a subspace of the original descriptor space where activity of each compound in the data set is most accurately predicted as the averaged activity of its k nearest neighbors in this subspace. We have formulated the problem of searching for the optimized kNN QSAR models with the highest predictive power as a variational problem. We have investigated the relative contribution of several model parameters such as the selection of variables, the number (k) of nearest neighbors, and the shape of the weighting function used to evaluate the contributions of k nearest neighbor compound activities to the predicted activity of each compound. We have derived the expression for the weighting function which maximizes the model performance. This optimization methodology was applied to several experimental data sets divided into the training and test sets. We report a significant improvement of both the leave-one-out cross-validated R(2) (q(2)) for the training sets and predictive R(2) of the test sets in all cases. Depending on the data set, the average improvements in the prediction accuracy (prediction R(2)) for the test sets ranged between 1.1% and 94% and for the training sets (q(2)) between 3.5% and 118%. We also describe a modified computational procedure for model building based on the use of relational databases to store descriptors and calculate compounds' similarities, which simplifies calculations and increases their efficiency.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call