On the selection of the training set in environmental QSAR analysis when compounds are clustered

Lennart Eriksson, Martin Müller, Erik Johansson, Svante Wold

doi:10.1002/1099-128x(200009/12)14:5/6<599::aid-cem619>3.0.co;2-8

Abstract

In environmental QSAR analysis, adverse effects of chemicals released to the environment are modelled and predicted as a function of the chemical properties of the pollutants. The set of compounds under study usually contains several classes of substances, i.e. it forms a more or less strongly clustered set. It is then necessary to ensure that the selected training set comprises compounds representing all of those chemical classes. Multivariate design in the principal properties of the compound classes is usually appropriate for selecting a meaningful training set. However, with clustered data, which are common in environmental chemistry and toxicology, a single multivariate design may be suboptimal because it risks selecting training set compounds only from the largest classes and ignoring small classes with few members. We recently proposed a procedure for training set selection that recognizes clustering. In this approach, when non-selective biological or environmental responses are modelled, local multivariate designs are constructed within each cluster (class). The compounds chosen by the local designs are then united in the overall training set, which thus contains members from all clusters. Here the proposed strategy is further tested and elaborated by applying it to a series of 351 chemical substances for which the soil sorption coefficient is available. These compounds are divided into 14 classes containing between 10 and 52 members. The training set selection is discussed, followed by multivariate QSAR modelling, model interpretation and predictions for the test set. Various types of statistical experimental designs are tested during the training set selection phase. Copyright © 2000 John Wiley & Sons, Ltd.
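The cluster-wise selection idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `local_design` and `select_training_set` are hypothetical, the descriptor matrix is synthetic rather than the 351 soil sorption compounds, and the "centre point plus extremes along each principal component" rule is only a simple stand-in for the statistical experimental designs the authors tested, which the abstract does not specify.

```python
import numpy as np


def local_design(X, n_components=2):
    """Pick rows of X with a simple design in the local PCA score space.

    Illustrative stand-in for a statistical design: the compound closest
    to the centre plus the extremes along each principal component.
    """
    # Autoscale descriptors within the cluster (guard constant columns).
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0
    Z = Xc / sd

    # Local PCA via SVD; the scores T play the role of principal properties.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    k = min(n_components, len(S))
    T = U[:, :k] * S[:k]

    chosen = {int(np.argmin(np.linalg.norm(T, axis=1)))}  # centre point
    for j in range(k):
        chosen.add(int(np.argmin(T[:, j])))  # low extreme on component j
        chosen.add(int(np.argmax(T[:, j])))  # high extreme on component j
    return sorted(chosen)


def select_training_set(X, labels, n_components=2):
    """Run a local design in every cluster and unite the selections."""
    labels = np.asarray(labels)
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)       # rows belonging to cluster c
        local = local_design(X[idx], n_components)
        selected.extend(idx[local].tolist())    # map back to global indices
    return sorted(selected)


# Synthetic example: three classes of unequal size, five descriptors each.
rng = np.random.default_rng(0)
sizes = [10, 30, 52]
X = np.vstack([rng.normal(loc=3.0 * i, size=(n, 5)) for i, n in enumerate(sizes)])
labels = np.repeat(np.arange(len(sizes)), sizes)

train = select_training_set(X, labels)
test = sorted(set(range(len(X))) - set(train))
```

Because the design is run separately inside each class, even the smallest class contributes compounds to `train`, which is the property a single global design cannot guarantee. The remaining compounds in `test` would then serve for external prediction.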
