In this study we test whether a simple modeling procedure used in QSAR/QSPR can produce models that are simple yet as accurate as robust Neural Network Ensemble (NNE) models. We present results from two procedures for generating and selecting simple linear and nonlinear multiregression (MR) models: (1) a method for selecting the best possible MR models (named CROMRsel) and (2) the Genetic Function Approximation (GFA) method from the Cerius2 program package. The resulting MR models are compared rigorously with several NNE models. For the comparison we selected four QSAR data sets previously studied with NNE (Tetko et al. J. Chem. Inf. Comput. Sci. 1996, 36, 794-803; Kovalishyn et al. J. Chem. Inf. Comput. Sci. 1998, 38, 651-659): (1) 51 benzodiazepine derivatives, (2) 37 carboquinone derivatives, (3) 74 pyrimidines, and (4) 31 antimycin analogues. These data sets were parameterized with 7, 6, 27, and 53 descriptors, respectively. The modeled properties were anti-pentylenetetrazole activity, antileukemic activity, inhibition constants for dihydrofolate reductase from MB1428 E. coli, and antifilarial activity, respectively. Nonlinearities were introduced into the MR models through 2-fold and/or 3-fold cross-products of the initial (linear) descriptors. Then, using the CROMRsel and GFA programs (J. Chem. Inf. Comput. Sci. 1999, 39, 121-132), sets of the I best descriptors (I ≤ 8 in this paper), chosen according to the fit and leave-one-out correlation coefficients, were selected for the multiregression models. Two classes of models were obtained: (1) linear or nonlinear MR models generated starting from the complete set of descriptors, and (2) nonlinear MR models generated starting from the same set of descriptors used in the NNE modeling. In addition, the descriptor selection method of CROMRsel was compared with the GFA method included in the QSAR module of the Cerius2 program.
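The descriptor-expansion and selection steps described above can be illustrated with a minimal sketch. It is not the CROMRsel or GFA implementation, whose internals are not given here. It assumes that cross-products include powers of a single descriptor, that the leave-one-out statistic is q² = 1 - PRESS/SS_total, and that an exhaustive search over subsets stands in for the actual selection algorithm. All function names (`add_cross_products`, `loo_q2`, `best_subset`) are hypothetical.

```python
from itertools import combinations, combinations_with_replacement

import numpy as np


def add_cross_products(X, max_order=2):
    """Return the linear descriptors plus all products up to max_order.

    Assumption: products of a descriptor with itself (squares, cubes)
    are included; the source does not say whether they are.
    """
    columns, labels = [], []
    for order in range(1, max_order + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), order):
            columns.append(np.prod(X[:, list(idx)], axis=1))
            labels.append(idx)  # tuple of original descriptor indices
    return np.column_stack(columns), labels


def loo_q2(X, y):
    """Leave-one-out q^2 of an ordinary least-squares fit with intercept.

    Uses the leverage identity e_loo = e / (1 - h), so no refitting is
    needed. Assumes the design matrix has full column rank and more
    rows than columns.
    """
    A = np.column_stack([np.ones(len(y)), X])
    Q, _ = np.linalg.qr(A)                 # orthonormal basis of col(A)
    leverage = np.sum(Q ** 2, axis=1)      # diagonal of the hat matrix
    residuals = y - Q @ (Q.T @ y)
    press = np.sum((residuals / (1.0 - leverage)) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def best_subset(X, y, size):
    """Exhaustively pick the `size` columns with the highest LOO q^2."""
    best_q2, best_cols = -np.inf, None
    for cols in combinations(range(X.shape[1]), size):
        q2 = loo_q2(X[:, list(cols)], y)
        if q2 > best_q2:
            best_q2, best_cols = q2, cols
    return best_q2, best_cols


if __name__ == "__main__":
    # Synthetic data only; these are not the data sets from the study.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 4))
    y = 2.0 * X[:, 0] * X[:, 1] + 3.0 * X[:, 2] + 0.01 * rng.normal(size=40)
    Z, labels = add_cross_products(X, max_order=2)
    q2, cols = best_subset(Z, y, size=2)
    print("selected terms:", [labels[c] for c in cols], "q2:", round(q2, 4))
```

The number of candidate subsets grows combinatorially with the size of the descriptor pool, so a brute-force search of this kind is practical only for small pools such as the one in the demo.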
For each data set, the MR models had better cross-validated statistical parameters than the corresponding NNE models, and CROMRsel selected somewhat better MR models than the GFA method. The MR models are also much simpler than the NNEs, which is an important and surprising result, and they express the calculated dependencies in functional form. Moreover, the MR models were better than all other models obtained by different methods on the same data sets ("old" multivariate regressions, functional-link-net models, back-propagation neural networks, genetic algorithm models, and partial least squares models). This study also indicates that robust NNE methods do not produce good models when applied to small data sets, suggesting that robust methods such as NNE are perhaps better applied to larger data sets.