The prediction of molecular properties from a given molecular structure is of considerable interest for many physicochemical and biochemical processes. In chromatography in particular, efforts are being made to predict retention times theoretically, because separations are based on molecular interactions of the analytes as they partition between the two phases of the separation system. Linear free-enthalpy relationships model chromatographic retention as a sum of individual energy contributions (dispersion, dipole–dipole, π–π, proton donor–acceptor interactions, etc.). For molecules with complex structures, such as the biopolymers peptides and nucleic acids, retention is frequently estimated instead by adding empirically determined retention contributions of the individual amino acids or nucleotides and then correcting the sum with terms that account for the overall structure of the molecule. For more complex molecules, however, this prediction model is imprecise and the relevant descriptors are very difficult to determine. Apart from the overall composition, sequence information and information on secondary structures are not considered in these models. Models have also been developed for peptides that do not derive retention from the properties of the molecular building blocks but learn it from data sets obtained by analyzing test compounds of known structure. For example, the retention data of approximately 7000 peptides have been used to train an artificial neural network (ANN) that predicts peptide retention times from peptide sequences with an accuracy of 3–10%. Other methods from the field of statistical learning, for example support vector machines (SVMs), can also be used for regression problems. In addition to the advantage that their training leads to a globally optimal solution (unlike ANNs), support vector approaches have proved themselves in practical use on chemical problems.
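The additive approach described above can be illustrated with a minimal Python sketch. It is not the published model: the contribution values in `CONTRIBUTION`, the function name `additive_retention`, and the single `correction` term are all hypothetical placeholders chosen for illustration. The sketch only shows why a purely composition-based sum cannot distinguish sequences that contain the same building blocks in a different order.

```python
from collections import Counter

# Hypothetical per-nucleotide retention contributions (arbitrary units).
# These numbers are placeholders for illustration, not measured values.
CONTRIBUTION = {"A": 1.0, "C": 0.5, "G": 0.7, "T": 1.2}


def additive_retention(sequence, contributions=CONTRIBUTION, correction=0.0):
    """Sum the contribution of each building block, then add one
    whole-molecule correction term (assumed to be supplied by the caller)."""
    counts = Counter(sequence.upper())
    unknown = set(counts) - set(contributions)
    if unknown:
        raise ValueError(f"unknown building blocks: {sorted(unknown)}")
    return sum(contributions[base] * n for base, n in counts.items()) + correction


# Two sequences with identical composition but different order
# receive the same prediction (up to floating-point rounding).
print(additive_retention("GATTACA"))
print(additive_retention("ACATTAG"))
```

Because only the counts of each building block enter the sum, any effect of sequence order or secondary structure would have to be absorbed by the correction term, which is the limitation the text points out.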
A model based on the determination and addition of the retention contributions of the nucleotides has been developed to predict the retention of oligonucleotides in ion-pair reversed-phase chromatography (IP-RPC). This model provided satisfactory results at relatively high separation temperatures (60 °C), at which secondary structures are less pronounced, whereas at lower temperatures the influence of hairpins or partial double strands led to a poor correlation between prediction and experiment (own measurement results). Our model for the retention of oligonucleotides in IP-RPC, which is also applicable at low temperatures, is based on ν-support vector regression (SVR) as proposed by Schölkopf et al. This method determines a model for a given data set that simultaneously minimizes the model error and the model complexity. Training this model requires only a small number of oligonucleotides (50–100). A test data set was created by measuring the retention times of 72 oligonucleotides. To capture the influence of the sequence on retention, 41 of the oligonucleotide sequences were generated by varying the sequence of a 24mer (GTA CTC AGT GTA GCC CAG GAT GCC). To take other possible secondary structures into account, four further sequences were selected that form stable hairpin structures even at higher temperatures. The remaining sequences were selected so that they covered a length range of 15–48 nucleotides. Quantitative structure–property relationships (QSPR) encode the input structures in the form of characteristic (feature) vectors. Relevant characteristics for the property to be predicted are,

[*] Dipl.-Chem. S. Quinten, Dr. B. M. Mayr, Prof. Dr. C. G. Huber, Fachbereich Chemie, Instrumentelle Analytik und Bioanalytik, Universität des Saarlandes, Gebäude B2.2, 66123 Saarbrücken (Germany); Fax: (+49) 681-302-2433; E-mail: christian.huber@mx.uni-saarland.de
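As a rough illustration of how an oligonucleotide sequence might be encoded as a feature vector for a regression method such as SVR, the following Python sketch counts the length, the four bases, and the sixteen neighbouring base pairs. The choice of features and the name `feature_vector` are assumptions made for this example only; the text above does not list the characteristics actually used, and this encoding does not describe secondary structure.

```python
from itertools import product

BASES = "ACGT"
# All 16 ordered neighbour pairs: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]


def feature_vector(sequence):
    """Encode a sequence as [length, 4 base counts, 16 dinucleotide counts].

    This feature set is a hypothetical example, not the published one.
    """
    seq = sequence.replace(" ", "").upper()
    if set(seq) - set(BASES):
        raise ValueError("sequence may contain only A, C, G, T")
    base_counts = [seq.count(b) for b in BASES]
    # Overlapping neighbour pairs carry some information about sequence
    # order, which plain composition counts cannot.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pair_counts = [pairs.count(d) for d in DINUCLEOTIDES]
    return [len(seq)] + base_counts + pair_counts


parent = "GTA CTC AGT GTA GCC CAG GAT GCC"  # the 24mer quoted in the text
vec = feature_vector(parent)
print(len(vec), vec[:5])
```

Vectors of this kind, paired with measured retention times, would be the input to a regression routine; a ν-SVR implementation from an existing library could be used for that step.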