Abstract

While data integration for data analysis has been investigated extensively in biological applications, it has received comparatively little attention in computational chemistry and quantitative structure–activity relationship (QSAR) research. With the growing number of chemical databases available on the web, such data integration efforts become an intriguing possibility (and, in fact, a necessity). In this paper, we take a first step towards the following scenario for predictive toxicology applications. Given a new structure for which a prediction is required, the first step is to gather (integrate) all relevant information from internet databases for the structure itself and for all structures with available information on the endpoint of interest. In a second step, the collected information is combined statistically into a prediction for the new structure. We simulate this scenario with three endpoints (data sets) from the DSSTox database and collect information from three public chemical databases: PubChem, ChemBank and Sigma-Aldrich. In the experiments, we investigate whether adding background knowledge from the three databases improves predictive performance, compared with using chemical structure alone, in a statistically significant way. For this purpose, we define groups of features (those that belong together from an application point of view) from the three databases, and perform a variant of forward selection to include these feature groups in a prediction model. Our experiments show that the integration of background knowledge from internet databases can significantly improve prediction performance, especially for regression tasks.
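
The group-wise forward selection mentioned above can be illustrated with a minimal sketch. It is not the variant used in the paper, whose details the abstract does not give. It assumes a plain greedy loop with a fixed improvement threshold (`min_gain`) standing in for a statistical significance test. The group names, the `toy_score` function and its numbers are hypothetical and serve only to make the example runnable. In practice the score would be a cross-validated performance estimate of a model built on structural features plus the selected groups.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_select_groups(
    groups: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy forward selection over named feature groups.

    `score(selected)` must return a performance estimate (higher is
    better) for a model using chemical structure plus the selected groups.
    `min_gain` is a simple stand-in for a significance test (assumption).
    """
    selected: List[str] = []
    remaining = list(groups)
    best = score(frozenset())  # baseline: structure alone

    while remaining:
        # Try adding each remaining group to the current selection.
        candidates = [(score(frozenset(selected + [g])), g) for g in remaining]
        cand_score, cand_group = max(candidates, key=lambda t: t[0])

        # Stop when the best candidate does not improve enough.
        if cand_score - best <= min_gain:
            break

        selected.append(cand_group)
        remaining.remove(cand_group)
        best = cand_score

    return selected, best


# Hypothetical, made-up gains per feature group (illustration only).
TOY_GAINS = {
    "pubchem_assays": 0.08,
    "chembank_properties": 0.03,
    "sigma_aldrich_safety": 0.0,
}


def toy_score(selected: FrozenSet[str]) -> float:
    """Toy additive score: baseline 0.60 plus the gain of each group."""
    return 0.60 + sum(TOY_GAINS[g] for g in sorted(selected))


if __name__ == "__main__":
    chosen, final = forward_select_groups(list(TOY_GAINS), toy_score, min_gain=0.01)
    print(chosen, round(final, 2))
```

Selecting whole groups rather than single features keeps related descriptors from one database together, which matches the abstract's description of groups that belong together from an application point of view.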
