Abstract

The quality of the data used to derive a QSAR model is extremely important, as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they lead to errors in the calculation of descriptors and hence to meaningless results. The increasing amount of data, however, has often made it hard to check very large databases manually. In light of this, we designed and implemented a semi-automated workflow that integrates structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, and selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. It has been implemented with the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets.

Highlights

  • Quantitative Structure–Activity Relationships (QSARs) are statistical models relating a property/activity of a set of chemicals to their structural features, encoded in a numerical notation by means of molecular descriptors. It is intuitive that the predictions of a QSAR cannot be more accurate than the original data used for its derivation [1]

  • It is of the utmost importance that the dataset used for model derivation contains high quality data, because any error in chemical structure or biological data will be implicitly transferred into the QSAR model

  • The final output of the workflow is a curated QSAR-ready dataset comprising only reliable and high-quality data that can be used for modeling exercises

Introduction

Quantitative Structure–Activity Relationships (QSARs) are statistical models relating a property or activity (i.e. an endpoint, such as a pharmacological effect, toxicity, or a physico-chemical or bio-physical property) of a set of chemicals to their structural features, encoded in a numerical notation by means of molecular descriptors. It is intuitive that a QSAR’s predictions cannot be more accurate than the original data used for its derivation [1]. It is of the utmost importance that the dataset used for model derivation (i.e. the training set) contains high-quality data, because any error in chemical structure or biological data will be implicitly transferred into the QSAR model. In this regard, careful curation and selection of input data is essential [2] (Fig. 1). More and more web-based data services and tools have emerged that provide a way to store and constantly update information on thousands of different chemical structures. Examples include ChemIDplus [4] and PubChem [5].
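To make the idea of cross-checking several data services concrete, the following is a minimal sketch in plain Python, not the KNIME workflow itself. It assumes each source has already returned one structure string per chemical. The function name `consensus_structure`, the `min_agreement` parameter, the status labels, and the third source name are hypothetical and chosen only for illustration. Structures are compared as raw strings, which is a simplification: a real comparison would first convert each structure to a canonical representation.

```python
from collections import Counter


def consensus_structure(records, min_agreement=2):
    """Pick the structure on which the queried sources agree.

    records: dict mapping a source name to the structure string it
             returned for one chemical (None if the source had no entry).
    Returns a (structure, status) tuple; structure is None whenever the
    record should not be accepted automatically.
    """
    # Ignore sources that returned nothing for this chemical.
    found = [s.strip() for s in records.values() if s]
    if not found:
        return None, "missing"

    # Count how many sources returned each distinct string.
    ranked = Counter(found).most_common()
    structure, votes = ranked[0]

    # Too few agreeing sources, or a tie between candidates:
    # flag the chemical for manual inspection (assumed policy).
    tied = len(ranked) > 1 and ranked[1][1] == votes
    if votes < min_agreement or tied:
        return None, "manual_check"

    status = "consistent" if len(ranked) == 1 else "majority"
    return structure, status


# Illustrative input: two sources agree on ethanol, one disagrees.
example = {"ChemIDplus": "CCO", "PubChem": "CCO", "source_c": "CC=O"}
print(consensus_structure(example))
```

Routing disagreements to a manual check rather than discarding them reflects the semi-automated character of the curation described here; the threshold of two agreeing sources is an assumption, not a value taken from the paper.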

