canSAR chemistry registration and standardization pipeline

Daniela Dolciami, Christos Kannas, Albert A Antolin, Mirco Meniconi, Eloy Villasclaras-Fernandez, Bissan Al-Lazikani

doi:10.1186/s13321-022-00606-7

Abstract

Background

Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it brings a wealth of knowledge about compounds into one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates), highlighting the need to organise related compounds in hierarchies that alert the user to potentially relevant bioactivity data. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of using hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.

Results

We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps to register the compounds and create the compound hierarchy: (1) structure checker, (2) standardization, (3) generation of canonical tautomers and representative structures, (4) salt strip, and (5) generation of an abstract structure to generate the compound hierarchy. Unlike ChEMBL’s RDKit pipeline, we carry out compound canonicalization before deriving the parent structure, similar to PubChem’s OpenEye pipeline. canSARchem has a lower rejection rate than both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds, which demonstrates the importance of this step.

Conclusions

We use canSARchem to standardize all the compounds uploaded to canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline.
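The idea of a compound hierarchy described in the abstract can be pictured as nested grouping keys. The following is a minimal Python sketch, not the authors' KNIME implementation: the class `Registered`, the functions `build_hierarchy` and `related_forms_with_data`, and the toy keys (`T1`, `P1`, `A1`, ...) are all hypothetical and stand in for the structure identifiers that steps 3 to 5 of the real pipeline would compute.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Registered:
    """One registered compound form; all field names are illustrative."""
    compound_id: str    # hypothetical source identifier
    canonical_key: str  # stand-in for the canonical tautomer (step 3)
    parent_key: str     # stand-in for the salt-stripped parent (step 4)
    abstract_key: str   # stand-in for the abstract structure (step 5)
    n_activities: int   # number of bioactivity records for this form


def build_hierarchy(records):
    """Nest records as abstract -> parent -> canonical -> registered forms."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for r in records:
        tree[r.abstract_key][r.parent_key][r.canonical_key].append(r)
    return tree


def related_forms_with_data(tree, query):
    """List other forms under the same abstract structure that have bioactivity data."""
    related = []
    # .get avoids creating an empty entry for an unknown abstract key
    for parents in tree.get(query.abstract_key, {}).values():
        for forms in parents.values():
            for r in forms:
                if r.compound_id != query.compound_id and r.n_activities > 0:
                    related.append(r.compound_id)
    return sorted(related)


# Toy data: three forms of one compound plus one unrelated compound.
records = [
    Registered("src1:free_base", "T1", "P1", "A1", 12),
    Registered("src2:hcl_salt", "T2", "P1", "A1", 3),
    Registered("src3:tautomer", "T1", "P1", "A1", 0),
    Registered("src4:unrelated", "T9", "P9", "A9", 5),
]
tree = build_hierarchy(records)

# The tautomer has no data of its own, but two related forms do.
print(related_forms_with_data(tree, records[2]))
```

In this toy data the salt keeps its own canonical key (`T2`) but shares the parent key (`P1`) with the free base, mirroring the ordering stated in the abstract, in which canonicalization precedes derivation of the parent structure.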


Journal: Journal of Cheminformatics	Publication Date: May 28, 2022
Citations: 8	License type: open-access


