Abstract

Natural products (NPs) represent one of the most important resources for discovering new drugs. Here we asked whether the origin of NPs can be assigned from their molecular structure, using a subset of 60,171 NPs from the recently reported Collection of Open Natural Products (COCONUT) database that are assigned to plants, fungi, or bacteria. An interactive tree-map (TMAP) of this subset, calculated using MAP4 (MinHashed atom pair fingerprint), clustered NPs according to their assigned origin (https://tm.gdb.tools/map4/coconut_tmap/). A support vector machine (SVM) trained with MAP4 correctly assigned the origin for 94% of plant, 89% of fungal, and 89% of bacterial NPs in this subset. An online tool based on an SVM trained with the entire subset correctly assigned the origin of further NPs with similar performance (https://np-svm-map4.gdb.tools/). Origin information might be useful when searching for biosynthetic genes of NPs isolated from plants but produced by endophytic microorganisms.
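
The MinHash idea behind a fingerprint such as MAP4 can be illustrated with a minimal sketch. It assumes each molecule has already been reduced to a set of string features. The `C|2|O`-style tokens below are hypothetical placeholders, not the shingles MAP4 actually generates, and `minhash`, `estimated_jaccard`, and `exact_jaccard` are illustrative names rather than the API of the MAP4 package. The sketch only shows how the fraction of matching signature positions approximates the Jaccard similarity of two feature sets. It does not reproduce the fingerprint, the TMAP layout, or the SVM described above.

```python
import hashlib
from typing import Iterable, List


def _hash64(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(features: Iterable[str], n_perm: int = 256) -> List[int]:
    """Return a MinHash signature: for each seed, the minimum hash over all features."""
    feats = set(features)
    if not feats:
        raise ValueError("need at least one feature")
    return [min(_hash64(f, seed) for f in feats) for seed in range(n_perm)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Exact Jaccard similarity of two feature sets, for comparison."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        raise ValueError("need at least one feature")
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical feature sets for two molecules (placeholder tokens, not real MAP4 shingles).
mol_a = {"C|1|C", "C|2|O", "C|3|N", "O|1|C", "N|2|C", "C|4|O"}
mol_b = {"C|1|C", "C|2|O", "C|3|N", "O|1|C", "S|2|C", "C|4|S"}

if __name__ == "__main__":
    print("exact:    ", exact_jaccard(mol_a, mol_b))
    print("estimated:", estimated_jaccard(minhash(mol_a), minhash(mol_b)))
```

Because every signature has the same fixed length regardless of molecule size, such signatures can be compared quickly and passed to downstream tools such as a nearest-neighbor layout or a classifier. The choice of 256 hash positions here is arbitrary; more positions give a tighter similarity estimate at higher cost.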

Highlights

  • Due to the importance of natural products (NPs) in drug discovery [1, 2], there is considerable interest in describing and understanding their structural diversity by exploiting NP databases [3] with in silico methods such as machine learning (ML) [4]

  • Chemical space analysis of plant and microbial NPs from the Collection of Open Natural Products (COCONUT) database: COCONUT collects over 400 thousand NPs from 52 different databases, 135 thousand of which are annotated with a taxonomical origin

  • We focused on NPs annotated as originating from plants (50%), fungi (23%), or bacteria (16%), leaving out smaller subsets annotated as originating from animals (2%), Homo sapiens (2.5%), or marine sources (1.5%), or lacking any of the preceding taxonomical annotations (5%)

Introduction

Due to the importance of natural products (NPs) in drug discovery [1, 2], there is considerable interest in describing and understanding their structural diversity by exploiting NP databases [3] with in silico methods such as machine learning (ML) [4]. In our own approach to this problem [14], we recently analyzed NPAtlas, an open-access database listing 25,523 NPs of bacterial or fungal origin [15], by computing the MAP4 fingerprint (MinHashed Atom-Pair fingerprint up to four bonds) [16] of each NP and creating a TMAP (tree-map) [17] of the resulting high-dimensional dataset. In this analysis, NPs of bacterial origin and NPs of fungal origin formed separate clusters. Among the many endophytic NPs [23, 24], striking examples include the cancer drug paclitaxel, a plant NP

