Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling.

Arkadii Lin,Alexandre Varnek,Gilles Marcou,Bernd Beck,Igor I Baskin,Dragos Horvath

doi:10.1002/minf.202000009

Abstract

Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame Set (FS) of compounds used for the manifold construction must cover that chemical space well. Intuitively, the FS size must grow with the size and diversity of the target library. At the same time, GTM training can be very slow or even become technically impossible at FS sizes of the order of 10⁵ compounds, which is a very small number compared to today's commercially accessible compounds and, especially, to the theoretically feasible molecules. To solve this problem, we propose a Parallel GTM algorithm based on the merging of “intermediate” manifolds constructed in parallel for different subsets of molecules. An ensemble of these subsets forms a FS for the “final” manifold. To assess the efficiency of the new algorithm, 80 GTMs were built on FSs of different sizes ranging from 10 to 1.8 M compounds selected from the ChEMBL database. Each GTM was challenged to build classification models for up to 712 biological activities (depending on the FS size). With the novel Parallel GTM procedure, we could thus cover the entire spectrum of possible FS sizes, whereas previous studies were forced to rely on the working hypothesis that FS sizes of a few thousand compounds are sufficient to describe the ChEMBL chemical space. In fact, this study formally proves this to be true: a FS containing only 5000 randomly picked compounds is sufficient to represent the entire ChEMBL collection (1.8 M molecules), in the sense that a further increase in the number of FS compounds has no beneficial impact on the predictive propensity of the above‐mentioned 712 activity classification models. Parallel GTM may, however, be required to generate maps based on very large FSs, which might improve chemical space cartography of big commercial and virtual libraries approaching billions of compounds.

Highlights

Nowadays, public and private chemical databases contain millions of already synthesized compounds (ChEMBL[1], PubChem[2], CAS[3], etc.) and billions of computer-generated virtual structures (GDB-17[4]).
Generative Topographic Mapping (GTM) is a probabilistic extension of the Self-Organizing Mapping (SOM)[33] method where log-likelihood is utilized as an objective function.[12]
Four intermediate GTM manifolds were trained on 5K compounds each, and the entire ChEMBL collection was projected on them as well as on the final manifold.
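
For readers unfamiliar with the objective function mentioned in the highlights, the log-likelihood in the standard GTM formulation is usually written as below. The notation is an assumption taken from the general GTM literature and does not appear on this page: t_n are the N data points in D dimensions, x_k are the K grid nodes in latent space, y(x; W) is the mapping to data space with weights W, and β is the inverse noise variance.

```latex
\mathcal{L}(\mathbf{W},\beta)
  = \sum_{n=1}^{N} \ln\left\{ \frac{1}{K} \sum_{k=1}^{K}
      p(\mathbf{t}_n \mid \mathbf{x}_k, \mathbf{W}, \beta) \right\},
\qquad
p(\mathbf{t}_n \mid \mathbf{x}_k, \mathbf{W}, \beta)
  = \left(\frac{\beta}{2\pi}\right)^{D/2}
    \exp\left\{ -\frac{\beta}{2}
      \left\lVert \mathbf{y}(\mathbf{x}_k;\mathbf{W}) - \mathbf{t}_n \right\rVert^{2} \right\}
```

Training maximizes this quantity over W and β, typically with an expectation-maximization procedure.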

Summary

Introduction

Public and private chemical databases contain millions of already synthesized compounds (ChEMBL[1], PubChem[2], CAS[3], etc.) and billions of computer-generated virtual structures (GDB-17[4]). This chemical universe needs to be explored and analyzed. On a GTM, each compound is represented by a responsibility vector over the nodes of the map. To describe an entire data set, a vector of cumulative responsibilities can be built from the responsibility vectors of the individual compounds. The cumulative responsibilities can be associated with class or property values, which leads to a GTM Class Landscape or a GTM Property Landscape. These landscapes can be used as classification and regression models in various chemoinformatics tasks.[13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]
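
The relationship between responsibilities, cumulative responsibilities and a class landscape can be illustrated with a minimal NumPy sketch. It assumes the manifold has already been trained, so the node images `Y` in descriptor space and the inverse variance `beta` are given. All function names are hypothetical, and the landscape here is a plain responsibility-weighted class fraction per node, which may differ in detail from the authors' implementation.

```python
import numpy as np


def responsibilities(T, Y, beta):
    """Posterior probability of each map node for each compound.

    T: (N, D) descriptor vectors; Y: (K, D) node images in descriptor
    space (assumed to come from a trained manifold); beta: inverse variance.
    """
    # Squared Euclidean distance between every compound and every node image
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    # Work in log space and subtract the row maximum for numerical stability
    log_p = -0.5 * beta * d2
    log_p -= log_p.max(axis=1, keepdims=True)
    R = np.exp(log_p)
    # Equal node priors cancel out, so normalizing each row is sufficient
    return R / R.sum(axis=1, keepdims=True)


def cumulative_responsibilities(R):
    """Sum the responsibility vectors of all compounds: one value per node."""
    return R.sum(axis=0)


def class_landscape(R, labels, n_classes):
    """Responsibility-weighted class fractions per node, shape (K, C)."""
    onehot = np.eye(n_classes)[labels]            # (N, C)
    mass = R.T @ onehot                           # (K, C) class mass per node
    total = mass.sum(axis=1, keepdims=True)       # (K, 1) cumulative responsibility
    # Unpopulated nodes fall back to a uniform class distribution (an assumption)
    uniform = np.full_like(mass, 1.0 / n_classes)
    return np.divide(mass, total, out=uniform, where=total > 0)


def predict_class(R_new, landscape):
    """Classify compounds by averaging the landscape under their responsibilities."""
    return (R_new @ landscape).argmax(axis=1)
```

In this sketch a compound is classified by the class whose landscape value, averaged over the nodes the compound is responsible for, is highest. A property landscape would follow the same pattern with property values in place of the one-hot class matrix.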



Journal: Molecular Informatics
Publication Date: Apr 29, 2020
Citations: 6
License type: CC BY 4.0
