Abstract

Methods for dimensionality reduction are making significant contributions to knowledge generation in high-dimensional modeling scenarios across many disciplines. By producing a lower-dimensional representation (also called an embedding), they reduce the computing resources needed in downstream machine learning tasks, leading to faster training times, lower model complexity, and greater statistical flexibility. In this work, we investigate the utility of three prominent unsupervised embedding techniques (principal component analysis, PCA; uniform manifold approximation and projection, UMAP; and variational autoencoders, VAEs) for solving classification tasks in the domain of toxicology. To this end, we compare these embedding techniques against a set of molecular fingerprint-based models that do not use additional preprocessing of features. Inspired by the success of transfer learning in several fields, we further study the performance of the embedders when trained on an external dataset of chemical compounds. To gain a better understanding of their characteristics, we evaluate the embedders with different embedding dimensionalities and with different sizes of the external dataset. Our findings show that the recently popularized UMAP approach can be used alongside established techniques such as PCA and VAEs as a pre-compression technique in the toxicology domain. Nevertheless, the generative VAE model shows an advantage in pre-compressing the data with respect to classification accuracy.
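The sketch below illustrates the kind of pipeline the abstract describes: unsupervised embedders compress fingerprint features before a downstream classifier. It is a minimal, hypothetical example assuming scikit-learn and umap-learn; the random placeholder arrays, the 1024-bit fingerprint size, the 32-dimensional embedding, and the random-forest classifier are illustrative assumptions, not the datasets, dimensionalities, or models used in the paper.

```python
# Hypothetical sketch: compress molecular fingerprints with PCA or UMAP
# before a downstream classifier. In practice X would hold, e.g., binary
# Morgan fingerprints (computed with a toolkit such as RDKit) and y the
# toxicity labels; random placeholders are used here.
import numpy as np
import umap  # umap-learn
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.randint(0, 2, size=(1000, 1024)).astype(float)  # placeholder fingerprints
y = np.random.randint(0, 2, size=1000)                        # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

embedders = {
    "pca": PCA(n_components=32),
    "umap": umap.UMAP(n_components=32, random_state=0),
}

for name, embedder in embedders.items():
    Z_train = embedder.fit_transform(X_train)   # unsupervised embedding
    Z_test = embedder.transform(X_test)
    clf = RandomForestClassifier(random_state=0).fit(Z_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(Z_test)[:, 1])
    print(f"{name}: test ROC-AUC = {auc:.3f}")
```

A VAE encoder could be slotted into the same loop in place of PCA or UMAP; a minimal sketch of such an encoder is given after the Introduction below.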

Highlights

  • Chemical representation is an important topic in cheminformatics [1] and quantitative structure–activity relationships (QSARs), as QSAR model quality depends largely on the predictive features defined by the task at hand, i.e., mapping a feature space (X) onto a target chemical or biological activity (y)

  • Another difficulty in machine learning is the so-called curse of dimensionality, a term coined by Richard Bellman [17], which refers to various problems that arise when working with high-dimensional data, including increased chances of overfitting and spurious results

  • The linearity of principal component analysis (PCA) is what makes the method mathematically more concise than some nonlinear methods, but at the price of being limited to variance maximization and unable to capture nonlinear phenomena in single dimensions (see the sketch below). When it comes to nonlinear methods for dimensionality reduction, there are a number of noteworthy approaches, such as locally linear embedding [26], Laplacian eigenmaps [27], or t-SNE [26]
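To make the linearity point in the last highlight concrete, the following sketch shows that a fitted PCA embedding is simply an affine map of the inputs: each embedding dimension is a fixed linear combination of the original features. The data here are hypothetical, and only numpy and scikit-learn are assumed.

```python
# Sketch: PCA's embedding equals centering followed by projection onto the
# principal axes, i.e., a linear map. Hypothetical data for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 100))              # placeholder feature matrix

pca = PCA(n_components=5).fit(X)
Z = pca.transform(X)                    # embedding computed by the library

# Reconstruct the same embedding by hand: subtract the mean, then project
# onto the principal axes (rows of pca.components_).
Z_manual = (X - pca.mean_) @ pca.components_.T
assert np.allclose(Z, Z_manual)

# Each component maximizes variance subject to orthogonality with the
# previous ones; no single component can capture nonlinear structure.
print(pca.explained_variance_ratio_)
```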


Summary

Introduction

Chemical (or molecular) representation is an important topic in cheminformatics [1] and quantitative structure–activity relationships (QSARs), as QSAR model quality depends largely on the predictive features defined by the task at hand, i.e., mapping a feature space (X) onto a target chemical or biological activity (y). A well-established strategy, besides feature selection, to cope with this issue is dimensionality reduction, i.e., transforming the data into a low-dimensional space such that the resulting low-dimensional representation preserves certain properties of the original data. Such an approach has proven to be highly useful for numerous downstream machine learning tasks like classification [19], anomaly detection [20], and recommender systems [21]. UMAP is a nonlinear method that works by constructing local manifold approximations and combining their local fuzzy simplicial set representations in order to create a topological representation of the high-dimensional data; it then minimizes the cross-entropy between the high- and low-dimensional topological representations, optimizing the layout of the low-dimensional data representation. Since autoencoders were originally proposed, there have been a number of adaptations of the original architecture, with variational autoencoders (VAEs) [34] being one of the latest state-of-the-art methods
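As a rough illustration of how a VAE can act as the pre-compression step discussed above, here is a minimal sketch of a fingerprint VAE. PyTorch, the layer sizes, the 1024-bit input, and the 32-dimensional latent space are all assumptions for illustration; the paper's actual framework and architecture are not specified here. The encoder outputs the mean and log-variance of a latent Gaussian, a reparameterized sample feeds the decoder, and the loss combines a reconstruction term with a KL divergence term.

```python
# Minimal VAE sketch for compressing binary fingerprints into a
# low-dimensional embedding. Architecture and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FingerprintVAE(nn.Module):
    def __init__(self, n_bits=1024, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bits, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_bits)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(recon_logits, x, mu, logvar):
    # Bernoulli reconstruction term plus KL divergence to a unit Gaussian.
    recon = F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After training, the encoder mean mu serves as the low-dimensional
# embedding passed to a downstream classifier.
```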

