Abstract

Computational approaches for synthesizing new chemical compounds have led to an explosion of chemical data in the field of drug discovery. Quantitative structure–activity relationship (QSAR) modeling is a widely used classification and regression method for representing the relationship between a chemical structure and its activities. This research focuses on the effect of dimensionality-reduction techniques on a high-dimensional QSAR dataset. Because QSAR data are high-dimensional, dimensionality-reduction techniques have become an integral part of the modeling process. Principal component analysis (PCA) is a feature-extraction technique with several applications in exploratory data analysis, visualization, and dimensionality reduction. However, linear PCA is inadequate for handling the complex structure of QSAR data. In light of the wide array of current feature-extraction techniques, we perform a comparative empirical study of five of them: PCA, kernel PCA, deep generalized autoencoder (dGAE), Gaussian random projection (GRP), and sparse random projection (SRP). The experiments are performed on a high-dimensional QSAR dataset comprising 6394 features. The transformed low-dimensional dataset is fed into a deep learning classification model to predict a QSAR biological activity. Three approaches are adopted to validate and measure the investigated techniques: (i) comparing the performance of the classification models, (ii) visualizing the relationship (correlation) between features in the low-dimensional Euclidean space, and (iii) validating the techniques using an external dataset. To the best of our knowledge, this study is the first to investigate and compare these feature-extraction techniques in the context of QSAR modeling. The results provide valuable insights into the behavior of the different techniques on both the negative and positive classes. With linear PCA as a baseline, we show that the investigated techniques substantially outperform the baseline on multiple accuracy measures and demonstrate useful ways of extracting significant features.
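
As a rough illustration of three of the five techniques compared here, the sketch below reduces a synthetic matrix with linear PCA (via SVD), Gaussian random projection, and sparse random projection using only NumPy. The random data, the target dimension of 50, and the helper names (`pca_project`, `gaussian_random_projection`, `sparse_random_projection`) are assumptions made for illustration and are not the authors' implementation; kernel PCA and dGAE are omitted for brevity.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components (linear PCA via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # shape (n_samples, k)

def gaussian_random_projection(X, k, rng):
    """Project X with a dense random matrix whose entries are N(0, 1/k)."""
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(X.shape[1], k))
    return X @ R

def sparse_random_projection(X, k, rng):
    """Project X with a sparse random matrix with entries in {-a, 0, +a}.

    Uses density 1/s with s = sqrt(n_features); nonzero entries have
    magnitude a = sqrt(s / k), so each entry has variance 1/k.
    """
    d = X.shape[1]
    s = np.sqrt(d)
    a = np.sqrt(s / k)
    R = rng.choice([-a, 0.0, a], size=(d, k),
                   p=[0.5 / s, 1.0 - 1.0 / s, 0.5 / s])
    return X @ R

# Synthetic stand-in for a QSAR descriptor matrix (assumption: 200 compounds;
# 6394 matches the feature count quoted in the abstract).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6394))
k = 50  # hypothetical target dimension

Z_pca = pca_project(X, k)
Z_grp = gaussian_random_projection(X, k, rng)
Z_srp = sparse_random_projection(X, k, rng)

# Random projections approximately preserve squared norms on average.
norm_ratio_grp = (Z_grp ** 2).sum(axis=1).mean() / (X ** 2).sum(axis=1).mean()
norm_ratio_srp = (Z_srp ** 2).sum(axis=1).mean() / (X ** 2).sum(axis=1).mean()
print(Z_pca.shape, Z_grp.shape, Z_srp.shape)
```

In a pipeline like the one described, each reduced matrix would then be passed to the same classifier so that the techniques can be compared on equal terms.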

Highlights

  • The rapid development of technology has led to explosive growth in data in many fields

  • To address the previously presented perspectives, we pose the following fundamental question: How can the high dimensionality of quantitative structure–activity relationship (QSAR) datasets be reduced to a low-dimensional space with minimal loss of valuable information? To answer this question, we investigate a number of feature-extraction techniques that have proven successful in the context of dimensionality reduction

  • Following the results reported by Wang et al. [13], we address the imbalanced class distribution in the blood-brain barrier (BBB) permeability dataset by applying the synthetic minority oversampling technique (SMOTE)
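
The oversampling idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch`, the toy data, and the parameter values below are hypothetical and only illustrate the interpolation step; this is not the authors' code, and a maintained library implementation would normally be preferred in practice.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy minority class with two features (assumed values, for illustration).
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
majority_count = 10  # hypothetical size of the majority class

new_points = smote_sketch(minority, n_new=majority_count - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority))
```

Because every synthetic point is a convex combination of two existing minority samples, it always falls inside the region already occupied by the minority class.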


Summary

Introduction

The rapid development of technology has led to explosive growth in data in many fields. Drug discovery has benefited from computational approaches for synthesizing new chemical compounds. Chemical synthesis and virtual screening have enabled the fast-paced generation of biological and chemical data and automated modeling [1]. This has resulted in a need for practical methods to model the relationship between molecular structures and properties [2].

The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico.

