Extension of pQSAR: Ensemble Model Generated by Random Forest and Partial Least Squares Regressions

Byung Chun Kim, Dosang Joe, Gangjoon Yoon, Youngho Woo, Yongkuk Kim

doi:10.1109/access.2020.3027828

Abstract

Quantitative structure-activity relationship (QSAR) regression models are mathematical models that relate the structural properties of chemicals to the potencies of their biological activities. In QSAR models, the physical and chemical information of the molecules is encoded into numerical values called descriptors. Recently, experimental test results (profiles) have been used as descriptors of chemicals. The Profile QSAR 2.0 (pQSAR) model suggested by Martin et al. is a multitask, two-step machine learning prediction method that combines random forest regressions (RFRs) and partial least squares regression (PLSR). In the pQSAR model, one fills the profile table’s missing values with RFRs and then builds a PLSR using the profile predictions. Note that in the second step of the pQSAR method, the PLSR’s predictor variables are profiles, that is, activity values, and the response variables are also activity values. Thus we can use the PLSRs to update the profile table and then repeat the second step. In this work, we propose an extended model of pQSAR generated by RFRs and PLSRs. We iteratively updated the full, initially predicted profile table with two kinds of prediction models, RFRs and PLSRs, for the PKIS and ChEMBL data sets. Even though the prediction performance of individual combinations of RFRs and PLSRs varies, the average of all possible predicted profile tables for a given iteration shows better performance. This ensemble model has better prediction performance, in the sense of Pearson’s $R^{2}$, than the pQSAR model.

Highlights

The first step in rational drug design is to discover hit compounds that can activate or inhibit an enzyme such as a protein kinase
In repeatedly updating the profile data, we use the row vectors as representation vectors of the compounds, which are fed to the random forest regressions (RFRs) and partial least squares regression (PLSR)
We compare the performance with that of the RFRs used for initialization, as given in (2), and with the Profile QSAR 2.0 (pQSAR) model suggested by Martin et al. [27]–[29] as a baseline, because the proposed model was inspired by the pQSAR model and began as an attempt to improve its performance

Summary

INTRODUCTION

The first step in rational drug design is to discover hit compounds that can activate or inhibit an enzyme such as a protein kinase. Experimental in vivo or in vitro test results have been introduced as descriptors to fill in missing biological values [24]–[26]. Among these attempts, Martin et al. [27]–[29] introduced a distinctive predictive ensemble method that uses bioactivity assay data with missing values, named Profile QSAR 2.0 (pQSAR). We propose an extended ensemble model of pQSAR that uses RFRs and PLSRs to improve the performance of a kinase-compound bioactivity prediction method. 94.5% of the values (1,983,209 out of 159 assays × 13,192 compounds) are missing. The proposed method appears to improve predictive performance over both pQSAR and RFR, with gains for almost all assays (154 of 159).
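As a quick check of the sparsity figure quoted above, the arithmetic works out as follows (the count of measured values is derived here, not stated in the source):

$$
159 \times 13{,}192 = 2{,}097{,}528, \qquad \frac{1{,}983{,}209}{2{,}097{,}528} \approx 0.9455, \qquad 2{,}097{,}528 - 1{,}983{,}209 = 114{,}319
$$

So roughly 114 thousand measured activities support a table of about 2.1 million cells.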

ITERATION METHOD OF IMPROVING A MODEL PERFORMANCE

EVALUATION OF GOODNESS OF A PREDICTION MODEL BASED ON THE PEARSON’S $R^{2}$
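This summary does not reproduce the paper's formula, so the following is a sketch under the assumption that Pearson's $R^{2}$ means the square of the Pearson correlation coefficient between observed and predicted activities. The function name `pearson_r2` is hypothetical.

```python
def pearson_r2(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Sums of centered cross-products and squares.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_o * var_p)
```

Because the correlation is squared, a perfectly anti-correlated prediction also scores 1, so this measure is usually read alongside a check of the sign or an error metric.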

PREDICTION METHOD BY MUTUAL COMPLEMENT OF

Findings

CONCLUSION


Journal: IEEE Access	Publication Date: Jan 1, 2020
Citations: 19	License type: CC BY 4.0
