Epigenetics involves reversible modifications in gene expression without altering the genetic code itself. Among these modifications, histone deacetylases (HDACs) play a key role by removing acetyl groups from lysine residues on histones. Overexpression of HDACs is linked to the proliferation and survival of tumor cells. To combat this, HDAC inhibitors (HDACi) are commonly used in cancer treatments. However, pan-HDAC inhibition can lead to numerous side effects. Therefore, isoform-selective HDAC inhibitors, such as HDAC3i, could be advantageous for treating various medical conditions while minimizing off-target effects. To date, computational approaches that use only the SMILES notation without any experimental evidence have become increasingly popular and necessary for the initial discovery of novel potential therapeutic drugs. In this study, we develop an innovative and high-precision stacked-ensemble framework, called Stack-HDAC3i, which can directly identify HDAC3i using only the SMILES notation. Using an up-to-date benchmark dataset, we first employed both molecular descriptors and Mol2Vec embeddings to generate feature representations that cover multi-view information embedded in HDAC3i, such as structural and contextual information. Subsequently, these feature representations were used to train baseline models using nine popular ML algorithms. Finally, the probabilistic features derived from the selected baseline models were fused to construct the final stacked model. Both cross-validation and independent tests showed that Stack-HDAC3i is a high-accuracy prediction model with great generalization ability for identifying HDAC3i. Furthermore, in the independent test, Stack-HDAC3i achieved an accuracy of 0.926 and Matthew’s correlation coefficient of 0.850, which are 0.44–6.11% and 0.83–11.90% higher than its constituent baseline models, respectively.
Read full abstract