Abstract

A major cause of failed drug discovery programs is suboptimal target selection, resulting in the development of drug candidates that are potent inhibitors, but ineffective at treating the disease. In the genomics era, the availability of large biomedical datasets with genome-wide readouts has the potential to transform target selection and validation. In this study we investigate how computational intelligence methods can be applied to predict novel therapeutic targets in oncology. We compared different machine learning classifiers applied to the task of drug target classification for nine different human cancer types. For each cancer type, a set of “known” target genes was obtained and equally-sized sets of “non-targets” were sampled multiple times from the human protein-coding genes. Models were trained on mutation, gene expression (TCGA), and gene essentiality (DepMap) data. In addition, we generated a numerical embedding of the interaction network of protein-coding genes using deep network representation learning and included the results in the modeling. We assessed feature importance using a random forest classifier and performed feature selection by comparing each feature's permutation importance against a null distribution. Our best models achieved good generalization performance, with AUROCs between 0.75 and 0.88 across the nine cancer types. With the best model for each cancer type, we ran predictions on more than 15,000 protein-coding genes to identify potential novel targets. Our results indicate that this approach may be useful to inform early stages of the drug discovery pipeline.
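The repeated sampling of equally-sized non-target sets described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_non_target_sets`, the seed handling, and the toy gene identifiers are all assumptions made for illustration.

```python
import random

def sample_non_target_sets(all_genes, known_targets, n_sets, seed=0):
    """Draw n_sets random non-target sets, each the same size as known_targets.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    targets = set(known_targets)
    # Candidate non-targets: every gene that is not a known target.
    candidates = sorted(g for g in set(all_genes) if g not in targets)
    if len(candidates) < len(targets):
        raise ValueError("not enough non-target genes to sample from")
    # Each draw is without replacement within a set; sets may overlap each other.
    return [rng.sample(candidates, len(targets)) for _ in range(n_sets)]

# Toy gene universe; identifiers are placeholders, not real gene symbols.
genes = [f"GENE{i}" for i in range(100)]
targets = ["GENE1", "GENE5", "GENE9", "GENE42"]
non_target_sets = sample_non_target_sets(genes, targets, n_sets=5)
```

Each sampled set can then be paired with the known targets to form one balanced training set, so that model performance can be averaged over several random choices of non-targets.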

Highlights

  • A major cause of failed drug discovery programs is suboptimal target selection, resulting in the development of drug candidates that are potent inhibitors, but ineffective at treating the disease

  • We have recently shown that the information captured by such an embedding can be relevant for drug target identification[15]

  • Through the integration of gene-gene interaction data via network embedding features, combined with a robust feature selection approach, well-performing models could be generated for all nine cancer types (AUROCs between 0.75 for leukemia and 0.88 for kidney cancer)
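For readers unfamiliar with the AUROC values quoted above, the metric can be read as the probability that a randomly chosen target gene receives a higher score than a randomly chosen non-target. The following is a minimal sketch of that pairwise definition; the function name `auroc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Illustrative implementation of the pairwise definition; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example: 1 = target, 0 = non-target (made-up scores).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auroc = auroc(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic pairwise loop is only meant to make the definition explicit; library implementations use a faster rank-based computation.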


Summary

Introduction

A major cause of failed drug discovery programs is suboptimal target selection, resulting in the development of drug candidates that are potent inhibitors, but ineffective at treating the disease. Examples of large biomedical datasets in the field of oncology include The Cancer Genome Atlas (TCGA)[5] and the Cancer Dependency Map (DepMap)[6]. In their manual analyses, experts typically consider each data source and data type (e.g. mutations and gene expression) independently, and weigh the information from each source against the others using subjective criteria. In the novel target identification field, Kumari et al.[9] proposed an improved random forest (RF) algorithm that integrates bootstrap and rotation feature matrix components to discriminate human drug targets from non-drug targets. They applied a synthetic minority over-sampling technique to alleviate the class (target/non-target) imbalance problem. Their features included compositions, amino acid property group compositions and dipeptide composition, and they achieved an accuracy of 85.3% using leave-one-out cross-validation. This approach looked at drug targets in a very general sense, without considering any specific disease associations.
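Synthetic minority over-sampling, mentioned above, creates new minority-class examples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows that general idea only; it is not the implementation used by Kumari et al., and the function name `smote_like`, the parameter `k`, and the toy points are assumptions for illustration.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    Simplified, hypothetical sketch of the over-sampling idea.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        # New point lies on the line segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy minority class: four points in a 2-D feature space (made-up values).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_like(minority, n_new=6)
```

Because each synthetic point is a convex combination of two existing minority samples, it always falls within the range spanned by the original minority data.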

