Abstract

In silico models that predict which tumors will respond to a given drug are necessary for Precision Oncology. However, predictive models are only available for a handful of cases (each case being a given drug acting on tumors of a specific cancer type). One way to generate predictive models for the remaining cases is to apply suitable machine learning algorithms that have not yet been used on existing in vitro pharmacogenomics datasets. Here, we apply XGBoost integrated with a stringent feature selection approach, a combination that is advantageous for these high-dimensional problems. In this way, we identified and validated 118 predictive models for 62 drugs across five cancer types by exploiting four molecular profiles (sequence mutations, copy-number alterations, gene expression, and DNA methylation). Predictive models were found in each cancer type and with every molecular profile. On average, no omics profile or cancer type obtained models with higher predictive accuracy than the rest. However, within a given cancer type, some molecular profiles were overrepresented among predictive models. For instance, copy-number alteration (CNA) profiles were predictive in breast invasive carcinoma (BRCA) cell lines, but not in small cell lung cancer (SCLC) cell lines, where gene expression (GEX) and DNA methylation profiles were the most predictive. Lastly, we identified the best XGBoost model per cancer type and analyzed its selected features. For each model, some of the genes in the selected list had already been found to be individually linked to the response to that drug, providing additional evidence of the usefulness of these models and the merits of the feature selection scheme.

Highlights

  • Large-scale cancer in vitro pharmacogenomics databases have been generated over the last six years

  • The resulting pharmacogenomics databases have in turn spurred the generation of a range of computational models for drug response prediction in cancer cell lines [7,8,9]

  • Data-driven feature selection has been employed to mitigate this form of overfitting, e.g., studies using feature selection embedded in Random Forest (RF) [14,15,16,17]

Introduction

Large-scale cancer in vitro pharmacogenomics databases have been generated over the last six years. The best known are the Cancer Cell Line Encyclopedia (CCLE)-based project [1], the Cancer Therapeutics Response Portal (CTRP) [2], and the Genomics of Drug Sensitivity in Cancer (GDSC) [3,4,5], which provide a deep molecular characterization (mutations, copy-number alterations, DNA methylation, and gene expression levels) of large panels of cell lines prior to being tested with hundreds of drugs. Those datasets have been growing ever since; for example, the latest GDSC study employed 265 drugs characterized in 1001 cell lines, with 990 cell line drug responses being readily available.

To tackle these potential issues, we will later introduce an adaptive and data-driven feature selection scheme to build models of optimal complexity.
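As a rough illustration of what a data-driven feature selection scheme of this kind can look like, the following minimal sketch ranks features by their association with the response and then chooses how many of the top-ranked features to keep by cross-validation. It is not the scheme used in this study: the correlation-based ranking, the nearest-neighbour regressor standing in for XGBoost, the candidate subset sizes, and the synthetic data are all assumptions made so that the example runs with the Python standard library alone.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def rank_features(X, y):
    """Feature indices sorted by decreasing |correlation| with the response."""
    p = len(X[0])
    score = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: -score[j])

def knn_predict(X_train, y_train, x, feats, n_neighbors=5):
    """Mean response of the nearest training samples, using only `feats`.
    (Hypothetical stand-in learner; the study itself uses XGBoost.)"""
    dist = sorted(
        (sum((row[j] - x[j]) ** 2 for j in feats), target)
        for row, target in zip(X_train, y_train)
    )
    nearest = dist[:n_neighbors]
    return sum(t for _, t in nearest) / len(nearest)

def cv_rmse(X, y, n_feats, n_folds=5):
    """Cross-validated RMSE when keeping the top `n_feats` features.
    The ranking is recomputed inside each training fold to avoid leakage."""
    n = len(X)
    sq_err = 0.0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        feats = rank_features(X_tr, y_tr)[:n_feats]
        for i in test_idx:
            sq_err += (knn_predict(X_tr, y_tr, X[i], feats) - y[i]) ** 2
    return math.sqrt(sq_err / n)

def select_model_size(X, y, candidates):
    """Return (best_n_feats, rmse_by_size) over the candidate subset sizes."""
    rmse = {k: cv_rmse(X, y, k) for k in candidates}
    return min(rmse, key=rmse.get), rmse

# Toy data (fabricated for illustration): 200 samples, 30 features,
# of which only features 0 and 1 influence the response.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(200)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.5) for row in X]

candidates = [1, 2, 5, 10, 30]
best_k, rmse = select_model_size(X, y, candidates)
selected = rank_features(X, y)[:best_k]
print(best_k, selected, {k: round(v, 3) for k, v in rmse.items()})
```

The point of the design is that model complexity (the number of retained features) is chosen by the data rather than fixed in advance, and that the ranking is redone within each training fold so that held-out samples never influence which features are selected.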
