Abstract

The Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE) are two major studies that can be mined for therapeutic biomarkers across a wide variety of cancers. Model validation using the two datasets, however, has proved challenging: for models built on one dataset and tested on the other, neither predictions nor signatures validate consistently. While the genomic profiling seems consistent, the drug response data is not. Efforts to harmonize experimental designs have helped but have not entirely removed the model validation difficulties. In this paper, we present a partitioning strategy based on a data sharing concept that directly acknowledges a potential lack of concordance between the datasets and, in doing so, allows extraction of reproducible novel gene-drug interaction signatures as well as accurate test set predictions. We demonstrate these properties in a re-analysis of the GDSC and CCLE datasets.

Highlights

  • A typical pharmacogenomic workflow involves characterizing interesting compounds for dose-response effects on cancer cell lines, and doing functional genomic characterization in additional screens

  • The opportunity to study validation of models arose with the generation of two major pharmacogenomic datasets: the Genomics of Drug Sensitivity in Cancer (GDSC) project[1,2] and the Cancer Cell Line Encyclopedia (CCLE) project[3]

  • Given that the ultimate goal of such studies is to detect novel drug-gene interactions, successful validation implies that training data models of drug-gene interactions should yield low test set prediction errors of drug response and reproducibility of therapeutic genomic signatures on the test dataset

Introduction

A typical pharmacogenomic workflow involves characterizing interesting compounds for dose-response effects on cancer cell lines, and doing functional genomic characterization in additional screens. Given that the ultimate goal of such studies is to detect novel drug-gene interactions, successful validation implies that training data models of drug-gene interactions should yield low test set prediction errors of drug response and reproducibility of therapeutic genomic signatures on the test dataset. It should be noted that, in general, developing accurate signatures (those with true drug-gene interactions) does not always imply low prediction errors. Overfit models are also known to predict well (and these might yield reproducible signatures); the difference in test set error between overfit models and accurate, sparser models is usually not large[7].
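The two validation criteria described above, low test set prediction error and signature reproducibility, can be made concrete with a small sketch. It is only an illustration under stated assumptions: the data are synthetic stand-ins for a training study and a test study, the signature is simply the top-k genes by absolute correlation with drug response, and the predictor is a one-variable regression on a signed signature score. None of this is the partitioning strategy presented in the paper, and all function names are hypothetical.

```python
import random
from statistics import mean

def simulate(seed, n_lines=200, n_genes=20, true_genes=(0, 1, 2)):
    """Synthetic stand-in for one study: expression matrix X and drug response y.
    Assumption: only `true_genes` drive the response."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_lines)]
    y = [2.0 * sum(row[g] for g in true_genes) + rng.gauss(0, 0.5) for row in X]
    return X, y

def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def signature(X, y, k=3):
    """Top-k genes by |correlation| with response, mapped to the sign of that correlation."""
    corrs = [pearson([row[j] for row in X], y) for j in range(len(X[0]))]
    top = sorted(range(len(corrs)), key=lambda j: abs(corrs[j]), reverse=True)[:k]
    return {j: (1 if corrs[j] > 0 else -1) for j in top}

def score(row, sig):
    """Signed sum of the signature genes for one cell line."""
    return sum(sign * row[j] for j, sign in sig.items())

def fit(X, y, sig):
    """Least-squares fit of y on the signature score; returns (intercept, slope)."""
    s = [score(row, sig) for row in X]
    ms, my = mean(s), mean(y)
    slope = sum((a - ms) * (b - my) for a, b in zip(s, y)) / sum((a - ms) ** 2 for a in s)
    return my - slope * ms, slope

def prediction_mse(X, y, sig, intercept, slope):
    """Mean squared error of the fitted model on a dataset."""
    return mean((yi - (intercept + slope * score(row, sig))) ** 2 for row, yi in zip(X, y))

def jaccard(a, b):
    """Overlap of two gene sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Train on one synthetic "study", validate on the other.
X_train, y_train = simulate(seed=1)
X_test, y_test = simulate(seed=2)

sig_train = signature(X_train, y_train)
sig_test = signature(X_test, y_test)
intercept, slope = fit(X_train, y_train, sig_train)

mse = prediction_mse(X_test, y_test, sig_train, intercept, slope)  # criterion 1
overlap = jaccard(sig_train, sig_test)                             # criterion 2
print(f"test MSE: {mse:.3f}, signature overlap (Jaccard): {overlap:.2f}")
```

Reporting the two quantities separately reflects the point made in the text: a model can score well on test error without its signature being the accurate one, so neither number alone establishes successful validation.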
