Abstract

Background: Determining a patient’s response to a specific therapy is a vital step in developing personalized cancer treatment. Personalized treatment relies on two key technological advancements: algorithms tailored to each patient’s molecular profile, and comprehensive datasets detailing the effects of patient diversity on drug response. Despite recent advances in machine learning-based models, each new algorithm requires curation of numerous datasets, which places a significant burden on researchers who need to evaluate and compare algorithms. Currently available datasets include extensive genomics, transcriptomics, proteomics and metabolomics measurements along with drug response data (i.e., cell viability) for each cell line, collected through the Cancer Cell Line Encyclopedia (CCLE) and other initiatives. Additional omics data are available for tumor organoids and patient samples generated by consortia such as the Human Cancer Models Initiative (HCMI) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). However, to date there has been little public effort to annotate and harmonize these data across repositories.

Methods: To address this need, we developed the Python package coderdata, which collates the most up-to-date cancer omics and drug response data. We curated four public cancer datasets, establishing reproducible pipelines that standardize data retrieval, formatting, and integration so the datasets can be built on a cluster if needed. The datasets undergo bi-monthly synchronization with FigShare, ensuring the data are up to date and available. Scientists can use coderdata to retrieve and format data from the command line as well as within Python scripts.

Results: Coderdata is a straightforward Python package that will enable the development and benchmarking of diverse machine learning applications in cancer research.
At its core, it provides reproducible data harmonization that enables heterogeneous datasets to be analyzed in bulk. The package downloads and standardizes data from multiple consortia-related resources including HCMI, CPTAC, CCLE, and BeatAML, together representing 4931 multi-omics samples from more than 250 cancer types. These datasets include data types such as copy number, mutation, transcriptomics, proteomics, miRNA, methylation, and drug response.

Conclusions: The applications of machine learning in cancer biology are hampered by the distributed nature of existing datasets. By collecting and standardizing these data, coderdata substantially reduces the time investment required for data curation in cancer research. Direct access to benchmark multi-omics and drug response datasets enables scientists to focus on algorithm development for tumor drug response prediction, potentially accelerating the discovery of therapeutic strategies.

Citation Format: Jeremy Jacobson, Sydney Schwartz, M. Ryan Weil, Neeraj Kumar, Sara Gosline. The cancer omics and drug experimental response dataset (CODERData): A harmonized benchmark dataset for machine learning models of drug response prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 6210.
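
Illustrative sketch: the harmonization step described in the Results can be pictured as mapping source-specific fields onto one shared schema and then joining omics measurements with drug response by a common sample identifier. The Python sketch below shows only that general idea. It does not use coderdata's API, and every field name, source label, drug name, and value in it is hypothetical.

```python
from collections import defaultdict

# Hypothetical per-source field names mapped onto one shared schema.
# These names are illustrative only and are not coderdata's actual schema.
COLUMN_MAPS = {
    "source_a": {"cell_line": "sample_id", "gene": "gene_symbol", "tpm": "expression"},
    "source_b": {"specimen": "sample_id", "symbol": "gene_symbol", "expr": "expression"},
}


def harmonize(records, source):
    """Rename source-specific fields to the shared schema and tag the source."""
    mapping = COLUMN_MAPS[source]
    harmonized = []
    for record in records:
        row = {mapping[key]: value for key, value in record.items() if key in mapping}
        row["source"] = source
        harmonized.append(row)
    return harmonized


def join_with_response(expression_rows, response_rows):
    """Attach each sample's drug response values to its expression rows."""
    responses = defaultdict(list)
    for row in response_rows:
        responses[row["sample_id"]].append((row["drug"], row["viability"]))
    joined = []
    for row in expression_rows:
        # Samples without any response measurement are simply skipped.
        for drug, viability in responses.get(row["sample_id"], []):
            joined.append({**row, "drug": drug, "viability": viability})
    return joined


# Toy records standing in for two differently formatted sources (made-up values).
source_a = [{"cell_line": "S1", "gene": "TP53", "tpm": 12.5}]
source_b = [{"specimen": "S2", "symbol": "TP53", "expr": 3.0}]
response = [
    {"sample_id": "S1", "drug": "drug_x", "viability": 0.42},
    {"sample_id": "S2", "drug": "drug_x", "viability": 0.91},
]

combined = harmonize(source_a, "source_a") + harmonize(source_b, "source_b")
table = join_with_response(combined, response)
for row in table:
    print(row)
```

Once both sources share the same keys, a single model-training or benchmarking routine can consume rows from either one without source-specific handling, which is the practical benefit the abstract attributes to harmonized data.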