Abstract

Prediction models are widely used in cancer research for tasks including calling mutation status, identifying cancer subtypes, and performing prognostic analysis. A fundamental assumption in supervised machine learning is that the data to be classified is drawn from the same distribution as the data used to train the classifier. However, challenges in data acquisition often mean that few or no labeled examples are available for the distribution of interest. For instance, sample sizes may be insufficient to train on rare cancer types, and technological limitations can hinder label generation, such as the lack of simultaneous profiling of gene expression and mutation information in single-cell data. For such situations, where labeled target data is limited, the fields of domain adaptation and transfer learning have established principled ways to develop predictors for the data of interest (target data) using labeled data from a similar but distinct distribution (source data). One recent method, weighted elastic net domain adaptation (wenda), leverages the complex interactions between features (such as genes) to optimize a model’s predictive power on both source and target datasets. It learns the dependency structure between features and prioritizes those that are similar across distributions. This approach has previously been shown to significantly improve accuracy on predictions from a mismatched distribution, overcoming the limitations of traditional supervised models. Unfortunately, wenda requires training a separate Gaussian process model for each feature, which is computationally expensive and resists parallelization, making it infeasible for researchers to use at genome scale. We have developed and implemented a modified form of the underlying algorithm, called wenda_gpu, which allows fast, efficient model training for genome-scale datasets on a single GPU-enabled computer.
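The idea of prioritizing features that behave similarly across distributions can be illustrated with a feature-weighted elastic net. The following is a minimal sketch, not the wenda or wenda_gpu implementation: the per-feature Gaussian process step is omitted, the confidence scores are made-up values standing in for its output, and the mapping from confidence to penalty weight (one minus the confidence) is an assumption chosen for illustration. All function and variable names are hypothetical.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold` (the lasso proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def weighted_elastic_net(X, y, penalty_weights, lam=1.0, alpha=0.5, n_iter=200):
    """Coordinate descent for
        (1 / 2n) * ||y - X b||^2
        + lam * sum_j w_j * (alpha * |b_j| + (1 - alpha) / 2 * b_j^2),
    where w_j is a per-feature penalty weight (hypothetical helper).
    """
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    residual = y.astype(float).copy()          # equals y - X @ coef for coef = 0
    col_sq = (X ** 2).sum(axis=0) / n_samples  # (1/n) * x_j^T x_j
    for _ in range(n_iter):
        for j in range(n_features):
            # Correlation of feature j with the residual that excludes feature j.
            rho = X[:, j] @ residual / n_samples + col_sq[j] * coef[j]
            l1 = lam * alpha * penalty_weights[j]
            l2 = lam * (1.0 - alpha) * penalty_weights[j]
            new_coef = soft_threshold(rho, l1) / (col_sq[j] + l2)
            # Keep the residual in sync with the updated coefficient.
            residual += X[:, j] * (coef[j] - new_coef)
            coef[j] = new_coef
    return coef


# Toy source data: the label depends on features 0 and 1, not on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

# Made-up confidences that each feature behaves the same in the target data.
# Assumption: penalty weight = 1 - confidence, so trusted features are penalized less.
confidence = np.array([1.0, 0.0, 0.5])
penalty_weights = 1.0 - confidence

coef_uniform = weighted_elastic_net(X, y, np.ones(3), lam=4.0)
coef_weighted = weighted_elastic_net(X, y, penalty_weights, lam=4.0)
print("uniform penalty: ", coef_uniform)
print("weighted penalty:", coef_weighted)
```

In this toy setting, a strong uniform penalty shrinks every coefficient to zero, while the weighted penalty keeps the trusted feature 0 and drops feature 1, which is predictive in the source data but has been assigned zero confidence of transferring to the target.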
The wenda_gpu implementation exploits both quasi-Newton parameter optimization and the computational power of GPUs for significant speedups without sacrificing accuracy. It can tackle training tasks on data at the scale of The Cancer Genome Atlas (TCGA), which were infeasible without these technical advances. We demonstrate the use of wenda_gpu on a range of TCGA-scale prediction tasks, building accurate predictive models that generalize to target datasets where supervised models could not be trained due to the lack of labeled data. We also trained models from gene expression data for cross-cancer type mutation prediction, which outperformed a regular elastic net. We anticipate that wenda_gpu will enable researchers to build accurate predictive models in cases where supervised models were previously not possible due to lack of labeled data, including rare cancers.

Citation Format: Ariel A. Hippen, Jake Crawford, Jacob R. Gardner, Casey S. Greene. Efficient domain adaptation for cancer mutation prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1222.