Abstract Background: Because of sizeable batch effects, published scRNA datasets remain siloed, and no tool or package has yet demonstrated the ability to integrate more than a handful of datasets into a unified atlas. This has precluded the generation of a large unified atlas for scRNA akin to The Cancer Genome Atlas (TCGA), despite massive demand for such a resource. Methods: We built what is, to our knowledge, the first scRNA training dataset that uses cell-type labels from published studies as ground truth. Using this dataset, we trained and evaluated a variety of models specifically for the task of integrating disparate data into a unified space. We further developed a framework for evaluating how well unsupervised models perform at this task, using a new approach based on leave-one-out validation of ‘unseen’ datasets. Using the deep-learning models that performed best on the training dataset, we scaled integration to over 50 public datasets focused on solid cancers, which collectively contain data from over 1000 patient samples covering over 20 indications. Results: The pan-cancer scRNA atlas produced by this workflow is an order of magnitude larger than previous scRNA datasets and the first to span many indications alongside adjacent and separate normal tissue data. Analysis of this atlas reveals novel axes of variation in the tumor microenvironment linked to cancer-associated fibroblast (CAF) biology. For example: (a) CAF-high samples, compared with cancer-high samples, are enriched for T cells in a naïve state; (b) M2-like macrophage signatures vary between cancer-rich and CAF-rich samples, and this compartmentalization is also seen in spatial RNA data; (c) a spectrum of CAF-, perivascular-, and endothelial-like states is also observed, indicating potential cell-type plasticity.
Collectively, these observations identify novel biology and variation in the tumor microenvironment that will likely apply to many ongoing experimental projects and therapeutic programs. Conclusions: We have used deep learning to build one of the largest scRNA atlases to date, and potentially the first whose models and data will be progressively released as an open-source package to benefit the wider community. We anticipate that the ability to resolve target expression at the single-cell level will greatly enhance our understanding of the tumor microenvironment, as it aids our own efforts to drug CAF biology and the tumor stroma.
Citation Format: Javier Díaz-Mejía, Swechha X, Dylan Mendonca, Octavian Focsa, Chris Harvey, Mike Briskin, Sam Cooper. Unification of over 50 published single-cell RNA datasets covering over a 1000 patient samples with deep-learning reveals novel axes of tumor microenvironment variation [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 3813.