Abstract

Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.

Highlights

  • Large single-cell reference atlases[1–4] comprising millions[5] of cells across tissues, organs, developmental stages and conditions are routinely generated by consortia such as the Human Cell Atlas[6]

  • We propose a transfer learning (TL) and fine-tuning strategy, called ‘architecture surgery’, that leverages existing conditional neural network models and transfers them to new datasets, as implemented in the scArches pipeline. scArches is a fast and scalable tool for updating, sharing and using reference atlases trained with a variety of neural network models

  • A common approach to integrating such datasets is to use a conditional variational autoencoder (CVAE) (for example, single-cell variational inference[29], transfer variational autoencoder[30]) that assigns to each dataset a categorical label S_i corresponding to its study label
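
The conditioning idea in the last two highlights can be illustrated with a minimal NumPy sketch. It is not the scArches implementation. It assumes, for illustration only, that the study label S_i enters the first layer as a one-hot vector next to the expression vector, and that mapping a query study amounts to appending one weight column per new label while leaving the reference weights untouched. All names (`W_genes`, `W_study`, `add_query_studies`) and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, n_hidden = 5, 4
n_ref_studies = 2  # studies seen when the reference model was trained

# First-layer weights of a hypothetical trained reference encoder,
# split into a gene block and a study-label block.
W_genes = rng.normal(size=(n_hidden, n_genes))
W_study = rng.normal(size=(n_hidden, n_ref_studies))


def one_hot(study_index, n_studies):
    """Encode the categorical study label S_i as a one-hot vector."""
    v = np.zeros(n_studies)
    v[study_index] = 1.0
    return v


def first_layer(x, s, W_genes, W_study):
    """Conditional first layer: expression x and label s enter side by side."""
    return W_genes @ x + W_study @ s


def add_query_studies(W_study, n_new, rng):
    """Append one weight column per new query study; old columns are kept."""
    new_cols = rng.normal(scale=0.01, size=(W_study.shape[0], n_new))
    return np.concatenate([W_study, new_cols], axis=1)


# A reference cell from study 0, before and after extending the model.
x = rng.normal(size=n_genes)
before = first_layer(x, one_hot(0, n_ref_studies), W_genes, W_study)

W_study_ext = add_query_studies(W_study, n_new=1, rng=rng)
after = first_layer(x, one_hot(0, n_ref_studies + 1), W_genes, W_study_ext)

# Assumption of this sketch: only the appended column is updated in fine-tuning.
trainable = np.zeros_like(W_study_ext, dtype=bool)
trainable[:, n_ref_studies:] = True
n_trainable = int(trainable.sum())
n_total = W_genes.size + W_study_ext.size
```

In this sketch the first-layer output for a reference cell is identical before and after the extension, because its one-hot label is zero in the appended position. The trainable parameters are limited to the new column (`n_trainable` out of `n_total`), which is one way a mapping step could use far fewer parameters than retraining the whole model.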


Introduction

Large single-cell reference atlases[1–4] comprising millions[5] of cells across tissues, organs, developmental stages and conditions are routinely generated by consortia such as the Human Cell Atlas[6]. These references help to understand the cellular heterogeneity that constitutes natural and inter-individual variation, aging, environmental influences and disease. Current transfer learning (TL) approaches in genomics do not account for technical effects within and between the reference and query[19] and lack systematic retraining with query data[20–23]. These limitations can lead to spurious predictions on query data with little or no overlap in cell types, tissues or species[24,25]. We demonstrate the features of scArches using single-cell datasets ranging from pancreas to whole-mouse atlases and immune cells from patients with COVID-19. scArches can iteratively update a pancreas reference, transfer labels or unmeasured data modalities between reference atlases and query data and map COVID-19 data onto a healthy reference while preserving disease-specific variation.
