Abstract

Single-cell genomic technologies provide an unprecedented opportunity to define molecular cell types in a data-driven fashion, but present unique data integration challenges. Many analyses require “mosaic integration”, including both features shared across datasets and features exclusive to a single experiment. Previous computational integration approaches require that the input matrices share the same number of either genes or cells, and thus can use only shared features. To address this limitation, we derive a nonnegative matrix factorization algorithm for integrating single-cell datasets containing both shared and unshared features. The key advance is incorporating an additional metagene matrix that allows unshared features to inform the factorization. We demonstrate that incorporating unshared features significantly improves integration of single-cell RNA-seq, spatial transcriptomic, SNARE-seq, and cross-species datasets. We have incorporated the UINMF algorithm into the open-source LIGER R package (https://github.com/welch-lab/liger).

Highlights

  • Single-cell genomic technologies provide an unprecedented opportunity to define molecular cell types in a data-driven fashion, but present unique data integration challenges

  • The key innovation of UINMF is the introduction of an unshared metagene matrix U to the iNMF objective function, incorporating features that belong to only one, or a subset, of the datasets when estimating metagenes and cell factor loadings

  • By including an unshared metagene matrix (U_i), we provide the capability to include unshared features during each iteration of the optimization algorithm (Fig. 1a)
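
To make the role of the unshared metagene matrix concrete, here is a minimal NumPy sketch of an iNMF-style objective extended with a per-dataset unshared block. The exact functional form is an assumption: the text above says only that U is added to the iNMF objective. The function name `uinmf_objective`, the argument names, the features-by-cells orientation, and the single penalty weight `lam` are illustrative and are not the LIGER API.

```python
import numpy as np

def uinmf_objective(E, P, W, V, U, H, lam):
    """Evaluate an assumed UINMF-style objective (illustrative sketch only).

    For each dataset i (all names and shapes are assumptions):
      E[i]: shared features x cells_i      P[i]: unshared features_i x cells_i
      V[i]: shared features x k            U[i]: unshared features_i x k
      H[i]: k x cells_i
    W (shared features x k) holds the metagenes common to all datasets.
    """
    total = 0.0
    for E_i, P_i, V_i, U_i, H_i in zip(E, P, V, U, H):
        # Stack shared and unshared features into one data matrix.
        X_i = np.vstack([E_i, P_i])
        # Shared metagenes have no unshared rows, so pad them with zeros.
        W_pad = np.vstack([W, np.zeros_like(U_i)])
        # Dataset-specific metagenes: shared block V_i on top of unshared block U_i.
        VU_i = np.vstack([V_i, U_i])
        # Reconstruction error over both shared and unshared features.
        total += np.linalg.norm(X_i - (W_pad + VU_i) @ H_i, "fro") ** 2
        # Penalty on the dataset-specific component (assumed form).
        total += lam * np.linalg.norm(VU_i @ H_i, "fro") ** 2
    return total
```

A dataset with no unshared features can pass arrays with zero rows for `P[i]` and `U[i]`, in which case its term reduces to the shared-feature case. An optimizer would minimize this quantity over nonnegative W, V, U, and H; the sketch only evaluates it.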

Summary

Introduction

Single-cell genomic technologies provide an unprecedented opportunity to define molecular cell types in a data-driven fashion, but present unique data integration challenges. Previous computational integration approaches require that the input matrices share the same number of either genes or cells, and thus can use only shared features. To address this limitation, we derive a nonnegative matrix factorization algorithm for integrating single-cell datasets containing both shared and unshared features. Some more recent methods, such as Seurat v4’s Weighted Nearest Neighbor (WNN) algorithm[13], are designed for datasets containing multiple modalities measured within the same cells, while other approaches focus on integrating modalities from different single cells into a shared latent space. Well-established methods for multi-omic integration of bulk data, such as similarity network fusion and iCluster[16,17], fall into the former category, as do recent methods for single-cell datasets with multiple modalities per cell, such as MOFA+, totalVI, and the Seurat v4 WNN algorithm[18,19,20].
