CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures

Thomas D Sherman,Tiger Gao,Elana J Fertig

doi:10.1186/s12859-020-03796-9

Thomas D Sherman, Tiger Gao + Show 1 more

Open Access

https://doi.org/10.1186/s12859-020-03796-9

Copy DOI

Abstract

BackgroundBayesian factorization methods, including Coordinated Gene Activity in Pattern Sets (CoGAPS), are emerging as powerful analysis tools for single cell data. However, these methods have greater computational costs than their gradient-based counterparts. These costs are often prohibitive for analysis of large single-cell datasets. Many such methods can be run in parallel which enables this limitation to be overcome by running on more powerful hardware. However, the constraints imposed by the prior distributions in CoGAPS limit the applicability of parallelization methods to enhance computational efficiency for single-cell analysis.ResultsWe developed a new software framework for parallel matrix factorization in Version 3 of the CoGAPS R/Bioconductor package to overcome the computational limitations of Bayesian matrix factorization for single cell data analysis. This parallelization framework provides asynchronous updates for sequential updating steps of the algorithm to enhance computational efficiency. These algorithmic advances were coupled with new software architecture and sparse data structures to reduce the memory overhead for single-cell data.ConclusionsAltogether our new software enhance the efficiency of the CoGAPS Bayesian matrix factorization algorithm so that it can analyze 1000 times more cells, enabling factorization of large single-cell data sets.

Highlights

Bayesian factorization methods, including Coordinated Gene Activity in Pattern Sets (CoGAPS), are emerging as powerful analysis tools for single cell data
Non-negative matrix factorization (NMF) techniques have emerged as powerful tools to identify the cellular and molecular features that are associated with distinct biological processes from single cell data [3,4,5, 7, 14, 16]
The computational cost of implementing these approaches may be prohibitive for large single cell datasets

Summary

Results

We varied the level of sparsity in each data set and tested the amount of memory used with and without the sparse optimization enabled. We ran a benchmark on a subset of 37,000 human immune cells (umbilical cord blood) and only kept the 3000 highest variance genes. In this case, enabling the sparse optimization reduced memory overhead by 82% and cut the run time by 74%. When using 4 threads instead of 1, the run time was cut by an additional 36% We further ran this benchmark on random subsets of this data to benchmark performance as a function of number of genes and number of cells (Fig. 2). We observe that the running time in these simulations increases linearly with both the number of cells and the number of genes as expected

Conclusions

Background