Sfaira accelerates data and model reuse in single cell genomics

David S Fischer, Fabian J Theis, Olle Holmberg, Martin König, Hananeh Aliee, Sophie Tritschler, Leander Dony, Lukas Heumos, Abdul Moeed, Luke Zappia

doi:10.1186/s13059-021-02452-6

Abstract

Single-cell RNA-seq data sets are often first analyzed independently without harnessing model fits from previous studies, and are then contextualized with public data sets, requiring time-consuming data wrangling. We address these issues with sfaira, a single-cell data zoo for public data sets paired with a model zoo for executable pre-trained models. The data zoo is designed to facilitate contribution of data sets using ontologies for metadata. We propose an adaptation of cross-entropy loss for cell type classification tailored to data sets annotated at different levels of coarseness. We demonstrate the utility of sfaira by training models across anatomic data partitions on 8 million cells.
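The abstract does not spell out the adapted cross-entropy loss. One plausible reading, offered only as a sketch, is that a coarse label (for example "T cell") is treated as the set of fine-grained leaf classes beneath it in the ontology, and the loss is the negative log of the total predicted probability assigned to that set. The function name `aggregated_cross_entropy`, the toy ontology, and the logits below are hypothetical illustrations, not the sfaira API or the authors' exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregated_cross_entropy(logits, allowed_leaves):
    """Negative log of the probability mass placed on the allowed leaf classes.

    With a single allowed leaf this reduces to standard cross-entropy.
    (Assumed formulation, for illustration only.)
    """
    if not allowed_leaves:
        raise ValueError("a label must map to at least one leaf class")
    probs = softmax(logits)
    mass = sum(probs[i] for i in allowed_leaves)
    return -math.log(mass)


# Hypothetical toy ontology: leaf classes indexed 0..2, and each label
# (coarse or fine) mapped to the leaf indices it covers.
LEAVES = ["CD4 T cell", "CD8 T cell", "B cell"]
DESCENDANT_LEAVES = {
    "T cell": {0, 1},      # coarse label covers both T cell subtypes
    "CD4 T cell": {0},
    "CD8 T cell": {1},
    "B cell": {2},
}

logits = [2.0, 1.0, 0.1]  # made-up model output for one cell
fine_loss = aggregated_cross_entropy(logits, DESCENDANT_LEAVES["CD4 T cell"])
coarse_loss = aggregated_cross_entropy(logits, DESCENDANT_LEAVES["T cell"])
```

Under this reading, a cell annotated only as "T cell" is not penalized for how the model splits probability between the two T cell subtypes, so coarsely and finely annotated data sets can contribute to the same classifier.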

Highlights

Many single-cell data sets are currently published in various databases in different formats, such as custom formats on GEO, manuscript supplements with tables of cell type annotations, or streamlined formats on Human Cell Atlas servers.
We identify two core issues with the current state of data and model re-use in single-cell genomics.
The sfaira model zoo is designed to be model agnostic and to serve as a unified front-end for serving and receiving models, thereby enabling transfer of models from developers to users. In addition to these practical advantages of a data and model zoo, we address the issue of interpretability and generalizability of models.

Summary

Introduction

Many single-cell data sets are currently published in various databases in different formats, such as custom formats on GEO, manuscript supplements with tables of cell type annotations, or streamlined formats on Human Cell Atlas servers. In smaller data sets, rare cell states can often only be properly analyzed after integration with larger reference atlas data sets. This integration is time-intensive and requires a prior analysis of the reference data set. Data processing and cell type annotation are repeated elements of these pipelines that are time-intensive for analysts because of the complexity of the pipelines used [4]. Both computing an embedding and clustering require basic preprocessing, such as scaling, log-transformation, and highly variable feature selection.
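The preprocessing steps named above (log-transformation, highly variable feature selection, scaling) can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up count matrix, not the pipeline used by sfaira: the function names and the `target_sum` default are assumptions, and ranking genes by plain variance is a simple stand-in for the dispersion-based selection methods used in practice.

```python
import math
from statistics import mean, pvariance


def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell (row) to a common total count, then apply log(1 + x)."""
    out = []
    for cell in counts:
        total = sum(cell)
        factor = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(x * factor) for x in cell])
    return out


def top_variable_genes(matrix, n_top):
    """Return the indices of the n_top genes (columns) with the highest variance."""
    n_genes = len(matrix[0])
    variances = [pvariance([row[g] for row in matrix]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_top])


def scale_genes(matrix, genes):
    """Keep only the selected genes and scale each to zero mean, unit variance."""
    columns = []
    for g in genes:
        col = [row[g] for row in matrix]
        mu = mean(col)
        sd = math.sqrt(pvariance(col))
        columns.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    # Transpose back to a cells x genes layout.
    return [list(row) for row in zip(*columns)]


# Made-up toy data: 4 cells x 4 genes; genes 2 and 3 are constant across cells.
counts = [
    [10, 0, 5, 5],
    [0, 10, 5, 5],
    [8, 2, 5, 5],
    [1, 9, 5, 5],
]

logged = normalize_and_log(counts)
hvg = top_variable_genes(logged, n_top=2)
scaled = scale_genes(logged, hvg)
```

The resulting `scaled` matrix is the kind of input that an embedding or clustering step would consume; repeating these steps for every new data set is the overhead the introduction describes.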

Journal: Genome Biology	Publication Date: Aug 25, 2021
Citations: 21	License type: open-access
