Abstract

scRNA-seq datasets are increasingly used to identify gene panels that can be probed using alternative technologies, such as spatial transcriptomics, where choosing the best subset of genes is vital. Existing methods are limited by a reliance on pre-existing cell type labels or by difficulties in identifying markers of rare cells. We introduce an iterative approach, geneBasis, for selecting an optimal gene panel, where each newly added gene captures the maximum distance between the true manifold and the manifold constructed using the currently selected gene panel. Our approach outperforms existing strategies and can resolve cell types and subtle cell state differences.

Highlights

  • Single-cell RNA sequencing is a fundamental approach for studying transcriptional heterogeneity within individual tissues, organs, and organisms

  • More recent technological advances such as single-cell multi-omics assays, CRISPR screens, and spatial transcriptomics go beyond measuring only the transcriptome, facilitating a more complete understanding of the features that underpin cellular function. In many of these cases, and especially for a large number of spatial transcriptomics assays, the set of genes to probe is an important design choice, which in turn calls for appropriate computational tools

  • We have shown that geneBasis outperforms existing methods, both in computational speed and in identifying relevant sets of genes, and that it selects genes that characterize both local and global axes of variation recoverable from a k-nearest neighbor (k-NN) graph representation of transcriptional similarities. geneBasis also allows user knowledge to be incorporated directly: a set of genes of particular biological relevance can be selected a priori and is then augmented by the algorithm
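
The iterative selection described above can be illustrated with a minimal sketch. This is not the geneBasis implementation: the function names (`knn_indices`, `neighbour_mean`, `greedy_panel`), the brute-force Euclidean k-NN search, and the scoring rule (how much worse a gene is predicted from neighbors in the graph built on the current panel than from neighbors in the graph built on all genes) are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbours of each row of X (Euclidean, self excluded)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                      # a cell is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def neighbour_mean(X, nbrs):
    """Average expression of every gene over each cell's neighbours."""
    return X[nbrs].mean(axis=1)  # X[nbrs] has shape (cells, k, genes)

def greedy_panel(X, n_genes, k=5, initial=()):
    """Greedily grow a gene panel (hypothetical scoring, see lead-in).

    X       : array of shape (cells, genes), e.g. log-normalised counts
    initial : optional genes chosen a priori; the panel is built on top of them
    """
    selected = list(initial)
    # "True" graph: neighbours computed from all genes.
    true_pred = neighbour_mean(X, knn_indices(X, k))
    true_err = np.linalg.norm(X - true_pred, axis=0)  # one error per gene

    while len(selected) < n_genes:
        if selected:
            # Graph built only from the genes picked so far.
            sel_pred = neighbour_mean(X, knn_indices(X[:, selected], k))
        else:
            # No panel yet: fall back to predicting each gene by its global mean.
            sel_pred = np.broadcast_to(X.mean(axis=0), X.shape)
        sel_err = np.linalg.norm(X - sel_pred, axis=0)

        # Genes poorly captured by the current panel but well captured
        # by the full graph get the highest score.
        score = sel_err - true_err
        for g in selected:
            score[g] = -np.inf  # never pick the same gene twice
        selected.append(int(np.argmax(score)))
    return selected

# Usage on random data, with gene 2 supplied a priori.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 8))
panel_demo = greedy_panel(X_demo, n_genes=4, k=5, initial=[2])
```

The brute-force neighbor search keeps the sketch dependency-free; on real scRNA-seq data an approximate nearest-neighbor method on a reduced representation would be the more realistic choice.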

Summary

Introduction

Single-cell RNA sequencing (scRNA-seq) is a fundamental approach for studying transcriptional heterogeneity within individual tissues, organs, and organisms (reviewed in [1]). A key step in the analysis of scRNA-seq data is the selection of a set of representative features, typically a subset of genes, that capture variability in the data and that can be used in downstream analysis. Established approaches for feature selection rely on quantitative per-gene metrics that aim to identify genes displaying more variability across the population of cells under study than expected by chance. Commonly used methods for detecting highly variable genes (HVGs) exploit the relationship between the mean and standard deviation of expression levels (reviewed in [2]); GiniClust leverages Gini indices [3]; and M3Drop performs dropout-based feature selection [4]. A recently developed approach, scPNMF, further addresses the gene complexity problem by leveraging a Non-Negative Matrix Factorization (NMF) representation of scRNA-seq data, with the selected features suggested to represent interesting biological variability in the data [6]. However, scPNMF relies on the chosen dimension for the NMF representation and does not directly compare informativeness between different factors, which impedes comparison of importance (i.e., gene weights) between different factors.

Missarova et al. Genome Biology (2021) 22:333
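
As a small illustration of a per-gene variability metric of the kind mentioned above, the sketch below computes a plain Gini index for each gene. It is not GiniClust's actual pipeline, which involves further steps; the function names `gini_index` and `gini_per_gene` are hypothetical, and the input is assumed to be a cells-by-genes array of non-negative expression values.

```python
import numpy as np

def gini_index(values):
    """Gini index of a non-negative 1-D sample (0 = perfectly even)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0  # convention for empty or all-zero genes
    ranks = np.arange(1, n + 1)
    # Rank-based formula, equivalent to the mean absolute difference definition.
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)

def gini_per_gene(counts):
    """Gini index for every column (gene) of a cells-by-genes array."""
    counts = np.asarray(counts, dtype=float)
    return np.array([gini_index(counts[:, j]) for j in range(counts.shape[1])])

# A gene expressed evenly in all cells versus one expressed in a single cell.
even_gene = [1, 1, 1, 1]
rare_gene = [0, 0, 0, 1]
print(gini_index(even_gene), gini_index(rare_gene))  # prints 0.0 0.75
```

A gene expressed in only a few cells scores close to 1, which is why Gini-type metrics are attractive for finding markers of rare cell populations.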
