Abstract

Motivation: Identifying rare subpopulations of cells is a critical step in order to extract knowledge from single-cell expression data, especially when the available data is limited and rare subpopulations only contain a few cells. In this paper, we present a data mining method to identify small subpopulations of cells that present highly specific expression profiles. This objective is formalized as a constrained optimization problem that jointly identifies a small group of cells and a corresponding subset of specific genes. The proposed method extends the max-sum submatrix problem to yield genes that are, for instance, highly expressed inside a small number of cells, but have a low expression in the remaining ones.

Results: We show through controlled experiments on scRNA-seq data that the MicroCellClust method achieves a high F1 score to identify rare subpopulations of artificially planted human T cells. The effectiveness of MicroCellClust is confirmed as it reveals a subpopulation of CD4 T cells with a specific phenotype from breast cancer samples, and a subpopulation linked to a specific stage in the cell cycle from breast cancer samples as well. Finally, three rare subpopulations in mouse embryonic stem cells are also identified with MicroCellClust. These results illustrate the proposed method outperforms typical alternatives at identifying small subsets of cells with highly specific expression profiles.

Availability and implementation: The R and Scala implementation of MicroCellClust is freely available on GitHub, at https://github.com/agerniers/MicroCellClust/. The data underlying this article are available on Zenodo, at https://dx.doi.org/10.5281/zenodo.4580332.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Next-generation single-cell sequencing technologies, such as scRNA-seq, provide an important source of data in present-day medical research

  • We propose MicroCellClust, a new data mining method to identify small subpopulations of cells with specific gene expression

  • MicroCellClust is a multivariate method that jointly looks for a small group of cells and the corresponding marker genes


Summary

Introduction

Next-generation single-cell sequencing technologies, such as scRNA-seq, provide an important source of data in present-day medical research. Several techniques have been developed toward this objective (Kiselev et al., 2019). They generally tend to group cells in relatively large clusters, but tend to miss subpopulations that only account for a small fraction of the cells. Figure 1a–c exhibits such a behavior when running SC3 (Kiselev et al., 2017), a popular method designed for single-cell clustering, on a collection of samples made of activated (GARP+) regulatory T cells and CD8 T cells from the same human patient. These two types of lymphocytes have very distinct functions, which should be reflected in their gene expression. When the GARP+ Tregs only represent a small fraction of the data (here 10%), SC3 clearly fails to identify them as forming a separate and specific cluster.
