Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

Saptarshi Bej, Olaf Wolkenhauer, Anne-Marie Galow, Robert David, Markus Wolfien

doi:10.1186/s12859-021-04469-x

Abstract

Background: The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the detection of rare cells has been greatly facilitated by this technology. However, an automated, unbiased, and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it is usually necessary to generate further specific datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare-cell subpopulations constitute an imbalanced classification problem. Here, we introduce a Machine Learning (ML)-based oversampling method that takes the gene expression counts of already identified rare cells as input, generates synthetic cells from them, and then identifies similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class.

Results: We demonstrate the effectiveness of our method for three independent use cases, each consisting of already published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8635). This use case was designed to take a larger imbalance ratio (~1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq and scRNA-Seq data at a lower imbalance ratio (~1 to 26) for the training step, in order to investigate how well the algorithm handles both single-cell capture procedures and the impact of "less" rare cell types. The third dataset refers to the murine data of the Allen Brain Atlas, including more than 1 million cells. For validation purposes only, all datasets have also been analyzed traditionally using common data analysis approaches, such as the Seurat workflow.

Conclusions: In comparison to baseline testing without oversampling, our approach identifies rare cells with a robust precision-recall balance, including a high accuracy and a low false positive detection rate. A practical benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code base in R and Python is publicly available at FairdomHub, as well as GitHub, and can easily be transferred to identify other rare-cell types.
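To make the oversampling idea in the abstract more concrete, the following is a minimal sketch of a LoRAS-style generator written with NumPy only. It is not the sc-SynO or LoRAS reference implementation: the function name `loras_like_oversample`, all parameter names and defaults, and the toy expression matrix are assumptions chosen for illustration. The sketch follows the general recipe suggested by the algorithm's name: take a local neighbourhood of a rare cell, perturb its members with small Gaussian noise ("shadowsamples"), and combine a few of them with random convex weights.

```python
import numpy as np


def loras_like_oversample(minority, n_synthetic, k=5, n_shadow=10,
                          n_affine=3, sigma=0.005, seed=0):
    """Generate synthetic minority samples (illustrative, hypothetical helper).

    minority    : (n_cells, n_genes) array of expression values for rare cells
    n_synthetic : number of synthetic cells to generate
    k           : neighbourhood size (the cell itself counts as a neighbour)
    n_shadow    : noisy copies ("shadowsamples") drawn per neighbour
    n_affine    : shadowsamples combined into one synthetic cell
    sigma       : standard deviation of the Gaussian noise
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n_cells, n_genes = minority.shape
    k = min(k, n_cells)
    if n_affine > k * n_shadow:
        raise ValueError("n_affine exceeds the number of shadowsamples")

    # Pairwise Euclidean distances within the minority class only.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, n_genes))
    for i in range(n_synthetic):
        centre = rng.integers(n_cells)           # pick a rare cell at random
        hood = minority[neighbours[centre]]      # its k nearest rare cells
        # Shadowsamples: every neighbour repeated n_shadow times plus noise.
        shadows = np.repeat(hood, n_shadow, axis=0)
        shadows = shadows + rng.normal(0.0, sigma, size=shadows.shape)
        # Random convex combination of a few shadowsamples
        # (Dirichlet weights are non-negative and sum to 1).
        picked = rng.choice(len(shadows), size=n_affine, replace=False)
        weights = rng.dirichlet(np.ones(n_affine))
        synthetic[i] = weights @ shadows[picked]
    return synthetic


# Toy data mirroring the first use case's class sizes: 17 rare nuclei
# out of 8635 in total. The expression values themselves are made up.
toy_rng = np.random.default_rng(1)
rare = toy_rng.normal(loc=2.0, scale=0.1, size=(17, 20))
n_majority = 8635 - 17
synthetic = loras_like_oversample(rare, n_synthetic=n_majority - len(rare))
print(synthetic.shape)  # (8601, 20)
```

After adding the 8601 synthetic cells to the 17 real ones, the rare class matches the 8618 majority cells, so a downstream classifier would be trained on a balanced set. Whether full balancing is the right target, and which noise scale suits a given normalization, are choices this sketch does not settle.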

Highlights

The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly.
Single-cell RNA-sequencing and single-nuclei RNA-sequencing open up transcriptome-wide gene expression measurement at the single-cell level, enabling cell type cluster identification, the arrangement of populations of cells according to novel hierarchies, and the identification of cells transitioning between individual states [1].
Use case preparation: To evaluate the potential of single-cell synthetic oversampling (sc-SynO) to precisely annotate cell populations in newly generated data, we demonstrate three use cases by utilizing already published single-cell and single-nuclei RNA-Seq datasets.

Summary

Introduction

The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. Single-cell RNA-sequencing (scRNA-Seq) and single-nuclei RNA-sequencing (snRNA-Seq) open up transcriptome-wide gene expression measurement at the single-cell level, enabling cell type cluster identification, the arrangement of populations of cells according to novel hierarchies, and the identification of cells transitioning between individual states [1]. This facilitates the investigation of underlying structures in tissue, organism development, and diseases, as well as the identification of unique subpopulations in cell populations that were so far perceived as homogeneous. Many studies are interested in specialized cells (e.g., cancer cells, cardiac pacemaker cells) with an occurrence of less than 1 in 1000. The identification of such clusters, solely based on unsupervised clustering of a single dataset, remains very challenging [6]. One possible solution requires a so-called cell atlas: a curated reference system that systematically captures cell types and states, either tissue-specific or across different tissues [7].

Methods

Results

Discussion

Conclusion