Hierarchical progressive learning of cell identities in single-cell data

Lieke Michielsen, Ahmed Mahfouz, Marcel J T Reinders

doi:10.1038/s41467-021-23196-8

Open Access

Abstract

Supervised methods are increasingly used to identify cell populations in single-cell data. Yet, current methods are limited in their ability to learn from multiple datasets simultaneously, are hampered by the annotation of datasets at different resolutions, and do not preserve annotations when retrained on new datasets. The latter point is especially important as researchers cannot rely on downstream analysis performed using earlier versions of the dataset. Here, we present scHPL, a hierarchical progressive learning method which allows continuous learning from single-cell data by leveraging the different resolutions of annotations across multiple datasets to learn and continuously update a classification tree. We evaluate the classification and tree learning performance using simulated as well as real datasets and show that scHPL can successfully learn known cellular hierarchies from multiple datasets while preserving the original annotations. scHPL is available at https://github.com/lcmmichielsen/scHPL.

Highlights

Supervised methods are increasingly used to identify cell populations in single-cell data
Cells in single-cell RNA-sequencing datasets are primarily annotated using clustering and visual exploration techniques, i.e., cells are first clustered into populations that are subsequently named based on the expression of marker genes
We developed scHPL, a hierarchical progressive learning approach to learn a classification tree using multiple labeled datasets (Fig. 1A) and use this tree to predict the labels of a new, unlabeled dataset (Fig. 1B)
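
To make the idea of predicting labels with a classification tree concrete, here is a minimal sketch of top-down prediction with rejection. It is an illustration under stated assumptions, not scHPL's implementation: the `Node` class, the toy two-dimensional "expression" coordinates, the cell type names, and the nearest-centroid rule with a distance cutoff are all hypothetical stand-ins chosen for brevity. The point it shows is that a cell can stop at an internal node when none of the finer populations accepts it, so its label is only as specific as the evidence allows.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Node:
    """One cell population in a toy classification tree (hypothetical structure)."""
    label: str
    centroid: tuple   # toy 2-D stand-in for a mean expression profile
    radius: float     # cells farther than this from the centroid are rejected
    children: list = field(default_factory=list)


def predict(node, cell):
    """Walk down the tree and return the deepest label that accepts the cell."""
    accepted = [c for c in node.children if dist(c.centroid, cell) <= c.radius]
    if not accepted:
        # No child accepts the cell: stop here (internal label or root).
        return node.label
    best = min(accepted, key=lambda c: dist(c.centroid, cell))
    return predict(best, cell)


# Toy tree: two coarse populations, one of which has two finer subpopulations.
tree = Node("root", (0.0, 0.0), float("inf"), [
    Node("T cell", (0.0, 5.0), 3.0, [
        Node("CD4+ T cell", (-1.0, 5.0), 0.8),
        Node("CD8+ T cell", (1.0, 5.0), 0.8),
    ]),
    Node("B cell", (5.0, 0.0), 2.0),
])

fine = predict(tree, (-1.2, 5.1))      # close to a leaf centroid
coarse = predict(tree, (0.0, 6.5))     # T-cell-like, but matches no subtype
rejected = predict(tree, (-6.0, -6.0)) # far from every known population
```

In this sketch a result of `"root"` plays the role of an unassigned cell; a real classifier would replace the distance cutoff with a trained model at each node.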

Summary

Introduction

Supervised methods are increasingly used to identify cell populations in single-cell data. Cells in single-cell RNA-sequencing (scRNA-seq) datasets are primarily annotated using clustering and visual exploration techniques, i.e., cells are first clustered into populations that are subsequently named based on the expression of marker genes. This is time-consuming and subjective[2]. In progressive learning, the task complexity is gradually increased, for instance by adding more classes, but it is essential that the knowledge of the previous classes is preserved[16,17]. This strategy allows combining information from multiple existing datasets while retaining the possibility to add more datasets afterward. A standardized nomenclature for these clusters is missing[18], so the relationship between cell populations defined in different datasets is often unknown.
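
The requirement that earlier classes are preserved while new ones are added can be sketched as follows. This is a deliberately simplified illustration, not the matching procedure used by scHPL: the function `update_tree`, the `min_fraction` threshold, the child-to-parent dictionary representation, and the majority-vote rule are all assumptions made for this example. It assumes the cells of a newly labeled dataset have already been classified with the existing tree, and it attaches each new label under the existing label that most of its cells were assigned to, without altering any existing entry.

```python
from collections import Counter


def update_tree(parent_of, new_labels, predicted_old_labels, min_fraction=0.5):
    """Return a copy of the tree with new labels attached; old entries are kept.

    parent_of: dict mapping each known label to its parent label.
    new_labels: per-cell labels from the new dataset.
    predicted_old_labels: per-cell predictions made with the existing tree.
    """
    updated = dict(parent_of)  # copy, so the existing tree is never modified

    # Count, for every new label, which existing labels its cells received.
    votes = {}
    for new, old in zip(new_labels, predicted_old_labels):
        votes.setdefault(new, Counter())[old] += 1

    for new, counter in votes.items():
        if new in updated:
            continue  # label already in the tree: preserve it unchanged
        old, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_fraction:
            updated[new] = old     # new label refines an existing population
        else:
            updated[new] = "root"  # no clear match: add as a new top-level class
    return updated


existing = {"T cell": "root", "B cell": "root"}
new_labels = ["CD4+ T cell"] * 3 + ["CD8+ T cell"] * 2 + ["Monocyte"] * 2
predicted = ["T cell", "T cell", "B cell", "T cell", "T cell", "root", "root"]

updated = update_tree(existing, new_labels, predicted)
```

Because the function only ever adds keys, every label learned from earlier datasets keeps its place in the hierarchy, which is the property the progressive-learning setting asks for.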

Methods

Results

Conclusion