ProtoCloud: A prototypical self-explaining model for single-cell analysis.
- Peer Review Report
- 10.7554/elife.75624.sa1
- Jan 31, 2022
Single-cell chromatin accessibility analysis reveals the intricate regulatory landscape of mouse testicular development, uncovering novel cell subpopulations and transcription factors, and offering valuable insights into the molecular mechanisms driving germ cell and somatic cell maturation.
- Research Article
- 10.1158/1538-7445.am2024-lb240
- Apr 5, 2024
- Cancer Research
Single-nucleus joint ATAC- and RNA-sequencing (snMultiome) can be used to identify functionally divergent cell subpopulations based on their transcriptomic and epigenetic profiles within complex samples. Accurate cell type annotation is critical to successful snMultiome data analysis. Several computational methods have been developed for automatic annotation. Traditional cell type annotation methods initially cluster cells using unsupervised learning methods based on the gene expression profiles, then label the clusters using aggregated cluster-level expression profiles and marker genes. These methods rely heavily on the clustering results. As the purity of clusters cannot be guaranteed, false detection of cluster features may lead to incorrect annotations. Further, canonical cell surface markers may not always be suitable to be applied in single-nucleus RNA-seq studies because single-nucleus RNA-seq generally yields lower detected transcript numbers compared to typical single-cell RNA-seq. Moreover, cell type marker genes in the snRNA-seq data may differ from the ones obtained with scRNA-seq data, reflecting biological differences in the cytoplasmic and nuclear RNA pools. Lastly, the data obtained from malignant cells are best left out in establishing cell type reference data because they are too heterogeneous and patient-specific. Reference-based automated algorithms such as SingleR enable quick and unbiased classifications by leveraging a collection of built-in reference data sets for human (e.g. Human Primary Cell Atlas (microarray-based) and the combined Blueprint Epigenomics and Encode data set (RNA-seq-based)). Still, SingleR may return erroneous cell type classifications. Our dataset was generated using the 10x Genomics snMultiome platform to yield 296,557 nuclei from 82 frozen breast tumors, representing patients from diverse genetic ancestral background. Using these data, we sought to improve the accuracy of cell type annotation by SingleR. 
To achieve this, we first separated malignant and non-malignant cells based on DNA copy number aberrations (aneuploidy) using CopyKAT. For cells determined to be non-malignant, we built a custom reference from the snRNA-seq data set recently made available by The Human Breast Cell Atlas and then applied SingleR with this reference, in which each cell type is represented by single cells of that type, allowing a well-founded estimate of the confidence with which a cell type call can be made. Using this approach, we successfully identified 11 distinct cell types among non-malignant cells, including fibroblast, adipocyte, pericyte, basal, luminal-secretory, luminal-HR, myeloid, mast, vascular, lymphatic, and T-cells, which can then be further subclassified. Furthermore, we interrogated each cluster using known canonical markers and transferred the cell type labels to snATAC-seq. This approach enabled us to link peaks to genes in each cell type. We believe that this approach, which refines SingleR, can greatly improve accuracy and minimize misclassification when annotating cell types in breast tumors using snMultiome data.
Citation Format: Huaitian Liu, Alexandra Harris, Brittany Jenkins-Lord, Tiffany H. Dorsey, Francis Makokha, Shahin Sayed, Gretchen Gierach, Stefan Ambs. Cell type annotation using singleR with custom reference for single-nucleus multiome data derived from frozen human breast tumors [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 2 (Late-Breaking, Clinical Trial, and Invited Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(7_Suppl):Abstract nr LB240.
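The reference-based step in the abstract above, where each cell type is represented by individual reference cells, can be illustrated with a small sketch. This is not SingleR's implementation or the authors' pipeline; it only mimics the general idea of scoring a query cell by rank correlation against labelled reference cells and taking a high quantile of those correlations per label. The function names (`spearman`, `annotate`), the 0.8 quantile default, and the toy expression values are assumptions made for illustration.

```python
from collections import defaultdict


def ranks(values):
    """Return 1-based average ranks, so tied values share one rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no rank information
    return cov / (vx * vy) ** 0.5


def quantile(sorted_values, q):
    """Quantile with linear interpolation on an already sorted list."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (position - low) * (sorted_values[high] - sorted_values[low])


def annotate(query, reference, labels, q=0.8):
    """Assign the label whose reference cells correlate best with the query.

    Returns the best label, the margin to the runner-up (a crude
    confidence measure, hypothetical), and all per-label scores.
    """
    correlations = defaultdict(list)
    for profile, label in zip(reference, labels):
        correlations[label].append(spearman(query, profile))
    scores = {label: quantile(sorted(c), q) for label, c in correlations.items()}
    best = max(scores, key=scores.get)
    ordered = sorted(scores.values(), reverse=True)
    margin = ordered[0] - ordered[1] if len(ordered) > 1 else float("inf")
    return best, margin, scores


# Toy reference: four cells, four genes, two cell types (made-up values).
reference = [[9, 8, 0, 1], [8, 9, 1, 0], [0, 1, 9, 8], [1, 0, 8, 9]]
labels = ["T cell", "T cell", "fibroblast", "fibroblast"]
best, margin, scores = annotate([7, 9, 0, 2], reference, labels)
print(best, round(margin, 3))
```

A small margin between the best and second-best score flags calls that deserve a second look, which is one simple way to expose per-cell confidence when the reference consists of single cells rather than one averaged profile per type.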
- Research Article
- 10.1101/2025.06.05.658097
- Jun 8, 2025
- bioRxiv
Dimensionality reduction and clustering are critical steps in single-cell and spatial genomics studies. Here, we show that existing dimensionality reduction and clustering methods suffer from: (1) overfitting to the dominant patterns while missing unique ones, which impairs the detection and annotation of rare cell types and states, and (2) fitting to technical noise over biological signal. To address this, we developed DR-GEM, a self-supervised meta-algorithm that combines principles in distributionally robust optimization with balanced consensus machine learning. DR-GEM supervises itself by (1) using the reconstruction error to identify and reorient its attention to samples/cells that are otherwise poorly embedded, and (2) using balanced consensus learning as a mechanism to increase robustness and mitigate the impact of low-quality samples/cells. Applied to synthetic and real-world single cell ‘omics data, single cell resolution spatial transcriptomics, and Perturb-seq datasets, DR-GEM markedly and consistently outperforms existing methods in obtaining reliable embeddings, recovering rare cell types, filtering noise, and uncovering the underlying biology. In summary, this study surfaces and addresses a gap in single cell genomics and brings self-supervision to the realm of dimensionality reduction and clustering to better support data-driven discoveries.
- Abstract
- 10.1182/blood-2022-162180
- Nov 15, 2022
- Blood
Harmonizing the Annotation of Hematopoietic Populations in Single-Cell Atlases with the Cell Marker Accordion
- Conference Article
- 10.1145/3665689.3665769
- Jan 26, 2024
Cell-type annotation in single-cell research is crucial for understanding the heterogeneity among different cell types within tissues, aiding in-depth exploration of biological processes and disease states. Traditional annotation methods often face limitations due to their reliance on clustering algorithms, which may not accurately capture complex biological variations, and their dependency on known marker genes, hindering the discovery of rare cell types. This article proposes a supervised annotation method, scSwin, applying the Swin Transformer architecture innovatively to the field of cell-type annotation. DeepInsight, a methodology for transforming non-image data to images, is utilized to convert single-cell RNA sequencing (scRNA-seq) data to images by comprehensively comparing interrelationships among multiple genes. Subsequently, these images are processed by the Swin Transformer for automatic feature extraction to identify the cell types of scRNA-seq samples. Extensive experiments validated the superior performance of scSwin on cell-type annotation.
- Research Article
38
- 10.1101/2023.04.16.537094
- Dec 13, 2023
- bioRxiv
Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We assessed the performance of GPT-4, a highly potent large language model, for cell type annotation, and demonstrated that it can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines. Evaluated across hundreds of tissue types and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations and has the potential to considerably reduce the effort and expertise needed in cell type annotation. We also developed GPTCelltype, an open-source R software package to facilitate cell type annotation by GPT-4.
- Research Article
1
- 10.54480/slrm.v3i3.45
- Oct 21, 2022
- Systematic Literature Review and Meta-Analysis Journal
Single-cell sequencing gives us the opportunity to analyze cells at an individual level rather than at a population level. There are different types of sequencing, depending on the stage and portion of the cell from which the data are collected. Among these, single-cell RNA-seq (scRNA-seq) is the most widely used, and most applications of cell type annotation have been on scRNA-seq data. Tools have been developed for automatic cell type annotation because manual annotation is time-consuming and partially subjective. There are three main strategies for associating cell types with the gene expression profiles of single cells: using marker gene databases, correlating expression data, and transferring labels by supervised classification. In this SLR, we present a comprehensive evaluation of the available tools and the underlying approaches to perform automated cell type annotations on scRNA-seq data.
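Of the three strategies named in the abstract above, the marker-gene one is the simplest to sketch. The example below is an illustrative assumption rather than any reviewed tool's method: it scores a cluster by the mean expression of each cell type's marker genes and picks the highest score. The function name `annotate_cluster`, the small marker table, and the expression values are hypothetical; real tools draw markers from curated databases and use more careful statistics.

```python
def annotate_cluster(cluster_mean, markers):
    """Pick the cell type whose markers are most highly expressed.

    cluster_mean: dict mapping gene name -> mean expression in the cluster.
    markers: dict mapping cell type -> list of marker gene names.
    Genes missing from the cluster profile count as zero expression.
    """
    scores = {}
    for cell_type, genes in markers.items():
        values = [cluster_mean.get(gene, 0.0) for gene in genes]
        scores[cell_type] = sum(values) / len(values) if values else 0.0
    best = max(scores, key=scores.get)
    return best, scores


# Tiny illustrative marker table (a real database would be far larger).
markers = {
    "T cell": ["CD3E", "CD3D"],
    "B cell": ["MS4A1", "CD79A"],
    "monocyte": ["LYZ", "CD14"],
}

# Made-up mean expression values for one cluster.
cluster_mean = {"CD3E": 2.5, "CD3D": 2.0, "MS4A1": 0.1, "LYZ": 0.3}

best, scores = annotate_cluster(cluster_mean, markers)
print(best, scores)
```

Because the score is computed per cluster, an impure cluster can still receive a single confident-looking label, which is the weakness of cluster-level annotation that several abstracts in this list point out.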
- Research Article
25
- 10.1016/j.gpb.2022.04.001
- Apr 22, 2022
- Genomics, Proteomics & Bioinformatics
deCS: A Tool for Systematic Cell Type Annotations of Single-cell RNA Sequencing Data among Human Tissues
- Research Article
1
- 10.1186/s13059-025-03603-9
- Jun 11, 2025
- Genome Biology
Single-cell chromatin accessibility sequencing (scCAS) has proven invaluable for investigating the intricate landscape of epigenomic heterogeneity. We propose MINGLE, a mutual information-based interpretable framework that leverages cellular similarities and topological structures for accurate cell type annotation of scCAS data. Additionally, we introduce a convex hull-based strategy to effectively identify novel cell types. Extensive experiments demonstrate MINGLE’s superior annotation performance, particularly for rare and novel cell types, delivering valuable biological insights compared to existing methods. Moreover, MINGLE excels in cross-batch, cross-tissue, and cross-species scenarios, showing robustness to data imbalance and size, highlighting its versatility for complex annotation tasks.
- Research Article
14
- 10.1093/bib/bbad179
- May 14, 2023
- Briefings in Bioinformatics
Undoubtedly, single-cell RNA sequencing (scRNA-seq) has changed the research landscape by providing insights into heterogeneous, complex and rare cell populations. Given that more such data sets will become available in the near future, their accurate assessment with compatible and robust models for cell type annotation is a prerequisite. Considering this, herein, we developed scAnno (scRNA-seq data annotation), an automated annotation tool for scRNA-seq data sets primarily based on the single-cell cluster levels, using a joint deconvolution strategy and logistic regression. We explicitly constructed a reference profile for human (30 cell types and 50 human tissues) and a reference profile for mouse (26 cell types and 50 mouse tissues) to support this novel methodology (scAnno). scAnno offers a possibility to obtain genes with high expression and specificity in a given cell type as cell type-specific genes (marker genes) by combining co-expression genes with seed genes as a core. Of importance, scAnno can accurately identify cell type-specific genes based on cell type reference expression profiles without any prior information. Particularly, in the peripheral blood mononuclear cell data set, the marker genes identified by scAnno showed cell type-specific expression, and the majority of marker genes matched exactly with those included in the CellMarker database. Besides validating the flexibility and interpretability of scAnno in identifying marker genes, we also proved its superiority in cell type annotation over other cell type annotation tools (SingleR, scPred, CHETAH and scmap-cluster) through internal validation of data sets (average annotation accuracy: 99.05%) and cross-platform data sets (average annotation accuracy: 95.56%). Taken together, we established the first novel methodology that utilizes a deconvolution strategy for automated cell typing and is capable of being a significant application in broader scRNA-seq analysis. 
scAnno is available at https://github.com/liuhong-jia/scAnno.
- Research Article
10
- 10.1093/bib/bbae047
- Jan 22, 2024
- Briefings in bioinformatics
Cell-type annotation of single-cell RNA-sequencing (scRNA-seq) data is a hallmark of biomedical research and clinical application. Current annotation tools usually assume the simultaneous acquisition of well-annotated data, but without the ability to expand knowledge from new data. Yet, such tools are inconsistent with the continuous emergence of scRNA-seq data, calling for a continuous cell-type annotation model. In addition, by their powerful ability of information integration and model interpretability, transformer-based pre-trained language models have led to breakthroughs in single-cell biology research. Therefore, the systematic combining of continual learning and pre-trained language models for cell-type annotation tasks is inevitable. We herein propose a universal cell-type annotation tool, called CANAL, that continuously fine-tunes a pre-trained language model trained on a large amount of unlabeled scRNA-seq data, as new well-labeled data emerges. CANAL essentially alleviates the dilemma of catastrophic forgetting, both in terms of model inputs and outputs. For model inputs, we introduce an experience replay schema that repeatedly reviews previous vital examples in current training stages. This is achieved through a dynamic example bank with a fixed buffer size. The example bank is class-balanced and proficient in retaining cell-type-specific information, particularly facilitating the consolidation of patterns associated with rare cell types. For model outputs, we utilize representation knowledge distillation to regularize the divergence between previous and current models, resulting in the preservation of knowledge learned from past training stages. Moreover, our universal annotation framework considers the inclusion of new cell types throughout the fine-tuning and testing stages. 
We can continuously expand the cell-type annotation library by absorbing new cell types from newly arrived, well-annotated training datasets, as well as automatically identify novel cells in unlabeled datasets. Comprehensive experiments with data streams under various biological scenarios demonstrate the versatility and high model interpretability of CANAL. An implementation of CANAL is available from https://github.com/aster-ww/CANAL-torch. Contact: dengmh@pku.edu.cn.
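The experience-replay idea in the abstract above, a class-balanced example bank with a fixed buffer size, can be sketched as follows. This is not CANAL's code: the class name `BalancedExampleBank`, the equal per-class quota, and the per-example importance score used to decide which examples count as "vital" are all assumptions made for illustration, since the abstract does not specify how examples are selected.

```python
class BalancedExampleBank:
    """Fixed-capacity replay buffer that keeps classes equally represented."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}  # label -> list of (score, example), highest score first

    def add(self, examples):
        """Add (label, score, example) triples, then rebalance the bank.

        The score is a hypothetical importance measure; higher is kept first.
        """
        for label, score, example in examples:
            self.store.setdefault(label, []).append((score, example))
        if not self.store:
            return
        # Equal share per class; shrinks as new cell types arrive.
        # Assumes capacity >= number of classes, otherwise the quota is zero.
        quota = self.capacity // len(self.store)
        for items in self.store.values():
            items.sort(key=lambda pair: pair[0], reverse=True)
            del items[quota:]

    def replay(self):
        """Return every stored (label, example) pair for mixing into training."""
        return [(label, example)
                for label, items in self.store.items()
                for _, example in items]

    def __len__(self):
        return sum(len(items) for items in self.store.values())


bank = BalancedExampleBank(capacity=4)
# Stage 1: only an abundant cell type has been seen.
bank.add([("B cell", 0.9, "b1"), ("B cell", 0.5, "b2"), ("B cell", 0.7, "b3"),
          ("B cell", 0.1, "b4"), ("B cell", 0.3, "b5")])
# Stage 2: a rare cell type arrives; abundant examples make room for it.
bank.add([("rare type", 0.2, "r1")])
print(bank.replay())
```

Giving every class the same quota means a rare cell type keeps its few examples even when abundant types dominate each new batch, which is the property the abstract attributes to the class-balanced bank.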
- Research Article
24
- 10.1093/bib/bbac317
- Aug 1, 2022
- Briefings in Bioinformatics
Cell types (subpopulations) serve as bio-markers for the diagnosis and therapy of complex diseases, and single-cell RNA-sequencing (scRNA-seq) measures expression of genes at the cell level, paving the way for the identification of cell types. Although great efforts have been devoted to this issue, it remains challenging to identify rare cell types in scRNA-seq data because of the few-shot problem, lack of interpretability, and the separation of sample generation from cell clustering. To address these issues, a novel deep generative model for leveraging small samples of cells (scLDS2) is proposed, which precisely estimates the distribution of different cells and discriminates rare from non-rare cell types with adversarial learning. Specifically, to enhance the interpretability of samples, scLDS2 generates sparse fake samples of cells with the $\ell_1$-norm, where the relations among cells are learned, facilitating the identification of cell types. Furthermore, scLDS2 directly obtains cell types from the generated samples by learning the block structure with the nuclear norm, such that cells belonging to the same type are similar to each other. scLDS2 joins the generation of samples, the classification of generated and true samples of cells, and feature extraction into a unified generative framework, which transforms the rare cell type detection problem into a classification problem, paving the way for the identification of cell types with joint learning. The experimental results on 20 datasets demonstrate that scLDS2 significantly outperforms 17 state-of-the-art methods on various measures, with a 25.12% improvement in adjusted Rand index on average, providing an effective strategy for scRNA-seq data with rare cell types. (The software is written in Python and is freely available for academic use at https://github.com/xkmaxidian/scLDS2).
- Research Article
12
- 10.3390/biom12101539
- Oct 21, 2022
- Biomolecules
Recent advancement in single-cell RNA sequencing (scRNA-seq) technology is gaining more and more attention. Cell type annotation plays an essential role in scRNA-seq data analysis. Several computational methods have been proposed for automatic annotation. Traditional cell type annotation is to first cluster the cells using unsupervised learning methods based on the gene expression profiles, then to label the clusters using the aggregated cluster-level expression profiles and the marker genes’ information. Such procedure relies heavily on the clustering results. As the purity of clusters cannot be guaranteed, false detection of cluster features may lead to wrong annotations. In this paper, we improve this procedure and propose an Automatic Cell type Annotation Method (ACAM). ACAM delineates a clear framework to conduct automatic cell annotation through representative cluster identification, representative cluster annotation using marker genes, and the remaining cells’ classification. Experiments on seven real datasets show the better performance of ACAM compared to six well-known cell type annotation methods.
- Research Article
2
- 10.1101/2025.04.10.648034
- Apr 16, 2025
- bioRxiv
Background: The advancement of single cell technologies has driven significant progress in constructing a multiscale, pan-organ Human Reference Atlas (HRA) for healthy human cells, though challenges remain in harmonizing cell types and unifying nomenclature. Multiple machine learning and artificial intelligence methods, including pre-trained and fine-tuned models on large-scale atlas data, are publicly available for the single cell community users to computationally annotate and match their cell clusters to the reference atlas. Results: This study benchmarks four computational tools for cell type annotation and matching – Azimuth, CellTypist, scArches, and FR-Match – using two lung atlas datasets, the Human Lung Cell Atlas (HLCA) and the LungMAP single-cell reference (CellRef). Despite achieving high overall performance while comparing algorithmic cell type annotations to expert annotated data, variations in accuracy were observed, especially in annotating rare cell types, underlining the need for improved consistency across cell type prediction methods. The benchmarked methods were used to cross-compare and incrementally integrate 61 cell types from HLCA and 48 cell types from CellRef, resulting in a meta-atlas of 41 matched cell types, 20 HLCA-specific cell types, and 7 CellRef-specific cell types. Conclusion: This study reveals complementing strengths of the benchmarked methods and presents a framework for incremental growth of the cell type inventory in the reference atlases, leading to 68 unique cell types in the meta-atlas across CellRef and HLCA. The benchmarking analysis contributes to improving the coverage and quality of HRA construction by assessing the reliability and performance of cell type annotation approaches for single cell transcriptomics datasets.
- Research Article
17
- 10.1093/bib/bbad132
- Apr 20, 2023
- Briefings in Bioinformatics
Single-cell RNA sequencing (scRNA-seq) has significantly accelerated the experimental characterization of distinct cell lineages and types in complex tissues and organisms. Cell-type annotation is of great importance in most of the scRNA-seq analysis pipelines. However, manual cell-type annotation heavily relies on the quality of scRNA-seq data and marker genes, and therefore can be laborious and time-consuming. Furthermore, the heterogeneity of scRNA-seq datasets poses another challenge for accurate cell-type annotation, such as the batch effect induced by different scRNA-seq protocols and samples. To overcome these limitations, here we propose a novel pipeline, termed TripletCell, for cross-species, cross-protocol and cross-sample cell-type annotation. We developed a cell embedding and dimension-reduction module for the feature extraction (FE) in TripletCell, namely TripletCell-FE, to leverage the deep metric learning-based algorithm for the relationships between the reference gene expression matrix and the query cells. Our experimental studies on 21 datasets (covering nine scRNA-seq protocols, two species and three tissues) demonstrate that TripletCell outperformed state-of-the-art approaches for cell-type annotation. More importantly, regardless of protocols or species, TripletCell can deliver outstanding and robust performance in annotating different types of cells. TripletCell is freely available at https://github.com/liuyan3056/TripletCell. We believe that TripletCell is a reliable computational tool for accurately annotating various cell types using scRNA-seq data and will be instrumental in assisting the generation of novel biological hypotheses in cell biology.