Continually adapting pre-trained language model to universal annotation of single-cell RNA-seq data.

Hui Wan, Minghua Deng, Musu Yuan, Yiwei Fu

doi:10.1093/bib/bbae047

Abstract

Cell-type annotation of single-cell RNA-sequencing (scRNA-seq) data is a hallmark of biomedical research and clinical application. Current annotation tools usually assume that all well-annotated data are acquired simultaneously, and they cannot expand their knowledge from new data. Such tools are ill-suited to the continuous emergence of scRNA-seq data, which calls for a continual cell-type annotation model. In addition, owing to their powerful information integration and model interpretability, transformer-based pre-trained language models have led to breakthroughs in single-cell biology research. Therefore, the systematic combination of continual learning and pre-trained language models for cell-type annotation tasks is inevitable. We herein propose a universal cell-type annotation tool, called CANAL, that continuously fine-tunes a language model pre-trained on a large amount of unlabeled scRNA-seq data as new well-labeled data emerge. CANAL alleviates catastrophic forgetting in terms of both model inputs and model outputs. For model inputs, we introduce an experience replay schema that repeatedly reviews vital examples from previous stages during the current training stage. This is achieved through a dynamic example bank with a fixed buffer size. The example bank is class-balanced and retains cell-type-specific information well, which particularly helps consolidate patterns associated with rare cell types. For model outputs, we use representation knowledge distillation to regularize the divergence between the previous and current models, thereby preserving knowledge learned in past training stages. Moreover, our universal annotation framework accounts for the inclusion of new cell types throughout the fine-tuning and testing stages: it can continuously expand the cell-type annotation library by absorbing new cell types from newly arrived, well-annotated training datasets, and it can automatically identify novel cells in unlabeled datasets. Comprehensive experiments with data streams under various biological scenarios demonstrate the versatility and high model interpretability of CANAL.

Availability: An implementation of CANAL is available from https://github.com/aster-ww/CANAL-torch.
Contact: dengmh@pku.edu.cn.
Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
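The two anti-forgetting mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the CANAL repository: the class `ExampleBank`, the function `representation_distillation`, the rule of keeping the cells closest to each class mean, and the squared-distance penalty are all assumptions chosen for illustration. The abstract only states that the bank has a fixed size and is class-balanced, and that distillation regularizes the divergence between the previous and current models.

```python
from collections import defaultdict
from math import dist


def class_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class ExampleBank:
    """Fixed-size, class-balanced replay buffer (hypothetical, not CANAL's code)."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.store = defaultdict(list)  # cell type -> stored representation vectors

    def update(self, vectors, labels):
        # Merge newly arrived labeled cells with the cells already stored.
        for vec, label in zip(vectors, labels):
            self.store[label].append(list(vec))
        if not self.store:
            return
        # Split the fixed budget evenly across every cell type seen so far.
        # With more cell types than buffer slots, each type still keeps one cell.
        quota = max(1, self.buffer_size // len(self.store))
        for items in self.store.values():
            center = class_mean(items)
            # Assumption: the cells nearest the class mean are the "vital" ones.
            items.sort(key=lambda v: dist(v, center))
            del items[quota:]

    def replay(self):
        """Return (vector, label) pairs to mix into the current training stage."""
        return [(v, label) for label, items in self.store.items() for v in items]


def representation_distillation(old_reps, new_reps):
    """Mean squared Euclidean distance between paired representations.

    Assumption: the penalty compares the previous model's representation of
    each cell with the current model's representation of the same cell.
    """
    total = sum(dist(o, n) ** 2 for o, n in zip(old_reps, new_reps))
    return total / len(old_reps)
```

In a training loop of this shape, the pairs returned by `replay()` would be appended to each new batch of labeled cells, and the distillation term would be added to the classification loss with a weighting factor. Because the per-type quota shrinks as new cell types arrive, rare cell types keep the same number of slots as abundant ones.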


Journal: Briefings in Bioinformatics
Publication Date: Jan 22, 2024
License type: CC BY-NC 4.0

Similar Papers

ScBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data
Fan Yang ... Junzhou Huang
Nature Machine Intelligence | VOL. 4
26 Sep 2022

Assessing parameter efficient methods for pre-trained language model in annotating scRNA-seq data
Yucheng Xia ... Wenyi Ge
Methods | VOL. 228
15 May 2024

ScSwinTNet: A Cell Type Annotation Method for Large-Scale Single-Cell RNA-Seq Data Based on Shifted Window Attention.
Huanhuan Dai ... Xun Wang
IEEE Journal of Biomedical and Health Informatics | VOL. PP
01 Jan 2024

Identification of kidney cell types in scRNA-seq and snRNA-seq data using machine learning algorithms
Adam Tisch ... Sylvia E Rosas
Heliyon | VOL. 10
27 Sep 2024
