Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings.

Nathan J Leroy, Jason P Smith, Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Donald E Brown, Aidong Zhang, Nathan C Sheffield

doi:10.1093/nargab/lqae073

Abstract

Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with the high dimensionality and inherent sparsity of these data, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach: building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python, is open source, and is available at https://github.com/databio/geniml. Pre-trained models from this work are available on Hugging Face: https://huggingface.co/databio.
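
The workflow described in the abstract (region embeddings learned once, then reused to represent and annotate new cells) can be illustrated with a minimal sketch. This is not the scEmbed or geniml API: the function names `embed_cells` and `annotate`, the toy region vectors, and the cell-type labels are all hypothetical. The sketch also assumes that a cell is represented by the average of the vectors of its accessible regions and that annotation is a cosine-similarity nearest-neighbor vote against labeled reference cells; the abstract does not spell out either step.

```python
import numpy as np


def embed_cells(accessibility, region_vectors):
    """Represent each cell as the mean vector of its accessible regions.

    accessibility: (cells x regions) binary matrix.
    region_vectors: (regions x dims) matrix standing in for a pre-trained model.
    Averaging is an assumption made for this sketch.
    """
    accessibility = np.asarray(accessibility, dtype=float)
    counts = accessibility.sum(axis=1, keepdims=True)
    counts[counts == 0] = 1.0  # cells with no open regions map to the zero vector
    return accessibility @ region_vectors / counts


def annotate(query, reference, reference_labels, k=3):
    """Label each query cell by majority vote among its k most similar
    reference cells, using cosine similarity (hypothetical annotation rule)."""

    def normalize(x):
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        return x / norms

    similarity = normalize(query) @ normalize(reference).T
    nearest = np.argsort(-similarity, axis=1)[:, :k]
    labels = np.asarray(reference_labels)
    predictions = []
    for neighbors in nearest:
        values, votes = np.unique(labels[neighbors], return_counts=True)
        predictions.append(str(values[np.argmax(votes)]))
    return predictions


# Toy stand-in for pre-trained region embeddings: 6 regions, 2 dimensions.
region_vectors = np.array(
    [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
)

# Labeled reference cells and unlabeled query cells over the same 6 regions.
reference_cells = [[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1]]
reference_labels = ["T cell", "T cell", "B cell", "B cell"]
query_cells = [[1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]

reference_embeddings = embed_cells(reference_cells, region_vectors)
query_embeddings = embed_cells(query_cells, region_vectors)
predicted = annotate(query_embeddings, reference_embeddings, reference_labels)
print(predicted)
```

In this sketch the region vectors are never retrained for the query cells, which is the sense in which a pre-trained model is transferred to unseen data: only a matrix product and a neighbor search are needed per new dataset.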

Journal: NAR Genomics and Bioinformatics
Publication Date: Sep 1, 2024
Citations: 3
License type: CC BY
