Abstract

The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights/biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call