Single-cell and spatial omics datasets can be organized and interpreted by assigning single cells to distinct types, states, locations or phenotypes. However, cell annotations are inherently ambiguous, as discrete labels with subjective interpretations are assigned to heterogeneous cell populations on the basis of noisy, sparse and high-dimensional data. Here we developed Annotatability, a framework for identifying annotation mismatches and characterizing biological data structure by monitoring the dynamics and difficulty of training a deep neural network on such annotated data. Building on this, we developed a signal-aware graph embedding method that captures cellular communities associated with target signals and enables downstream analysis of those signals. Using Annotatability, we address key challenges in the interpretation of genomic data, demonstrated on eight single-cell RNA sequencing and spatial omics datasets, including identifying erroneous annotations and intermediate cell states, delineating developmental or disease trajectories, and capturing cellular heterogeneity. These results underscore the broad applicability of annotation-trainability analysis via Annotatability for unraveling cellular diversity and interpreting collective cell behaviors in health and disease.
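
To make the idea of monitoring training dynamics concrete, the following is a minimal sketch and not the Annotatability implementation. It assumes a toy two-cluster dataset in place of expression profiles, a logistic regression in place of a deep neural network, and two per-sample statistics that the abstract does not specify: the mean and standard deviation, across epochs, of the probability the model assigns to each sample's given label. All function names (`make_toy_data`, `train_and_track`, `summarize`) are hypothetical.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(n_per_class=100, n_flipped=5, seed=0):
    """Two Gaussian clusters standing in for two cell types (hypothetical data).

    The first n_flipped samples of class 0 are deliberately given label 1
    to mimic erroneous annotations.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            features.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            labels.append(label)
    flipped = list(range(n_flipped))
    for i in flipped:
        labels[i] = 1
    return features, labels, flipped


def train_and_track(features, labels, epochs=30, lr=0.5):
    """Fit a logistic regression by full-batch gradient descent.

    After every epoch, record the probability the model assigns to each
    sample's *given* label. Returns history[i] = list of per-epoch values.
    """
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    history = [[] for _ in range(n)]

    def predict(x):
        return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict(x) - y  # gradient of the log loss w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        for i, (x, y) in enumerate(zip(features, labels)):
            p = predict(x)
            history[i].append(p if y == 1 else 1.0 - p)
    return history


def summarize(history):
    """Per-sample mean (confidence) and spread (variability) across epochs."""
    confidence = [statistics.mean(h) for h in history]
    variability = [statistics.pstdev(h) for h in history]
    return confidence, variability


features, labels, flipped = make_toy_data()
history = train_and_track(features, labels)
confidence, variability = summarize(history)
# Samples the model never learns to fit are candidates for annotation review.
suspects = [i for i, c in enumerate(confidence) if c < 0.5]
```

In this toy setting, samples whose given label the model consistently fails to fit end up with low confidence and are candidates for erroneous annotation. Samples with high variability are learned late or unstably, which is one plausible way to surface ambiguous or intermediate states. The 0.5 threshold is an arbitrary choice for illustration.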