Abstract
Single-cell technologies represent a revolutionary approach to resolving cell-type heterogeneity, identifying cells in specialized states, and detecting rare disease-associated cells. With the cost of single-cell technology decreasing substantially, its integration into clinical studies is gaining momentum. A new computational tool is needed to accommodate different single-cell genomics and clinical data formats while accounting for unwanted confounders. This study aims to develop a tree-based machine learning model that leverages the unprecedented resolution of single-cell multi-omics data to delineate the genomic and phenotypic drivers behind diverse immunotherapy responses. The proposed model, single-cell analysis of Clinical Tree (scanCT), is inspired by the Generalized Unbiased Interaction Detection and Estimation (GUIDE) method, chosen for its unbiased gene and protein feature selection and easy interpretation. At each tree node, scanCT learns from the data to select the genomic feature that best separates cells associated with distinct clinical responses. Confounding factors enter the regression model in each node but are not used for branch splitting, whereas gene and protein features of interest split the tree but do not enter the node-level regression. scanCT is designed to avoid selection bias toward variables with more categories or distinct values. With tree pruning and cross-validation, scanCT mitigates overfitting and improves model generalization, especially for clinical studies with few patients. In particular, scanCT naturally fits the hierarchical relationships among cell types and handles marker gene and protein interaction effects efficiently. Our approach was tested on single-cell datasets from B-cell malignancy patients undergoing Chimeric Antigen Receptor (CAR)-T cell therapy. The results from scanCT are highly interpretable.
For instance, each branch is a gene-protein combination profile, and cells are naturally partitioned by clinical association. The linear regression at each leaf node gives the clinical prediction for the cells that satisfy the splitting criteria along that branch. The regression intercept is an average estimate of toxicity (e.g., neurotoxicity) or efficacy after controlling for confounders (e.g., tumor burden). scanCT accommodates categorical or continuous clinical responses as well as survival data, and it is robust to missing values, a frequent challenge in oncological studies. scanCT represents a significant step forward in single-cell data analysis, merging complex genotypic and phenotypic information with clinical outcomes. The efficacy- and toxicity-associated genomic signatures will inform new manufacturing strategies to optimize CAR-T cell therapy products. The model and its clinical association detections are expected to extend beyond the B-cell malignancy field and benefit the broader cancer research community.
Citation Format: Ye Zheng, Long Nguyen, Peigen Zhou, Alexandre V. Hirayama. ScanCT: A tree-based machine learning model to detect single-cell genomic features associated with clinical outcomes [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 7352.
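The division of roles described in the abstract (confounders as node-level regressors, gene and protein features as split candidates only) can be illustrated with a small sketch. This is not the scanCT implementation, which the abstract does not detail. It is a minimal GUIDE-style illustration under stated assumptions: the data are synthetic, all function and variable names are hypothetical, candidate features are continuous and binned at quartiles so their chi-square statistics share the same degrees of freedom, and the split point is simply the median of the selected feature.

```python
import numpy as np

def fit_ols(conf, y):
    """Least-squares fit of y on an intercept plus the confounder(s)."""
    design = np.column_stack([np.ones(len(y)), conf])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef

def chi2_stat(resid_sign, bins):
    """Pearson chi-square statistic for residual sign vs. feature bin."""
    table = np.zeros((2, bins.max() + 1))
    np.add.at(table, (resid_sign.astype(int), bins), 1)
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
    mask = expected > 0
    return ((table - expected)[mask] ** 2 / expected[mask]).sum()

def select_split(features, conf, y):
    """Pick the feature whose quartile bins best separate the residual signs.

    The confounder is regressed out first, so it never competes as a
    split variable (assumption: one regression per node, as in GUIDE).
    """
    _, resid = fit_ols(conf, y)
    sign = resid > 0
    stats = []
    for j in range(features.shape[1]):
        cuts = np.quantile(features[:, j], [0.25, 0.5, 0.75])
        bins = np.searchsorted(cuts, features[:, j])
        stats.append(chi2_stat(sign, bins))
    return int(np.argmax(stats))

# Synthetic cells: 3 hypothetical marker features and 1 confounder
# (standing in for something like tumor burden).
rng = np.random.default_rng(0)
n = 400
features = rng.normal(size=(n, 3))
confounder = rng.normal(size=n)
# The response depends linearly on the confounder and shifts when feature 1 is high.
y = 2.0 * confounder + 1.5 * (features[:, 1] > 0) + rng.normal(scale=0.3, size=n)

best = select_split(features, confounder, y)

# Split at the median of the selected feature (a simplification) and
# fit one confounder-adjusted regression per leaf.
left = features[:, best] <= np.median(features[:, best])
coef_left, _ = fit_ols(confounder[left], y[left])
coef_right, _ = fit_ols(confounder[~left], y[~left])

print("selected feature:", best)
print("leaf intercepts:", coef_left[0], coef_right[0])
```

In this toy setting, each leaf intercept plays the role the abstract assigns to it: an average response estimate for the cells in that branch after adjusting for the confounder. Pruning, cross-validation, categorical features, survival outcomes, and missing-value handling are all omitted here.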