Abstract 878: Enhancing single-cell RNA sequencing analysis in cancer research: A machine learning framework based on LightGBM for automated cell type annotation

Tsung Hsien Chuang, Tzu-Pin Lu, Hsiang-Han Chen, Mong-Hsun Tsai, Eric Y Chuang, Liang-Chuan Lai

doi:10.1158/1538-7445.am2024-878


Abstract


Single-cell RNA sequencing (scRNA-seq) is widely used in cancer research to characterize gene expression diversity and tumor heterogeneity. However, manual annotation of cell types in the scRNA-seq pipeline is time-consuming and depends on the expertise of the analyst, which can significantly influence the results of downstream analyses. To address this problem, we propose a novel machine learning framework that uses the LightGBM model for automated and efficient cell-type annotation of scRNA-seq data. Two independent scRNA-seq datasets of non-small cell lung cancer (NSCLC), downloaded from the Gene Expression Omnibus (GEO), were used to train and test our model. A standard quality-control and preprocessing procedure was applied to both datasets, in which poor-quality cells with low gene expression or high scores for cellular stress or death were excluded. In addition, Harmony was applied to mitigate batch effects, that is, variability arising from non-biological experimental factors. Nine cell types (endothelial, epithelial, fibroblast, macrophage, mast, plasma, pulmonary alveolar, B, and T cells) had been manually labeled in the two datasets by the data providers, and these labels were also examined using cell-type marker genes from PanglaoDB and DAVID. The manually labeled cell types served as the ground truth for training and testing our model. In the training stage, the LightGBM model was trained on the highly variable genes of the training dataset (85,000 cells from 44 NSCLC samples). The model was then evaluated on an independent test dataset (8,000 cells from 18 NSCLC samples) by comparing the automatically predicted cell types with the manually labeled ones.
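As an illustration of the evaluation step, the comparison between predicted and manually labeled cell types can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `macro_metrics`, the toy labels, and the choice of macro averaging (an unweighted mean over cell types) are all assumptions, since the abstract does not state how the per-class scores were averaged.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy, macro precision, and macro F1 for two equal-length label lists.

    Macro averaging is an assumption here; the abstract only says "average".
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equal in length")

    # Score every label seen in either list, so a cell type that is only
    # ever predicted (never true) still counts with precision 0.
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # cells per manually labeled type
    pred_count = Counter(y_pred)   # cells per predicted type
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precisions, f1s = [], []
    for label in labels:
        precision = hits[label] / pred_count[label] if pred_count[label] else 0.0
        recall = hits[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1s.append(f1)

    return {
        "accuracy": sum(hits.values()) / len(y_true),
        "precision": sum(precisions) / len(labels),
        "f1": sum(f1s) / len(labels),
    }


# Hypothetical toy labels: one epithelial cell is predicted as a fibroblast.
manual = ["T", "T", "B", "epithelial", "epithelial", "mast"]
predicted = ["T", "T", "B", "epithelial", "fibroblast", "mast"]

scores = macro_metrics(manual, predicted)
print(scores)
```

With macro averaging, every cell type contributes equally regardless of how many cells it contains, so errors in a rare type lower the score as much as errors in an abundant one. A weighted average would instead follow the cell-type proportions of the dataset.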
The training results showed that our model could successfully distinguish the nine cell types, achieving an overall average accuracy, F1 score, and precision of 0.86 each. On the independent test dataset, the model generalized well, showing high predictive performance across all cell types, with an average accuracy, F1 score, and precision of 0.8, 0.78, and 0.8, respectively. In the test dataset, however, some epithelial cells were misclassified as other cell types. This may be due to the complex gene expression patterns of tumor epithelial cells, which make accurate prediction challenging. The proposed machine learning framework facilitates cell labeling and helps unravel the intricate heterogeneity within lung cancer datasets. The combination of LightGBM and standardized preprocessing establishes a benchmark for high-throughput, accurate single-cell analysis, paving the way for more targeted discoveries with significant clinical impact.

Citation Format: Tsung Hsien Chuang, Liang-Chuan Lai, Tzu-Pin Lu, Mong-Hsun Tsai, Hsiang-Han Chen, Eric Y. Chuang. Enhancing single-cell RNA sequencing analysis in cancer research: A machine learning framework based on LightGBM for automated cell type annotation [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 878.
