Abstract

Recently, the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) has generated harmonized genomic, transcriptomic, proteomic, and clinical data for >1000 tumors in 10 cohorts to facilitate pan-cancer discovery research. However, comparing protein expression across CPTAC cohorts remains challenging because missing-data patterns and protein expression distributions differ across tumor types. Here we present our efforts to evaluate various missing data handling and normalization strategies to generate a normalized pan-cancer protein expression dataset. First, we developed a novel algorithm to select proteins that are robustly expressed in tumors in any of the CPTAC cohorts. Second, we applied a cohort hybrid imputation approach to protein abundance values from FragPipe within each cohort, based on protein expression distribution patterns. Third, we calculated iBAQ from the protein abundance values and applied either global quantile normalization or smooth quantile normalization. To assess whether our missing data imputation and normalization strategy affected downstream analyses, we compared the fold change in differential protein expression between tumor and matched normal tissue for each cohort using non-normalized, global quantile normalized, and smooth quantile normalized protein iBAQ values. Our results demonstrate a strong correlation in fold change between global quantile normalized data and non-normalized data (Pearson r = 0.97 for ccRCC, 0.96 for COAD, 0.99 for LUAD, and 0.99 for LSCC). Similar results were observed when comparing smooth quantile normalized data to non-normalized data (Pearson r = 1.00 for ccRCC, COAD, LUAD, and LSCC), indicating that both normalization methods retained biological differences between tumor and matched normal tissues within cohorts.
Lastly, we identified several proteins (ERAP2, CA9, GSTM3, MX1, STAT1) whose protein and RNA expression were highly correlated across eight CPTAC cohorts (r > 0.7 for COAD, BRCA, LUAD, ccRCC, PDAC, UCEC, HNSCC, and LSCC). We then compared their protein expression rank across CPTAC cohorts with their RNA expression rank across the corresponding TCGA cohorts. Specifically, the median log2(iBAQ) in CPTAC and the median log2(TPM) in TCGA were calculated for those proteins within each indication, and indications were then ranked by median log2(iBAQ) in CPTAC and by median log2(TPM) in TCGA. Weighted rank correlation was used to measure rank agreement. Global quantile normalization yielded the highest rank correlation (weighted rank correlation ranging from 0.597 to 0.931) compared with smooth quantile normalization or no normalization. These results suggest that the combination of cohort hybrid imputation and global quantile normalization is a reasonable approach to generating a normalized CPTAC pan-cancer protein dataset that could be leveraged to interrogate protein expression across different cancer types.

Citation Format: Jixin Wang, Xiaowen Tian, Wen Yu, John Bullen, Elaine Hurt, Wenyan Zhong. Evaluating computational approaches for CPTAC pan-cancer cross-cohort protein expression comparison [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 2 (Late-Breaking, Clinical Trial, and Invited Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(7_Suppl):Abstract nr LB012.
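The rank-agreement step in the abstract (per-indication medians, ranking, weighted rank correlation) can be sketched as follows. The abstract does not state which weighted rank statistic was used, so this example assumes one simple variant: a weighted Pearson correlation computed on ranks, with hypothetical weights of 1/rank that emphasize agreement among the top-ranked indications. All values are invented toy numbers for a single protein, and the function names are illustrative.

```python
import math
from statistics import median


def rank_descending(values_by_key):
    """Rank keys by value, with rank 1 for the highest value. Assumes no ties."""
    ordered = sorted(values_by_key, key=values_by_key.get, reverse=True)
    return {key: rank for rank, key in enumerate(ordered, start=1)}


def weighted_rank_correlation(rank_a, rank_b):
    """Weighted Pearson correlation of two rankings over the same keys.

    The weights (1/rank_a + 1/rank_b) are an assumption made for this
    sketch, chosen so that top-ranked items count more. Requires at
    least two keys.
    """
    keys = sorted(rank_a)
    w = [1.0 / rank_a[k] + 1.0 / rank_b[k] for k in keys]
    x = [rank_a[k] for k in keys]
    y = [rank_b[k] for k in keys]
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)


# Toy per-sample values for one protein in four indications (invented).
protein_log2_ibaq = {
    "ccRCC": [20.1, 21.3, 20.8],
    "LUAD": [18.2, 18.9, 17.5],
    "COAD": [16.0, 15.2, 16.4],
    "LSCC": [17.1, 17.9, 18.0],
}
rna_log2_tpm = {
    "ccRCC": [9.5, 10.2, 9.9],
    "LUAD": [7.0, 7.4, 6.8],
    "COAD": [5.1, 4.8, 5.5],
    "LSCC": [7.2, 7.6, 7.1],
}

# Median expression within each indication, then rank the indications.
protein_rank = rank_descending({k: median(v) for k, v in protein_log2_ibaq.items()})
rna_rank = rank_descending({k: median(v) for k, v in rna_log2_tpm.items()})

agreement = weighted_rank_correlation(protein_rank, rna_rank)
print(f"Weighted rank correlation: {agreement:.3f}")
```

In this sketch, identical rankings give a value of 1 and fully reversed rankings give a value of -1; a different weighting scheme or a different statistic (for example a weighted Kendall tau) would give different numbers on the same data.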
