This study aimed to identify diagnostic gene biomarkers for colorectal cancer (CRC) by analyzing differentially expressed genes (DEGs) between tumor and adjacent normal samples across five colon cancer gene-expression profiles (GSE10950, GSE25070, GSE41328, GSE74602, and GSE142279) from the Gene Expression Omnibus (GEO) database. Intersecting the identified DEGs with the gene co-expression network module most highly correlated with the tumor samples yielded 283 overlapping genes. Centrality measures were calculated for these genes in the reconstructed STRING protein-protein interaction network. Applying LASSO logistic regression, eleven genes were ultimately recognized as candidate diagnostic genes. For nine of these genes (CDC25B, CDK4, IQGAP3, MMP1, MMP7, SLC7A5, TEAD4, TRIB3, and UHRF1), the area under the receiver operating characteristic curve (AUROC) exceeded 0.92 in both the training and validation sets. We evaluated the diagnostic performance of these nine genes with four machine learning algorithms: random forest (RF), support vector machine (SVM), artificial neural network (ANN), and gradient boosting machine (GBM). In the testing datasets (GSE21815 and GSE106582), the AUROC was greater than 0.95 for all four algorithms, indicating the high diagnostic performance of the nine genes. In addition, the nine genes were significantly correlated with twelve immune cell types (P < 0.05): activated mast cells, resting mast cells, M0, M1, and M2 macrophages, neutrophils, activated memory CD4 T cells, resting memory CD4 T cells, follicular helper T cells, CD8 T cells, memory B cells, and plasma cells. These results strongly suggest that all nine genes have the potential to serve as reliable diagnostic biomarkers for CRC.
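The per-gene AUROC screen described above (keep a gene only if its AUROC exceeds 0.92 in both the training and validation sets) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the gene labels GENE_A and GENE_B, and the expression values are hypothetical toy data, and the sketch assumes that higher expression points toward tumor (label 1). AUROC is computed here as the fraction of tumor/normal sample pairs in which the tumor sample has the higher value, with ties counted as one half.

```python
from typing import Dict, List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random tumor sample (label 1)
    scores higher than a random normal sample (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def genes_passing(
    train: Dict[str, List[float]],
    valid: Dict[str, List[float]],
    train_labels: Sequence[int],
    valid_labels: Sequence[int],
    threshold: float = 0.92,
) -> List[str]:
    """Return genes whose AUROC exceeds the threshold in BOTH sets."""
    keep = []
    for gene in train:
        a_train = auroc(train[gene], train_labels)
        a_valid = auroc(valid[gene], valid_labels)
        if a_train > threshold and a_valid > threshold:
            keep.append(gene)
    return keep


# Hypothetical toy expression values (1 = tumor, 0 = adjacent normal).
train_labels = [1, 1, 1, 0, 0, 0]
train = {
    "GENE_A": [8.1, 7.9, 8.4, 5.0, 5.2, 4.8],  # cleanly separates classes
    "GENE_B": [6.0, 5.1, 5.9, 5.5, 6.2, 5.0],  # overlapping distributions
}
valid_labels = [1, 1, 0, 0]
valid = {
    "GENE_A": [7.7, 8.0, 5.1, 4.9],
    "GENE_B": [5.4, 6.1, 5.8, 5.2],
}

selected = genes_passing(train, valid, train_labels, valid_labels)
print(selected)
```

On this toy data only GENE_A passes, since GENE_B's tumor and normal values overlap. For genes that are downregulated in tumors, one would negate the expression values (or take the larger of the AUROC and one minus the AUROC) before applying the threshold.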