Abstract

Lung cancer is one of the deadliest cancers in the world. Two of the most common subtypes, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), have drastically different biological signatures, yet they are often treated similarly and classified together as non-small cell lung cancer (NSCLC). LUAD and LUSC biomarkers are scarce, and their distinct biological mechanisms have yet to be elucidated. To detect biologically relevant markers, many studies have attempted to improve traditional machine learning algorithms or develop novel algorithms for biomarker discovery. However, few have used overlapping machine learning or feature selection methods for cancer classification, biomarker identification, or gene expression analysis. This study proposes to use overlapping traditional feature selection or feature reduction techniques for cancer classification and biomarker discovery. The genes selected by the overlapping method were then verified using random forest. The classification statistics of the overlapping method were compared to those of the traditional feature selection methods. The identified biomarkers were validated in an external dataset using AUC and ROC analysis. Gene expression analysis was then performed to further investigate biological differences between LUAD and LUSC. Overall, our method achieved classification results comparable to, if not better than, the traditional algorithms. It also identified multiple known biomarkers, and five potentially novel biomarkers with high discriminating values between LUAD and LUSC. Many of the biomarkers also exhibit significant prognostic potential, particularly in LUAD. Our study also unraveled distinct biological pathways between LUAD and LUSC.

Highlights

  • Lung cancer is one of the deadliest cancers in the world

  • Abbreviations: TPR: True positive rate; XGBoost: Extreme gradient boosting; QSOX1: Quiescin sulfhydryl oxidase 1; ARHGAP12: Rho GTPase activating protein 12; ARHGEF38: Rho guanine nucleotide exchange factor 38; ELFN2: Extracellular leucine rich repeat and fibronectin type III domain containing 2; MUC1: Mucin 1, cell surface associated; GPC1: Glypican 1; NECTIN1: Nectin cell adhesion molecule 1; PERP: P53 apoptosis effector related to PMP22; REPS1: RALBP1 associated Eps domain containing 1; TRIM29: Tripartite motif containing 29; CELSR2: Cadherin EGF LAG seven-pass G-type receptor 2; TUBA1C: Tubulin alpha 1c; S100A2: S100 calcium binding protein A2; KRT5: Keratin 5; KRT14: Keratin 14; KRT6A: Keratin 6A; TP63: Tumor protein P63; NAPSA: Napsin A aspartic peptidase; MLPH: Melanophilin; DSC3: Desmocollin 3

  • We obtained lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) RNA-Seq data from The Cancer Genome Atlas (TCGA)[13]. A summary of the clinical information is provided in Table 1, with more comprehensive details available on the TCGA website[13].


Summary

Introduction

Lung cancer is one of the deadliest cancers in the world. Two of the most common subtypes, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), have drastically different biological signatures, yet they are often treated and classified together as non-small cell lung cancer (NSCLC). Few studies have used overlapping machine learning or feature selection methods for cancer classification, biomarker identification, or gene expression analysis. We downloaded LUAD and LUSC RNA-Seq datasets from The Cancer Genome Atlas (TCGA)[13] and analyzed them with five feature selection methods that can rank features: Differential Gene Expression Analysis (DGE), Principal Component Analysis (PCA), Least Absolute Shrinkage and Selection Operator (Lasso), minimal-Redundancy-Maximal-Relevance (mRMR), and Extreme Gradient Boosting (XGBoost). XGBoost is a tree-based machine learning method that is not sensitive to outliers but is prone to overfitting[17,18]. To minimize this problem, we chose to use Lasso, a linear regression technique that avoids overfitting but can be influenced by highly correlated features, potentially leading to false discoveries. This study will serve as a proof of concept for the validity of overlapping feature selection methods while investigating NSCLC subtype differences and discovering novel biomarkers.
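The overlap step described above can be illustrated with a minimal sketch. It assumes each method has already produced a ranked gene list, and it keeps the genes that appear in the top k of enough methods. The function name `overlap_top_features`, the `top_k` and `min_methods` parameters, and the placeholder gene names are all hypothetical; the summary does not state how many top-ranked genes were taken from each method or whether a gene had to be selected by all five methods, so both are left as adjustable assumptions.

```python
from collections import Counter


def overlap_top_features(rankings, top_k, min_methods=None):
    """Return features found in the top_k of at least min_methods rankings.

    rankings: dict mapping a method name to a list of feature names,
              ordered from most to least important.
    min_methods: assumption, not from the source; defaults to the number
                 of methods, i.e. a strict intersection.
    """
    if min_methods is None:
        min_methods = len(rankings)
    votes = Counter()
    for ranked in rankings.values():
        # set() ensures one method cannot vote twice for the same feature
        votes.update(set(ranked[:top_k]))
    selected = [name for name, n in votes.items() if n >= min_methods]
    # Most-voted first, then alphabetical, so the output is deterministic
    return sorted(selected, key=lambda name: (-votes[name], name))


# Placeholder rankings for illustration only; these are not study results.
rankings = {
    "DGE":     ["g1", "g2", "g3", "g4", "g5"],
    "PCA":     ["g2", "g1", "g6", "g3", "g7"],
    "Lasso":   ["g3", "g2", "g1", "g8", "g9"],
    "mRMR":    ["g1", "g3", "g2", "g5", "g4"],
    "XGBoost": ["g2", "g9", "g1", "g3", "g6"],
}

strict = overlap_top_features(rankings, top_k=3)
relaxed = overlap_top_features(rankings, top_k=3, min_methods=3)
print(strict)   # genes in the top 3 of all five methods
print(relaxed)  # genes in the top 3 of at least three methods
```

Requiring agreement across methods with different failure modes (for example, a tree-based method prone to overfitting and a linear method sensitive to correlated features) is one way to reduce the chance that a gene is selected because of a single method's weakness. The selected genes would then be passed to a separate classifier for verification, as the abstract describes.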

