Abstract

BackgroundWhen considering therapies for lung adenocarcinoma (LUAD) patients, the carcinogenic mechanisms of smokers are believed to differ from those who have never smoked. The rising trend in the proportion of nonsmokers in LUAD urgently requires the understanding of such differences at a molecular level for the development of precision medicine.MethodsThree independent LUAD tumor sample sets—TCGA, SPORE and EDRN—were used. Genome patterns of expression (GE), copy number variation (CNV) and methylation (ME) were reviewed to discover the differences between them for both smokers and nonsmokers. Tobacco-related signature genes distinguishing these two groups of LUAD were identified using the GE, ME and CNV values of the whole genome. To do this, a novel iterative multi-step selection method based on the partial least squares (PLS) algorithm was proposed to overcome the high variable dimension and high noise inherent in the data. This method can thoroughly evaluate the importance of genes according to their statistical differences, biological functions and contributions to the tobacco exposure classification model. The kernel partial least squares (KPLS) method was used to further optimize the accuracies of the classification models.ResultsForty-three, forty-eight and seventy-five genes were identified as GE, ME and CNV signatures, respectively, to distinguish smokers from nonsmokers. Using only the gene expression values of these 43 GE signature genes, ME values of the 48 ME signature genes or copy numbers of the 75 CNV signature genes, the accuracies of TCGA training and SPORE/EDRN independent validation datasets all exceed 76%. More importantly, the focal amplicon in Telomerase Reverse Transcriptase in nonsmokers, the broad deletion in ChrY in male nonsmokers and the greater amplification of MDM2 in female nonsmokers may explain why nonsmokers of both genders tend to suffer LUAD. These pattern analysis results may have clear biological interpretation in the molecular mechanism of tumorigenesis. Meanwhile, the identified signature genes may serve as potential drug targets for the precision medicine of LUAD.

Highlights

  • Lung cancer is currently the most common malignant disease and the leading cause of mortality in the world (Siegel, Miller & Jemal, 2018; Tian, 2019)

  • Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma are the two major histological types of non-small cell lung cancer (NSCLC) covering over 90% of cases; NSCLC itself accounts for 75–80% of lung cancer cases (Qiu et al, 2017; Chen et al, 2015; Tian, 2015)

  • Datasets Three independent lung adenocarcinoma (LUAD) datasets were used in our study as follows: TCGA data The Cancer Genome Atlas (TCGA) project (Legacy GDC) data is a public dataset hosted by TCGA

Read more

Summary

Introduction

Lung cancer is currently the most common malignant disease and the leading cause of mortality in the world (Siegel, Miller & Jemal, 2018; Tian, 2019). A novel iterative multi-step selection method based on the partial least squares (PLS) algorithm was proposed to overcome the high variable dimension and high noise inherent in the data This method can thoroughly evaluate the importance of genes according to their statistical differences, biological functions and contributions to the tobacco exposure classification model. The focal amplicon in Telomerase Reverse Transcriptase in nonsmokers, the broad deletion in ChrY in male nonsmokers and the greater amplification of MDM2 in female nonsmokers may explain why nonsmokers of both genders tend to suffer LUAD. These pattern analysis results may have clear biological interpretation in the molecular mechanism of tumorigenesis. The identified signature genes may serve as potential drug targets for the precision medicine of LUAD

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call