Abstract

Non-small-cell lung cancer (NSCLC) is the most common type of lung cancer, accounting for nearly 85% of cases. The increasing availability of genome-wide gene expression data has facilitated the identification of gene signatures that are important for the precise classification of NSCLC subtypes and for personalized treatment decisions. Unsupervised feature selection is an effective computational technique for finding the most discriminative feature subset to distinguish different classes and to uncover information embedded in biological data. In this study, we proposed a novel unsupervised feature selection method to identify gene signatures for NSCLC subtype classification from gene expression data. The proposed method incorporated linear discriminant analysis, adaptive structure preservation, and $l_{2,1}$-norm sparse regression into a joint learning framework to select informative genes. An effective algorithm was developed to solve the resulting optimization problem. Furthermore, we performed module-based gene filtering before feature selection to reduce the computational cost. We evaluated the proposed method on an NSCLC gene expression dataset from The Cancer Genome Atlas (TCGA). The experimental results show that the proposed method identified a small number of gene signatures that classify NSCLC subtypes accurately. We also performed an enrichment analysis of the identified gene signatures, summarizing the key biological processes.
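
To make the role of the $l_{2,1}$-norm concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a weight matrix `W` with one row per gene, as is common in sparse-regression feature selection. The $l_{2,1}$-norm sums the Euclidean norms of the rows, so penalizing it pushes entire rows toward zero, and genes can then be ranked by row norm. The function names and the toy matrix are hypothetical and chosen only for illustration.

```python
import math


def row_norms(W):
    """Euclidean norm of each row of W (one row per feature/gene)."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def l21_norm(W):
    """l_{2,1} norm: the sum of the Euclidean norms of the rows of W."""
    return sum(row_norms(W))


def top_features(W, k):
    """Indices of the k rows with the largest norms, largest first."""
    norms = row_norms(W)
    order = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    return order[:k]


# Toy weight matrix (assumption): 4 hypothetical genes x 2 latent dimensions.
W = [
    [3.0, 4.0],  # large row norm: informative gene
    [0.0, 0.0],  # all-zero row: gene effectively discarded
    [0.6, 0.8],
    [0.0, 2.0],
]

print(l21_norm(W))
print(top_features(W, 2))
```

On this toy matrix the row norms are approximately 5, 0, 1 and 2, so the $l_{2,1}$-norm is approximately 8 and the two top-ranked rows are 0 and 3.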

Highlights

  • Lung cancer, a highly lethal malignant disease, has become the leading cause of cancer-related death worldwide [1]

  • INITIAL FILTERING OF GENES: To reduce the computational cost of the proposed feature selection method, we filtered out the genes that were less relevant to the two subtypes of non-small-cell lung cancer (NSCLC)

  • The increasing availability of genome-wide gene expression data has facilitated the identification of gene signatures for precise NSCLC subtype classification

Summary

INTRODUCTION

Lung cancer, a highly lethal malignant disease, has become the leading cause of cancer-related death worldwide [1]. All the above feature ranking methods are supervised: they need class labels or related gene information to select effective gene signatures for cancer subtype classification. A family of methods has been developed to maintain the underlying data structure in the embedded learning process [24]. These important structures include the global structure [25], [26], the local structure [27], [28], and the discriminative information [29], [30]. To select effective and precise gene signatures for NSCLC subtype classification, we proposed a novel unsupervised feature selection method that maintains the important data structure using only the selected features. We performed an enrichment analysis of the selected gene signatures by summarizing the key biological processes.

NOTATIONS
LINEAR DISCRIMINANT ANALYSIS
ADAPTIVE STRUCTURE PRESERVATION
PROPOSED METHOD
RESULTS
1) EVALUATION METRICS
CONCLUSION