Abstract

Rapid distinction between small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) tumors is very important in the diagnosis of this disease. Sequence-derived structural and physicochemical descriptors are very useful for machine learning prediction of protein structural and functional classes and for classifying proteins. In this study, the classification of lung tumors was investigated using 1497 attributes derived from the structural and physicochemical properties of protein sequences (based on genes defined by microarray analysis), through a combination of attribute weighting and supervised and unsupervised clustering algorithms. Eighty percent of the weighting methods selected features such as autocorrelation, dipeptide composition, and distribution of hydrophobicity as the most important protein attributes for classifying the SCLC, NSCLC, and COMMON classes of lung tumors. Most tree induction algorithms gave the same result. Descriptors of hydrophobicity distribution were high in protein sequences COMMON to both groups, and the distribution of charge in these proteins was very low, showing that COMMON proteins were very hydrophobic. Furthermore, the composition of polar dipeptides was higher in SCLC proteins than in NSCLC proteins. Some clustering models (alone or in combination with attribute weighting algorithms) came close to separating SCLC and NSCLC proteins. The Random Forest tree induction algorithm, evaluated with leave-one-out and 10-fold cross-validation, showed more than 86% accuracy in clustering and predicting the three lung cancer tumor classes. This is the first report of data mining tools being applied to effectively classify three classes of lung cancer tumors, and it highlights the importance of dipeptide composition, autocorrelation, and distribution descriptors.
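
As an illustration of one of the descriptor families named above, the following is a minimal sketch of how dipeptide composition might be computed from a protein sequence. It is not the authors' pipeline: the function name `dipeptide_composition`, the restriction to the 20 standard amino acids, and the choice to report fractions rather than percentages are all assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes); an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 possible dipeptides.

    Overlapping adjacent residue pairs are counted; pairs containing a
    non-standard residue are skipped (a simplifying assumption).
    """
    seq = sequence.upper()
    composition = {a + b: 0.0 for a, b in product(AMINO_ACIDS, repeat=2)}

    # Collect overlapping pairs made only of standard residues.
    pairs = [
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS
    ]
    if not pairs:
        return composition

    for pair in pairs:
        composition[pair] += 1.0

    total = len(pairs)
    return {dipeptide: count / total for dipeptide, count in composition.items()}
```

For the short sequence `"ACAC"`, the overlapping pairs are `AC`, `CA`, and `AC`, so `AC` receives a fraction of 2/3 and `CA` a fraction of 1/3. Each protein yields a 400-dimensional vector of this kind, which would form only part of the 1497 attributes described in the abstract.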

Highlights

  • Lung cancer is a leading cause of deaths from cancer worldwide

  • Non-small cell lung cancer (NSCLC) affects about 80% of patients and, when diagnosed at a localized stage, the 5-year survival is about 50%, whereas it decreases to 8% and 3% in the case of lymph node involvement or metastasis, respectively [1]

  • 59 records were classified as SCLC class, 30 records belonged to non-small cell lung cancer (NSCLC) class and 25 records were classified as COMMON class

Summary

Introduction

Non-small cell lung cancer (NSCLC) affects about 80% of patients and, when diagnosed at a localized stage, the 5-year survival is about 50%, whereas it decreases to 8% and 3% in the case of lymph node involvement or metastasis, respectively [1]. Because patients with non-small cell lung tumors (squamous, adenocarcinoma (AC), and large cell) are treated differently from those with small cell tumors, pathological distinction between these two types of lung tumor is very important. NSCLC is the leading cause of cancer mortality worldwide. Identifying useful prognostic biologic and molecular markers is important for evaluating biologic and molecular characteristics that differ from tumor, lymph node, metastasis (TNM) staging in NSCLC, in order to predict prognosis and establish preventive methods [7]. A better understanding of the molecular pathogenesis of SCLC would likely suggest strategies for earlier diagnosis and new molecular-targeted therapies [8].
