Abstract

Early diagnosis of lung cancer and distinction between its tumor types, Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC), are very important for increasing the survival rate of patients. Herein, we propose a diagnostic system based on sequence-derived structural and physicochemical attributes of the proteins involved in both types of tumors, combining feature extraction, feature selection and prediction models. We computed 1497 protein attributes and selected the important features with 12 attribute weighting models. Machine learning models, consisting of seven SVM models, three ANN models and two NB models, were then applied to the original database and to the new datasets created by the attribute weighting models. Model accuracies were calculated by 10-fold cross-validation and, for the SVM algorithms only, by wrapper validation. In line with our previous findings, dipeptide composition, autocorrelation and distribution descriptors were the most important protein features selected by the bioinformatics tools. The algorithms predicted lung cancer tumor type more accurately when applied to the datasets created by the attribute weighting models than when applied to the original dataset. Wrapper-Validation performed better than X-Validation; the best tumor type prediction came from the SVM and SVM Linear models (82%). The best ANN accuracy was obtained when the Neural Net model was applied to the SVM dataset (88%). This is the first report suggesting that the combination of protein features and attribute weighting models with machine learning algorithms can be effectively used to predict the type of lung cancer tumors (SCLC and NSCLC). Electronic supplementary material: The online version of this article (doi:10.1186/2193-1801-2-238) contains supplementary material, which is available to authorized users.
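Dipeptide composition, one of the sequence-derived features named above, can be illustrated with a minimal sketch. It assumes the common definition (the count of each of the 400 ordered residue pairs divided by the number of overlapping pairs in the sequence); the abstract does not state the exact definition or the tool used, so the function name and the handling of non-standard residues are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 dipeptides in a protein sequence.

    Assumption: overlapping pairs that contain a non-standard residue
    (e.g. 'X') are skipped and do not count toward the total.
    """
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short (or no valid pairs): all fractions are zero.
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}
```

For example, the sequence "ACAC" contains the overlapping pairs AC, CA and AC, so AC receives a fraction of 2/3 and CA a fraction of 1/3. Some tools report these values as percentages rather than fractions.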

Highlights

  • Lung cancer, a leading cause of death worldwide, starts in the lungs, may spread to other organs of the body and has a low survival rate of just 15% (Ganesan et al 2010a, 2010b, Nomori 2011)

  • Data cleaning: in the original dataset, 59 records were classified as SCLC, 30 records belonged to the NSCLC class and 25 other records belonged to the COMMON tumor classes

  • For each record, 1497 features were computed; after removing duplicate, useless and correlated attributes, the number of protein features per record decreased to 1089, and this cleaned dataset was named the Final Cleaned database (FCdb)
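The cleaning step described in the highlights (dropping duplicate, useless and correlated attributes) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: "useless" is interpreted here as a constant attribute, the correlation cutoff of 0.95 is a hypothetical value the source does not give, and the names `pearson` and `clean_attributes` are invented for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def clean_attributes(columns, max_abs_corr=0.95):
    """Filter a dict of {attribute name: list of values, one per record}.

    An attribute is dropped if it is constant (assumed meaning of "useless"),
    identical to an attribute already kept, or correlated with one above
    max_abs_corr (hypothetical threshold). Earlier attributes win ties.
    """
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant attribute carries no information
        if any(values == other for other in kept.values()):
            continue  # exact duplicate of a kept attribute
        if any(abs(pearson(values, other)) > max_abs_corr for other in kept.values()):
            continue  # highly correlated with a kept attribute
        kept[name] = values
    return kept
```

Because attributes are compared only against those already kept, the result depends on column order; a production pipeline would likely choose which member of a correlated pair to keep by a more principled rule.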


Introduction

Lung cancer, a leading cause of death worldwide, starts in the lungs, may spread to other organs of the body and has a low survival rate of just 15% (Ganesan et al 2010a, 2010b, Nomori 2011). Many different techniques, such as chest radiography (x-ray), Computed Tomography (CT), Magnetic Resonance Imaging (MRI) and sputum cytology, have been used for lung cancer classification (Grondin and Liptay 2002, Schaefer-Prokop and Prokop 2002). Most of these techniques are either expensive and time-consuming or applicable only in the advanced stages, when the survival rate of patients is very limited (Fatma et al 2012). In computer-aided diagnostic systems for lung cancer, the rate of false negative identification should be kept as low as possible so that the overall identification rate is as high as possible (Zhou et al 2002).
