Abstract

Lung cancer is a group of malignant diseases affecting the lungs and associated organs. Pre-diagnosis is an important stage for identifying the target group of persons who should proceed to the diagnosis stage. In this study, a model based on an ensemble of classifiers is proposed for predicting lung cancer from symptoms and risk factors. A data mining approach is adopted to develop the model. Data collection is based on medically confirmed and diagnosed patient cases. The collected data is passed through a data acceptance procedure that eliminates outliers, insignificant data and noise. Data accepted at this stage is pre-processed using a multi-filter approach. The pre-processed data is then passed to classifier algorithms based on rule, logic, conditional probability and neural network approaches. Performance parameters and confusion matrices are obtained for the individual algorithms under both cross-validation and training-set approaches. Classification algorithms are shortlisted based on Receiver Operating Characteristic (ROC) performance, error statistics and confusion matrices. The training-set approach generally gave better performance than the cross-validation approach. A refinement process based on the error statistics is then carried out, further reducing the number of classifiers. In this study, the Sequential Minimal Optimization, Multi-Layer Perceptron, instance-based k-nearest-neighbour learning, Logistic, Random Forest, Multiclass Classifier, LogitBoost and Random Tree classifier algorithms gave consistently better performance than the others. Feature set extraction is then carried out using the Correlation-based Feature Selection (CFS) subset selection method under different search criteria, to reduce the dimensionality of the attributes.
Feature set selection reduced the dimensionality from 76 dimensions to 20. An optimal model is developed as an ensemble of classifier algorithms under a supervised training approach. The model's outcome class labels are validated only if all the prediction classifiers give the same result. Salient features observed in this study include unintentional weight loss; pain in parts of the body; specific symptoms of lung cancer [coughing up blood (haemoptysis) or bloody mucus; chest, shoulder, or back pain; increase in volume of sputum; wheezing; shortness of breath]; and risk factors such as age at the time of diagnosis, beedi smoking, consumption of country liquor/toddy, consumption of brandy, exposure to sunlight for long durations, and close relatives suffering from cancer. These features played a major role in predicting the outcome class label.
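
The validation rule described above, in which a class label is accepted only when every classifier in the ensemble agrees, can be illustrated with a minimal sketch. This is not the study's implementation: the function name `unanimous_predict`, the three toy threshold rules and the choice to return `None` on disagreement are all assumptions made for illustration, and the abstract does not say how disagreeing cases are handled.

```python
from typing import Callable, Optional, Sequence

# A classifier here is any callable that maps a feature vector to a class label.
Classifier = Callable[[Sequence[float]], str]


def unanimous_predict(
    classifiers: Sequence[Classifier], features: Sequence[float]
) -> Optional[str]:
    """Return the shared label if all classifiers agree, otherwise None.

    Returning None for a disagreement is an assumption of this sketch.
    """
    if not classifiers:
        raise ValueError("at least one classifier is required")
    labels = {clf(features) for clf in classifiers}
    # A single distinct label means every classifier gave the same answer.
    return labels.pop() if len(labels) == 1 else None


# Hypothetical stand-ins for trained classifiers (thresholds are arbitrary).
def rule_a(x: Sequence[float]) -> str:
    return "positive" if x[0] > 0.5 else "negative"


def rule_b(x: Sequence[float]) -> str:
    return "positive" if x[1] > 0.5 else "negative"


def rule_c(x: Sequence[float]) -> str:
    return "positive" if x[0] + x[1] > 1.0 else "negative"


ensemble = [rule_a, rule_b, rule_c]
agreed = unanimous_predict(ensemble, [0.9, 0.8])  # all three rules agree
split = unanimous_predict(ensemble, [0.9, 0.2])   # rule_a and rule_b disagree
```

Requiring unanimity trades coverage for confidence: cases on which the classifiers disagree receive no validated label and would need separate handling.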
