Abstract. Lung cancer is one of the deadliest forms of cancer, claiming numerous lives annually. Thoracic surgery is a strategy to manage lung cancer patients; however, it poses high risks, including potential nerve damage and fatal complications leading to mortality. Predicting the success rate of thoracic surgery for lung cancer patients can be accomplished using data mining techniques based on classification principles. Medical data mining involves employing mathematical, statistical, and computational methods. In this study, the prediction of thoracic surgery success employs the random forest algorithm with recursive feature elimination for feature selection. The feature selection process yields the top 8 features. The 8 best features include 'DGN', 'PRE4', 'PRE5', 'PRE6', 'PRE10', 'PRE14', 'PRE30', and 'AGE'. Hyperparameter using GridSearchCV is then applied to enhance classification accuracy. The results of this method implementation demonstrate a predictive accuracy of 91.41%. Purpose: The study aims to develop and evaluate a Random Forest model with a Recursive Feature Elimination feature selection and applies hyperparameter GridSearchCV for predicting thoracic surgery success rate. Methods: This study uses the thoracic surgery dataset and applies various data preprocessing techniques. The dataset is then used for classification using the Random Forest algorithm and applies the Recursive Feature Elimination feature selection to obtain the best features. GridSearchCV is used in this study for hyperparameter. Result: The accuracy using the Random Forest algorithm and Recursive Feature Elimination feature selection with hyperparameters tuning GridSearchCV resulted in an accuracy of 91,41%. The accuracy was obtained from the following parameters values: bootstrap set to false, criterion set to gini, n_estimator equal to 100, max_depth set to none, min_samples_split equal to 4, min_samples_leaf equal to 1, max_features set to auto, n_jobs set to -1, and verbose set to 2 with 10-fold cross validation. Novelty: This study comparison and analysis of various dataset preprocessing methods and different model configurations are conducted to find the best model for predicting the success rate of thoracic surgery. The study also employs feature selection to choose the best feature to be used in classification process, as well as hyperparameter tuning to achieve optimal accuracy and discover the optimal values for these hyperparameters.
Read full abstract