Abstract

Background: Despite the increasing number of studies on breast cancer survival prediction, little attention has been paid to deceased patients and their survival lengths. Moreover, developing a model that is both accurate and interpretable remains a challenge.
Objective: This paper proposes a two-stage data analytic framework, where Stage I classifies patients as survived or deceased and Stage II predicts the number of survival months for deceased females with cancer. Since medical data are neither entirely clean nor prepared for model development, we aim to show that data preparation can strengthen a simple Generalized Linear Model (GLM) so that it predicts as accurately as complex models such as Extreme Gradient Boosting (XGB) and Multilayer Perceptron based on Artificial Neural Networks (MLP-ANNs) in both stages.
Methods: In Stage I, we use recent Surveillance, Epidemiology, and End Results (SEER) data from 2004 to 2016 to predict short-term survival status from 6 months to 3 years in 6-month increments. The Synthetic Minority Over-sampling Technique (SMOTE), Relocating Safe-Level SMOTE (RSLS), and Adaptive Synthetic (ADASYN) re-sampling techniques and the Least Absolute Shrinkage and Selection Operator (LASSO) and Random Forest (RF) feature selection methods, together with integer and one-hot encoding, are combined with three popular data mining methods: GLM, XGB, and MLP. In Stage II, we predict the number of survival months for patients who are correctly predicted as deceased within 3 years. Again, we employ GLM, XGB, and MLP for regression, along with LASSO and RF for feature selection and one-hot encoding for the categorical features.
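The data preparation steps named in the Methods can be illustrated with a minimal sketch. It shows one-hot encoding of a categorical level and a simplified SMOTE-style over-sampler that interpolates between minority-class rows. The function names, the toy rows, and the parameter choices are assumptions made for illustration; they are not taken from the study, and the sketch implements neither RSLS nor ADASYN.

```python
import math
import random


def one_hot(value, levels):
    """Encode a categorical value as a 0/1 vector over the known levels."""
    return [1.0 if value == level else 0.0 for level in levels]


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority rows by interpolating toward neighbors.

    Each synthetic row lies on the segment between a randomly chosen
    minority row and one of its k nearest minority neighbors
    (Euclidean distance). This is a simplified form of SMOTE.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen row, excluding itself
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy rows for the minority class: [age in years, tumor size in mm]
minority_rows = [[62.0, 31.0], [70.0, 45.0], [58.0, 28.0], [66.0, 39.0]]
new_rows = smote(minority_rows, n_new=4)

# Hypothetical categorical feature with three levels
grade_vector = one_hot("II", ["I", "II", "III"])
```

Interpolation of this kind is defined for numeric features; applying it directly to one-hot columns would produce fractional values, so a real pipeline would need to handle categorical columns separately.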
Results: We obtain Area Under the Receiver Operating Characteristic Curve (AUC) values of 0.900, 0.898, 0.877, 0.852, 0.852, and 0.858 for the 6-month, 1-, 1.5-, 2-, 2.5-, and 3-year survival time-points, respectively, using OneHotEncoding-GLM-LASSO-ADASYN. We use the change in the Odds Ratio values in GLM to show the impact of individual categorical levels and numerical features on the odds of death. In Stage II, we obtain a Mean Absolute Error (MAE) of 7.960 months using OneHotEncoding-GLM-LASSO when predicting the number of survival months for deceased patients. We present the top contributing features and their coefficient values to illustrate how the presence of each feature alters the predicted number of survival months.
Conclusion: To the best of our knowledge, this is the first study that implements both breast cancer survival classification and regression in a two-stage approach. All data-driven findings are presented to help clinicians make better care decisions using GLM, an interpretable and computationally efficient method that predicts survival status and, for deceased patients, survival length, and to help foster human and machine interactions.
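For readers less familiar with the quantities reported in the Results, the following sketch shows how an odds ratio, an AUC, and an MAE can be computed. It assumes that the Stage I GLM uses a logit link, so that a coefficient maps to an odds ratio through the exponential function. The function names and all numbers in the example are hypothetical and are not results from the study.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a feature of a logit-link GLM."""
    return math.exp(coefficient)


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical Stage I output: 1 = deceased, 0 = survived
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Hypothetical Stage II output: survival months, actual vs. predicted
example_mae = mean_absolute_error([10, 20, 30], [12, 18, 33])

# A hypothetical coefficient of 0 leaves the odds of death unchanged
example_or = odds_ratio(0.0)
```

An odds ratio above 1 indicates that the feature raises the odds of death, while a value below 1 indicates that it lowers them.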
