ABSTRACT
Globally, cancer is the second‐leading cause of death after cardiovascular disease. To improve survival rates, risk factors and predictors of cancer must be identified early. Researchers have developed several kinds of machine learning‐based diagnostic systems for early cancer prediction. This study presents a diagnostic system that identifies the risk factors linked to the onset of cancer so that cancer can be anticipated early. The proposed system consists of two modules: the first uses a statistical F‐score method to rank the variables in the dataset, and the second uses a random forest (RF) model for classification. The hyperparameters of the RF model were optimized with a genetic algorithm to improve accuracy. A dataset of 10 765 samples with 74 variables per sample was obtained from the Swedish National Study on Aging and Care (SNAC). The classes in this dataset are extremely imbalanced, which can bias a model trained on it. To address this issue, we balanced the classes using random undersampling (RUS). The components are integrated into a single model called F‐RUS‐RF. Using only the six highest‐ranked variables according to the F‐score, the F‐RUS‐RF model achieved its highest accuracy of 86.15%, with a sensitivity of 92.25% and a specificity of 85.14%. Addressing the risk factors identified by the F‐RUS‐RF model may help lower the incidence of cancer in the aging population.
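The three stages named in the abstract (F‐score ranking, random undersampling, and RF classification) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the F‐score is taken to be the ANOVA F‐statistic from scikit‐learn's `f_classif`, which may differ from the statistic used in the study; undersampling is applied before ranking, an order the abstract does not specify; the RF hyperparameters are fixed placeholders rather than values found by a genetic algorithm; and the function names `random_undersample` and `fit_f_rus_rf` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif


def random_undersample(X, y, rng):
    """Randomly drop rows until every class is as small as the rarest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep = rng.permutation(keep)  # shuffle so the classes are interleaved
    return X[keep], y[keep]


def fit_f_rus_rf(X, y, k=6, seed=0):
    """Hypothetical helper: balance classes, rank variables, fit an RF on the top k."""
    rng = np.random.default_rng(seed)

    # Assumption: undersampling precedes ranking (order not given in the abstract).
    X_bal, y_bal = random_undersample(X, y, rng)

    # Assumption: ANOVA F-statistic stands in for the study's F-score.
    f_scores, _ = f_classif(X_bal, y_bal)
    f_scores = np.nan_to_num(f_scores)  # constant columns yield NaN; rank them last
    top_k = np.argsort(f_scores)[::-1][:k]

    # Assumption: placeholder hyperparameters; the study tunes them with a
    # genetic algorithm, which is omitted here.
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_bal[:, top_k], y_bal)
    return model, top_k
```

In a real evaluation, undersampling and ranking would be fitted on the training split only, so that the reported sensitivity and specificity are measured on data the model has not seen.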