Human papillomavirus (HPV) and other risk factors contribute to cervical cancer, and early prediction and diagnosis are key to preventing it. This study aimed to identify and analyze the predictors of cervical cancer and to examine the problems posed by unbalanced datasets using several machine learning (ML) models. A multi-stage sampling strategy was used to recruit 501 women for the study. The educational intervention was video-assisted counseling, which consisted of two educational methods: a documentary film and face-to-face interaction with the women, followed by reminders. After baseline data were collected, the participants were encouraged to undergo Pap smear screening. Women with abnormal Pap tests were referred for biopsy. ML classification methods, namely Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Multi-layer Perceptron (MLP), and Naive Bayes (NB), were used to evaluate the unbalanced input and target datasets. Only 398 of the 501 women expressed interest in participating in the study, and only 298 were willing to undergo cervical screening. Atypical malignant cells were found on the cervix of 26 women with abnormal Pap tests. These women were referred for further tests, such as cervical biopsy, and seven were diagnosed with cervical cancer. LR in models 1, 2, and 4 achieved 88% to 94% sensitivity and 84% to 89% accuracy for cervical cancer prediction, whereas DT in models 3, 5, and 6 achieved 83% to 84% sensitivity and 84% to 88% accuracy. The NB and LR algorithms produced the highest area under the ROC curve on the testing dataset, whereas all models performed similarly on the training data. In the current study, the LR and DT algorithms were identified as the best-performing ML classifiers for detecting the significant predictors.
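
Reporting sensitivity alongside accuracy matters on unbalanced data, and a minimal Python sketch can illustrate why. It borrows only the screening counts quoted above (298 women screened, seven diagnosed) as an illustrative class ratio; the study's actual model inputs, features, and train/test splits are not described here. The "always negative" classifier and the helper names `confusion_counts`, `sensitivity`, and `accuracy` are hypothetical and are not part of the study's method.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = cancer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Share of actual positives that were detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, tn, fn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Illustrative class ratio only: 7 positives among 298 screened women.
y_true = [1] * 7 + [0] * 291

# Hypothetical baseline that never predicts cancer.
y_pred = [0] * 298

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
sens = sensitivity(tp, fn)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f}")
# prints: accuracy=0.977 sensitivity=0.000
```

A classifier that misses every cancer case still scores about 98% accuracy at this class ratio, so accuracy alone would be misleading here.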