This study introduces a predictive model that integrates clinical and lifestyle data to address the public health challenge of cervical cancer, particularly in regions lacking routine screening. The data first undergo preprocessing, comprising exploratory data analysis, missing-value imputation, and feature extraction. Feature selection is carried out using an XGBoost classifier. The data are then normalized, classes are balanced via oversampling, and the model is validated with stratified cross-validation. The optimized feature vector is used to train a LightGBM model. On a retrospective dataset of 858 patients from the Hospital Universitario de Caracas, Venezuela, comprising demographic, lifestyle, and medical history data, the LightGBM model achieves an accuracy of 98%, outperforming similar existing approaches. These results support the proposed data modelling framework and feature selection, as well as the choice of LightGBM as the classifier. The proposed framework can help healthcare professionals prioritize high-risk patients for further evaluation and intervention.
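The combination of oversampling and stratified cross-validation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the abstract does not name the oversampling technique, the number of folds, or whether oversampling was applied inside each fold, so random oversampling, five folds, the toy label vector, and the helper names `stratified_folds`, `random_oversample`, and `cv_splits` are all assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share of it.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until every class
    has as many samples as the largest class (assumed technique)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def cv_splits(labels, k=5, seed=0):
    """Yield (balanced_train_indices, test_indices) for each fold.
    Only the training part is oversampled; the test fold is left untouched."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield random_oversample(train_idx, labels, seed + i), test_idx


# Toy, made-up labels: 90 negatives and 10 positives (not the study's data).
labels = [0] * 90 + [1] * 10
splits = list(cv_splits(labels, k=5))
```

Oversampling only the training indices keeps duplicated minority samples out of the held-out fold, so each test fold retains the original class ratio. The balanced training indices could then be used to fit a classifier such as LightGBM.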