Abstract

This study aims to contribute to disease detection by analyzing a medical examination dataset of 123,968 samples. Based on association rule mining and related medical knowledge, six models were constructed to predict hyperuricemia prevalence and investigate its risk factors. Among the models compared, Lasso logistic regression, traditional logistic regression, and random forest achieve excellent prediction performance with interpretable results. The PCA logistic regression model also performs well, but its results are not directly interpretable. KNN's prediction performance is relatively poor, although dimensionality reduction significantly improves its AUC. SVC performs worst and is extremely inefficient on a large, high-dimensional dataset. The risk factors of hyperuricemia mainly belong to four categories: obesity-related factors, renal function factors, liver function factors, and myeloproliferative disease-related factors. Random forest, Lasso regression, and logistic regression all treat serum creatinine, BMI, triglyceride, fatty liver, and age as key predictive variables. The models also show that serum urea, serum alanine aminotransferase, negative urobilinogen, red blood cell count, white blood cell count, and pH are significantly correlated with the risk.
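The models are compared here by AUC, so a brief illustration of that metric may help. The following is a minimal sketch, not the study's code: it computes AUC as the fraction of (positive, negative) pairs that a model ranks correctly, with ties counted as half. The function name `auc_score`, the label coding (1 = hyperuricemia, 0 = not), and the toy scores are all assumptions made for illustration and are not taken from the dataset.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Hypothetical helper for illustration; labels are assumed to be
    1 (case) or 0 (non-case), scores are predicted risks.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (invented, not from the study): two hypothetical models
# scoring the same six subjects.
labels = [1, 0, 1, 0, 1, 0]
scores_a = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]  # separates the classes perfectly
scores_b = [0.6, 0.5, 0.4, 0.7, 0.8, 0.3]  # ranks some pairs incorrectly

auc_a = auc_score(labels, scores_a)
auc_b = auc_score(labels, scores_b)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

The pairwise loop is quadratic in the number of samples, so on a dataset of this size a rank-based implementation, such as the one provided by a standard machine learning library, would be the practical choice.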
