BackgroundAlthough papillary thyroid cancer (PTC) patients are known to have an excellent prognosis, up to 30% of patients experience disease recurrence after initial treatment. Accurately predicting disease prognosis remains a challenge given that the predictive value of several predictors remains controversial. Thus, we investigated whether machine learning (ML) approaches based on comprehensive predictors can predict the risk of structural recurrence for PTC patients.MethodsA total of 2244 patients treated with thyroid surgery and radioiodine were included. Twenty-nine perioperative variables consisting of four dimensions (demographic characteristics and comorbidities, tumor-related variables, lymph node (LN)-related variables, and metabolic and inflammatory markers) were analyzed. We applied five ML algorithms—logistic regression (LR), support vector machine (SVM), extreme gradient boosting (XGBoost), random forest (RF), and neural network (NN)—to develop the models. The area under the receiver operating characteristic (AUC-ROC) curve, calibration curve, and variable importance were used to evaluate the models’ performance.ResultsDuring a median follow-up of 45.5 months, 179 patients (8.0%) experienced structural recurrence. The non-stimulated thyroglobulin, LN dissection, number of LNs dissected, lymph node metastasis ratio, N stage, comorbidity of hypertension, comorbidity of diabetes, body mass index, and low-density lipoprotein were used to develop the models. All models showed a greater AUC (AUC = 0.738 to 0.767) than did the ATA risk stratification (AUC = 0.620, DeLong test: P < 0.01). The SVM, XGBoost, and RF model showed greater sensitivity (0.568, 0.595, 0.676), specificity (0.903, 0.857, 0.784), accuracy (0.875, 0.835, 0.775), positive predictive value (PPV) (0.344, 0.272, 0.219), negative predictive value (NPV) (0.959, 0.959, 0.964), and F1 score (0.429, 0.373, 0.331) than did the ATA risk stratification (sensitivity = 0.432, specificity = 0.770, accuracy = 0.742, PPV = 0.144, NPV = 0.938, F1 score = 0.216). The RF model had generally consistent calibration compared with the other models. The Tg and the LNR were the top 2 important variables in all the models, the N stage was the top 5 important variables in all the models.ConclusionsThe RF model achieved the expected prediction performance with generally good discrimination, calibration and interpretability in this study. This study sheds light on the potential of ML approaches for improving the accuracy of risk stratification for PTC patients.Trial registrationRetrospectively registered at www.chictr.org.cn (trial registration number: ChiCTR2300075574, date of registration: 2023-09-08).