Background
The prognosis of non-small cell lung cancer (NSCLC) is substantially affected by lymph node metastasis (LNM), but no noninvasive, inexpensive, and reasonably accurate method is available to predict LNM in NSCLC patients.

Methods
Clinical data on NSCLC patients were obtained from the Surveillance, Epidemiology, and End Results (SEER) database. Risk factors for LNM were identified using least absolute shrinkage and selection operator (LASSO) regression and multivariate logistic regression. Six machine learning models were built from these risk factors. The area under the receiver operating characteristic curve (AUC) was used to assess the performance of each model. Subgroup analyses by T stage were performed on the optimal model. A web-based LNM risk calculator for the optimal model was built on the Shinyapps.io platform.

Results
We enrolled 64,012 NSCLC patients, of whom 26,611 (41.57%) had LNM. Using multivariate logistic regression, we identified 10 independent risk factors for LNM: age, sex, race, histology, primary site, grade, T stage, M stage, tumor size, and bone metastases. The generalized linear model (GLM) was the optimal model among the six machine learning models in both the training and validation cohorts. Subgroup analyses showed that the GLM predicted LNM well across T stages. A web-based LNM risk calculator based on the GLM was published on the Shinyapps.io platform (https://wubopredict.shinyapps.io/dynnomapp/).

Conclusion
The GLM-based predictive model can be used to accurately predict the probability of LNM in NSCLC patients, and it proved effective in all subgroup analyses by T stage.
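As a rough illustration of how a logistic GLM turns risk factors into an LNM probability, and how AUC scores such predictions, the sketch below uses only the Python standard library. It is not the authors' implementation: the coefficients, the intercept, the three feature names, and the five toy patients are hypothetical values chosen for illustration, not estimates from the SEER data.

```python
import math


def predict_risk(intercept, coefs, features):
    """Predicted probability from a logistic model (binomial GLM, logit link)."""
    z = intercept + sum(coefs[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case; tied scores count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical coefficients, for illustration only (not fitted to any data).
INTERCEPT = -2.0
COEFS = {"tumor_size_cm": 0.15, "t_stage": 0.40, "bone_metastasis": 0.80}

# Toy patients: (features, observed LNM status). Entirely made up.
patients = [
    ({"tumor_size_cm": 1.5, "t_stage": 1, "bone_metastasis": 0}, 0),
    ({"tumor_size_cm": 4.0, "t_stage": 2, "bone_metastasis": 0}, 1),
    ({"tumor_size_cm": 2.0, "t_stage": 1, "bone_metastasis": 0}, 0),
    ({"tumor_size_cm": 6.5, "t_stage": 3, "bone_metastasis": 1}, 1),
    ({"tumor_size_cm": 3.0, "t_stage": 2, "bone_metastasis": 0}, 0),
]

scores = [predict_risk(INTERCEPT, COEFS, feats) for feats, _ in patients]
labels = [status for _, status in patients]
print("AUC on toy data:", auc(labels, scores))
```

The pairwise definition of AUC is used here because it is easy to verify by hand; on a cohort of tens of thousands of patients, a rank-based implementation from a statistics library would be the practical choice.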