Abstract

Background: Early identification of pediatric inflammatory bowel disease (IBD) can improve long-term prognosis, yet significant diagnostic delays persist. This study aims to develop an effective early identification tool using machine learning with non-invasive testing.

Methods: This retrospective study collected data from 314 pediatric patients (109 with IBD and 205 without IBD) hospitalized at Fudan University Children’s Hospital, which served as the training dataset. We developed machine learning classifiers using support vector machine, artificial neural network, extreme gradient boosting, decision tree, random forest, k-nearest neighbors, logistic model, and gradient boosting machine methods, incorporating easily obtainable clinical examination data. Participant data included age, sex, and three groups of features across different dimensions: IBD symptoms (diarrhea lasting ≥1 month; blood in stool for ≥1 week; recurrent perianal abscesses or fistulas; delayed growth; abdominal pain lasting ≥1 month; >10% weight loss; first-degree family history; arthritis, uveitis, or erythema nodosum without a definitive rheumatologic diagnosis; recurrent aphthous ulcers; unexplained fever), inflammatory biomarkers (elevated fecal calprotectin; elevated serum inflammatory markers; anemia; hypoalbuminemia; ANCA positivity), and transabdominal bowel ultrasound parameters (Limberg level >1 indicating bowel wall thickening with vascularity; mesenteric fat wrapping; bowel wall stratification disorder or loss; lymphadenopathy). We evaluated the performance of each feature group and of their combinations, optimized hyperparameters, performed internal validation, and then performed sequential validation with 66 prospectively recruited pediatric patients from the same center. The final model was selected based on optimal and stable performance in both the development and external validation cohorts, with Shapley values used to interpret feature importance.
Results: The support vector machine outperformed the other algorithms, achieving an area under the curve of 0.95 in both the internal and external validation datasets. The five most important features based on Shapley values were Limberg level >1, elevated ESR, elevated fecal calprotectin, recurrent perianal disease, and elevated CRP. Additionally, combining all three feature dimensions outperformed any single dimension or paired combination. The predicted probability of IBD showed moderate correlation with clinical, endoscopic, and histologic scores.

Conclusion: The machine learning model based on non-invasive data has the potential to become a low-cost and time-efficient tool for early identification of IBD. Ultrasound examination plays a significant role in distinguishing children with IBD from those without.
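The comparison of feature groups and their combinations can be illustrated with a minimal sketch. It is not the study's pipeline: the six patients below are synthetic, the feature names are hypothetical abbreviations of findings listed in the abstract, and a simple count of positive findings stands in for a trained classifier such as the support vector machine. Only the evaluation logic is shown, namely enumerating every non-empty combination of the three feature groups and computing a rank-based area under the ROC curve for each.

```python
from itertools import combinations

# Hypothetical feature groups; each name abbreviates a finding from the abstract.
FEATURE_GROUPS = {
    "symptoms": ["diarrhea_ge_1_month", "blood_in_stool_ge_1_week"],
    "biomarkers": ["elevated_fecal_calprotectin", "anemia"],
    "ultrasound": ["limberg_gt_1", "mesenteric_fat_wrapping"],
}

# Synthetic toy data (NOT from the study): one row of 0/1 findings per patient,
# in the order symptoms, biomarkers, ultrasound.
_FEATURE_ORDER = [f for g in ("symptoms", "biomarkers", "ultrasound")
                  for f in FEATURE_GROUPS[g]]
_ROWS = [
    (1, 0, 1, 0, 1, 1),
    (0, 1, 1, 1, 1, 0),
    (1, 0, 0, 0, 0, 1),
    (1, 0, 0, 0, 0, 0),
    (0, 0, 1, 0, 0, 0),
    (0, 0, 0, 0, 0, 0),
]
TOY_PATIENTS = [dict(zip(_FEATURE_ORDER, row)) for row in _ROWS]
TOY_LABELS = [1, 1, 1, 0, 0, 0]  # 1 = IBD, 0 = non-IBD


def auc(labels, scores):
    """Rank-based AUC: probability that a random positive case scores higher
    than a random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_combinations(groups):
    """Yield every non-empty combination of group names, smallest first."""
    names = sorted(groups)
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


def count_score(patient, combo, groups):
    """Toy stand-in for a classifier: number of positive findings in combo."""
    return sum(patient[f] for g in combo for f in groups[g])


def evaluate(patients, labels, groups=FEATURE_GROUPS):
    """Return {combination of group names: AUC} for all 7 combinations."""
    return {
        combo: auc(labels, [count_score(p, combo, groups) for p in patients])
        for combo in group_combinations(groups)
    }


if __name__ == "__main__":
    for combo, value in evaluate(TOY_PATIENTS, TOY_LABELS).items():
        print(" + ".join(combo), round(value, 3))
```

In a real analysis, `count_score` would be replaced by the cross-validated predictions of a fitted model, and the same loop would then report one AUC per feature-group combination.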