Abstract

Text mining techniques, particularly topic modeling, can be used to extract information automatically from medical reports. The ability to analyze texts autonomously and identify the topics within them can provide meaningful clinical insights that support physicians in diagnostic settings and enhance the characterization of intestinal diseases, leading to more efficient and automated systems. This study evaluates the effectiveness of Latent Dirichlet Allocation (LDA) and BERTopic in modeling topics from colonoscopy reports related to Crohn’s Disease, Ulcerative Colitis, and Polyps. We compared these models in terms of their ability to identify clinically relevant topics, their influence on the performance of machine learning classifiers trained on the derived topic features, and their scalability. Our analysis, based on results averaged across five iterations of train-test splits, showed that BERTopic generally outperformed LDA in clustering metrics, achieving Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Purity scores of 0.5637, 0.5953, and 0.8447, respectively, compared to LDA’s scores of 0.5349, 0.5254, and 0.8149. Additionally, classifiers trained on BERTopic-derived features exhibited improved predictive accuracy and F1-scores, with Logistic Regression reaching a mean accuracy of 0.8464 and a mean F1-score of 0.8507, compared to 0.8319 and 0.8351 for LDA-based features. Despite BERTopic’s overall superior performance, LDA demonstrated greater stability and interpretability, making it a viable option in scenarios where computational efficiency is a priority.
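As an illustration of the Purity metric reported above, the following minimal sketch computes it from a list of reference diagnoses and a list of topic assignments. It is not the study's code: the function name `purity`, the toy labels, and the topic ids are hypothetical, and the definition used is the standard one (the fraction of documents that belong to the majority class of their assigned cluster). ARI and NMI are not reimplemented here; scikit-learn provides them as `adjusted_rand_score` and `normalized_mutual_info_score`.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_ids):
    """Fraction of documents that fall in the majority class of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one document is required")

    # Count how often each reference label occurs inside each cluster.
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        by_cluster[cluster][label] += 1

    # Each cluster contributes the size of its most frequent label.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


# Hypothetical toy data: six reports, three diagnoses, three topics.
diagnoses = ["crohn", "crohn", "uc", "uc", "polyp", "polyp"]
topics = [0, 0, 0, 1, 2, 2]

score = purity(diagnoses, topics)
print(f"purity = {score:.4f}")
```

In this toy case five of the six reports sit in the majority class of their topic, so the purity is 5/6. Purity rises as the number of clusters grows, which is one reason it is usually reported alongside chance-corrected measures such as ARI and NMI.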
