Machine learning-based algorithms applied to drug prescriptions and other healthcare services in the Sicilian claims database to identify acromegaly as a model for the earlier diagnosis of rare diseases

Carlo Combi, Maria Cristina De Martino, Andrea Fontana, Beatrice Amico, Alessia Cozzolino, Daniele Gianfrilli, Giacomo Vitturi, Salvatore Crisafulli, Luca L’Abbate, Gianluca Trifirò

doi:10.1038/s41598-024-56240-w

Abstract

Acromegaly is a rare disease characterized by a diagnostic delay of 5 to 10 years from symptom onset. The aim of this study was to develop and internally validate machine learning algorithms that identify a combination of variables for the early diagnosis of acromegaly. This retrospective population-based study was conducted between 2011 and 2018 using data from the claims databases of the Sicily Region, in Southern Italy. To identify combinations of potential predictors of acromegaly diagnosis, conditional and unconditional penalized multivariable logistic regression models and three machine learning algorithms (i.e., the Recursive Partitioning and Regression Tree, the Random Forest and the Support Vector Machine) were used, and their performance was evaluated. The Random Forest (RF) algorithm achieved the highest area under the ROC curve (AUC), 0.83 (95% CI 0.79–0.87). The sensitivity in the test set, computed at the optimal threshold of predicted probabilities, ranged from 28% for the unconditional logistic regression model to 69% for the RF. Overall, the only predictor of diagnosis selected by all five models and algorithms was the number of immunosuppressant-related pharmacy claims. The other predictors selected by at least two models were subsequently combined in an unconditional logistic regression to develop a meta-score that achieved acceptable discrimination (AUC = 0.71, 95% CI 0.66–0.75). These findings suggest that data-driven machine learning algorithms may play a role in supporting the early diagnosis of rare diseases such as acromegaly.
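Two steps summarized in the abstract can be illustrated with a minimal Python sketch: keeping the predictors selected by at least two models, and computing sensitivity at an optimal probability threshold. The sketch is not the study's code. The function names and the model labels in the usage note are hypothetical, and the use of Youden's J statistic to define the "optimal" threshold is an assumption, since the abstract does not state which criterion was applied.

```python
from collections import Counter


def predictors_selected_by_at_least(selections, min_models=2):
    """Return predictors chosen by at least `min_models` of the fitted models.

    `selections` maps a model label to the list of predictors it selected.
    """
    counts = Counter(p for chosen in selections.values() for p in set(chosen))
    return sorted(p for p, n in counts.items() if n >= min_models)


def sensitivity_at_youden_threshold(y_true, y_prob):
    """Return (threshold, sensitivity) at the cut-off maximizing Youden's J.

    Assumption: "optimal threshold" is taken to mean the cut-off that
    maximizes sensitivity + specificity - 1. A case is predicted positive
    when its probability is greater than or equal to the threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_j, best_threshold, best_sensitivity = -1.0, None, None
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        sensitivity = tp / positives
        specificity = tn / negatives
        j = sensitivity + specificity - 1
        if j > best_j:  # ties keep the lowest threshold
            best_j, best_threshold, best_sensitivity = j, t, sensitivity
    return best_threshold, best_sensitivity
```

For example, calling `predictors_selected_by_at_least({"model_1": ["x", "a"], "model_2": ["x", "b"], "model_3": ["x", "a"]})` keeps `a` and `x` and drops `b`, which only one hypothetical model selected. The retained predictors would then be entered into a single logistic regression to form the meta-score.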

Full Text