Machine Learning Informed Diagnosis for Congenital Heart Disease in Large Claims Data Source

Ariane J Marelli, Chao Li, Aihua Liu, Hanh Nguyen, Harry Moroz, James M Brophy, Liming Guo, David L Buckeridge, Jian Tang, Archer Yang, Yue Li

doi:10.1016/j.jacadv.2023.100801

Abstract

Background: With increasing interest in using large claims databases in medical practice and research, efficiently identifying patients with the disease of interest is an essential step.

Objectives: This study aims to establish a machine learning (ML) approach to identify patients with congenital heart disease (CHD) in large claims databases.

Methods: We harnessed data from the Quebec claims and hospitalization databases from 1983 to 2000. The study included 19,187 patients. Of them, 3,784 were labeled as true CHD patients using a clinician-developed algorithm with manual audits, which was considered the gold standard. To establish an accurate ML-empowered automated CHD classification system, we evaluated ML methods including Gradient Boosting Decision Tree, Support Vector Machine, and Decision Tree, and compared them to regularized logistic regression. The Area Under the Precision Recall Curve was used as the evaluation metric. External validation was conducted on a data set updated to 2010 and containing different subjects.

Results: Among the ML methods we evaluated, Gradient Boosting Decision Tree led the performance in identifying true CHD patients, with an Area Under the Precision Recall Curve of 99.3%, a sensitivity of 98.0%, and a specificity of 99.7%. External validation returned similar statistics on model performance.

Conclusions: This study shows that a tedious and time-consuming clinical inspection for CHD patient identification can be replaced by an extremely efficient ML algorithm in large claims databases. Our findings demonstrate that ML methods can be used to automate complicated algorithms to identify patients with complex diseases.
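The three performance measures reported in the abstract (Area Under the Precision Recall Curve, sensitivity, and specificity) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the labels and scores below are made-up toy values, the 0.5 decision threshold is an assumption, and the area is estimated with the step-wise average precision, which is only one common estimator and may differ from the one used in the study. The function names are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Patients are ranked by decreasing score; precision is accumulated at
    every rank where a true positive is found, then averaged over all
    positives. Assumes at least one positive and no tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            total += tp / rank  # precision at this recall step
    return total / n_pos


# Toy data (illustrative only): 1 = true CHD, 0 = not CHD.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # hypothetical model outputs

threshold = 0.5  # assumed cut-off, not taken from the study
y_pred = [1 if s >= threshold else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auprc = average_precision(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUPRC={auprc:.3f}")
```

The precision-recall area is a natural choice when positives are a minority, as here (3,784 of 19,187 patients), because it does not reward a model for the large number of easily classified negatives.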

Full Text