Anaphylaxis is a severe life-threatening allergic reaction, and its accurate identification in healthcare databases can harness the potential of "Big Data" for healthcare or public health purposes. This study used claims data obtained between October 1, 2015 and February 28, 2019 from the CMS database to examine the utility of machine learning in identifying incident anaphylaxis cases. We created a feature selection pipeline to identify critical features between different datasets. Then a variety of unsupervised and supervised methods were used (eg, Sammon mapping and eXtreme Gradient Boosting) to train models on datasets of differing data quality, which reflects the varying availability and potential rarity of ground truth data in medical databases. Resulting machine learning model accuracies ranged from 47.7% to 94.4% when tested on ground truth data. Finally, we found new features to help experts enhance existing case-finding algorithms. Developing precise algorithms to detect medical outcomes in claims can be a laborious and expensive process, particularly for conditions presented and coded diversely. We found it beneficial to filter out highly potent codes used for data curation to identify underlying patterns and features. To improve rule-based algorithms where necessary, researchers could use model explainers to determine noteworthy features, which could then be shared with experts and included in the algorithm. Our work suggests machine learning models can perform at similar levels as a previously published expert case-finding algorithm, while also having the potential to improve performance or streamline algorithm construction processes by identifying new relevant features for algorithm construction.
Read full abstract