Abstract Quantification of circulating parathyroid hormone-related peptide (PTHrP) aids in the diagnosis of humoral hypercalcemia of malignancy. However, this test is often ordered in settings of low pre-test probability or mistakenly ordered when parathyroid hormone (PTH) was intended. To improve utilization, all PTHrP orders at our institutions are reviewed by a laboratory medicine resident. If an order appears inappropriate, the ordering physician is contacted for permission to cancel it. In this work, we attempted to automate this labor-intensive review process by developing machine learning models to predict which orders the physician would be willing to cancel. We collected 2171 PTHrP orders that were subjected to manual review over the past 10 years. After removing repeat orders, 1649 first-time orders remained. Each order was assigned a class label of 'canceled' or 'completed' based on the notes in the resident's documentation logs. For each order, we aggregated all data for the patient that existed in the laboratory information system at the time of the first order (n = 40 million). Several strategies were applied to impute missing data: retaining missingness as a feature, median fill, k-nearest neighbors, bagged trees, and linear regression. Class imbalance was corrected to a final 1:1 ratio using the synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The dataset was partitioned into a 70:30 training/testing split with five-fold cross-validation. Several machine learning algorithms were trained, including logistic regression, naive Bayes, random forest, and XGBoost. After training and cross-validation, the models were applied to the held-out test set, and performance was evaluated using the area under the receiver operating characteristic curve (AUC). XGBoost was the best performer at predicting the provider's likelihood of canceling the test, but with an AUC of only 0.63. Surprised by this poor performance, we devised a second classification task: predicting the PTHrP result (normal vs abnormal, threshold = 4.2 pmol/L) for the subset of orders that were completed (n = 1371). Using the same machine learning pipeline, XGBoost was again the best performer, this time with an AUC of 0.89. The striking difference in performance between the models trained on the two targets suggests that the physician's willingness to assent to our intervention may be unrelated to existing laboratory data or underlying biology. Likely explanations include that our intervention reaches a provider who is not the primary medical decision-maker, or reaches the correct provider at the wrong time, when they are too busy to revisit the details of the prior workup. In either case, a reflex order set that defines specific criteria for performing PTHrP upfront may be a more effective way to improve utilization.
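
For concreteness, below is a minimal sketch of the five missing-data strategies named in the abstract, assuming scikit-learn; the toy matrix `X` is a hypothetical stand-in for the aggregated laboratory features, and the exact estimators the authors used are not specified in the abstract.

```python
# Sketch of the missing-data strategies described above (scikit-learn).
# The matrix `X` is a hypothetical placeholder for the aggregated lab features.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer, MissingIndicator
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

# 1) Missingness as a feature: append binary indicator columns to X
indicator = MissingIndicator(features="all")
X_with_flags = np.hstack([X, indicator.fit_transform(X)])

# 2) Median fill
X_median = SimpleImputer(strategy="median").fit_transform(X)

# 3) k-nearest neighbors
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# 4) Bagged trees (iterative imputation with a bagged-tree regressor)
X_bag = IterativeImputer(
    estimator=BaggingRegressor(n_estimators=10, random_state=0)).fit_transform(X)

# 5) Linear regression (iterative imputation with a linear model)
X_lin = IterativeImputer(estimator=LinearRegression()).fit_transform(X)
```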
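
The modeling pipeline (1:1 upsampling, 70:30 split, five-fold cross-validation, four classifiers, AUC on the held-out test set) could be assembled roughly as follows. This is a sketch under assumptions, not the authors' code: the data are synthetic stand-ins via `make_classification`, and hyperparameters are illustrative defaults.

```python
# Minimal sketch of the modeling pipeline described in the abstract,
# assuming scikit-learn, imbalanced-learn, and xgboost.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE, ADASYN
from xgboost import XGBClassifier

# Synthetic stand-in for the 1649 first-time orders with an imbalanced label
X, y = make_classification(n_samples=1649, weights=[0.85], random_state=0)

# 70:30 train/test split, stratified on the canceled/completed label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Upsample the minority class to a 1:1 ratio on the training data only;
# ADASYN(random_state=0).fit_resample(...) is the drop-in alternative
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=0),
}

for name, model in models.items():
    # Five-fold cross-validation on the resampled training data
    cv_auc = cross_val_score(model, X_res, y_res, cv=5, scoring="roc_auc").mean()
    # Final evaluation on the untouched 30% test set
    model.fit(X_res, y_res)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: cv_auc={cv_auc:.2f}, test_auc={test_auc:.2f}")
```

One caveat on this sketch: cross-validating after resampling (as the abstract's ordering implies) can let synthetic neighbors leak across folds and inflate cross-validation AUC; resampling inside each fold, e.g. via imbalanced-learn's `Pipeline`, avoids this. The held-out test set is untouched by resampling either way.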