Abstract

A common cause of erroneous laboratory results is preanalytical contamination of samples with intravenous crystalloids. Rules-based approaches, such as delta checks or feasibility cut-offs, can detect these errors, but they rely on prior laboratory results and lack sensitivity. In this study, we aimed to improve the detection of contamination by the four most common crystalloids: normal saline (NS), lactated Ringer’s (LR), and their dextrose-containing counterparts (D5NS and D5LR), using two machine learning workflows. First, we employed a semi-supervised approach using manifold approximation and contrastive learning. To do this, we aggregated five years' worth of basic metabolic panel results (n = 67 million), then simulated contaminated samples by mixing random subsets of samples in silico with the crystalloid solutions at different ratios. The dataset was partitioned 70:30 into training and testing sets, with five-fold cross-validation. An autoencoder was trained to create an embedding of the data by minimizing the triplet loss, and this embedding was then projected onto a two-dimensional manifold. When the four crystalloid solutions and the independent test set were mapped to the manifold, decision boundaries were drawn at the threshold that maximized the F1 score, and labels were assigned. This method detected the 10% in silico mixtures with areas under the receiver operating characteristic curve (AUC) of 0.97, 0.94, 0.99, and 0.99 for NS, LR, D5NS, and D5LR, respectively. Next, we evaluated a fully supervised approach by training and validating Random Forest and XGBoost models on the same training and testing sets as above. Both models performed well, with XGBoost slightly outperforming Random Forest. The XGBoost model demonstrated AUCs of 0.992, 0.990, 0.998, and 0.996 for the NS, LR, D5NS, and D5LR mixtures. This performance was maintained even as the proportion of mixed samples in the dataset was decreased from 50% to 1% to more accurately reflect real-world contamination rates. For NS and LR, the variables with the highest relative importance were calcium and chloride; for their D5 counterparts, glucose and calcium were most important. Creatinine and blood urea nitrogen were the least important variables in both approaches. Overall, the fully supervised approach outperformed the contrastive embeddings for the specific task of identifying contamination from pre-defined crystalloids. One advantage of the semi-supervised approach is that it could be adapted to identify outliers that are not pre-defined. However, given the relative ease of translating tree-based models into practice, we conclude that our XGBoost model offers the most promising solution, and we plan to prospectively validate it for real-time detection of sample contamination.
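To make the in silico mixing step concrete, the sketch below simulates contamination under a simple linear mixing model. The DataFrame `bmp`, the column names, and the function name are hypothetical illustrations, not taken from the study, and the fluid compositions are approximate nominal values for these crystalloids rather than the exact vectors the authors used.

```python
# Minimal sketch of in silico crystalloid contamination, assuming a pandas
# DataFrame of basic metabolic panel (BMP) results with these hypothetical
# column names. Fluid compositions are approximate nominal values
# (Na/K/Cl/CO2 in mmol/L; glucose and calcium in mg/dL); BUN and
# creatinine are absent from all four fluids.
import numpy as np
import pandas as pd

ANALYTES = ["sodium", "potassium", "chloride", "co2",
            "bun", "creatinine", "glucose", "calcium"]

FLUIDS = {
    "NS":   [154.0, 0.0, 154.0, 0.0, 0.0, 0.0,    0.0, 0.0],
    "LR":   [130.0, 4.0, 109.0, 0.0, 0.0, 0.0,    0.0, 5.4],
    "D5NS": [154.0, 0.0, 154.0, 0.0, 0.0, 0.0, 5000.0, 0.0],
    "D5LR": [130.0, 4.0, 109.0, 0.0, 0.0, 0.0, 5000.0, 5.4],
}

def simulate_contamination(bmp: pd.DataFrame, fluid: str, ratio: float,
                           frac: float = 0.01, seed: int = 42) -> pd.DataFrame:
    """Mix a random subset of samples with a crystalloid at a fixed ratio.

    A ratio of 0.10 corresponds to the 10% mixtures in the abstract;
    frac sets the contamination prevalence (e.g. 0.01 for 1% of samples).
    """
    out = bmp.copy()
    out["label"] = "clean"
    idx = out.sample(frac=frac, random_state=seed).index
    # Linear mixing model: (1 - r) * patient result + r * fluid composition.
    out.loc[idx, ANALYTES] = ((1.0 - ratio) * out.loc[idx, ANALYTES].to_numpy()
                              + ratio * np.array(FLUIDS[fluid]))
    out.loc[idx, "label"] = fluid
    return out
```

The linear model assumes analyte concentrations dilute proportionally with the contaminating volume; it ignores any matrix effects, which keeps the simulation simple but is an assumption, not a claim about the authors' exact procedure.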
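A minimal sketch of the fully supervised workflow follows, reusing the hypothetical simulator above. The hyperparameters are illustrative, not the authors' tuned settings; `scale_pos_weight` is one common way to handle the ~1% prevalence, and the F1-maximizing threshold is shown here on the classifier scores, whereas the abstract describes that step on the manifold embedding.

```python
# Sketch of the supervised XGBoost workflow under the assumptions above.
# Requires scikit-learn and a recent xgboost (>= 1.6 for eval_metric in
# the constructor). `bmp`, ANALYTES, and simulate_contamination come from
# the previous (hypothetical) sketch.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve
from xgboost import XGBClassifier

data = simulate_contamination(bmp, fluid="NS", ratio=0.10, frac=0.01)
X = data[ANALYTES]
y = (data["label"] == "NS").astype(int)

# 70:30 train/test split, stratified to preserve the ~1% prevalence.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    # Reweight the rare positive class (clean-to-contaminated ratio).
    scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1),
    eval_metric="auc",
)
model.fit(X_tr, y_tr)

scores = model.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, scores))

# Choose the operating threshold that maximizes F1 on held-out scores.
prec, rec, thr = precision_recall_curve(y_te, scores)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
print("F1-maximizing threshold:", thr[np.argmax(f1[:-1])])
```

Per-fluid feature importances analogous to those reported in the abstract (calcium and chloride for NS/LR; glucose and calcium for the D5 fluids) could then be inspected via `model.feature_importances_`.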