Abstract

Background
The application of machine learning (ML) and artificial intelligence (AI) in precision medicine is of great interest and continues to grow. ML approaches offer a flexible, data-centric methodology for modeling diagnostic and prognostic trends, capable of identifying complex non-linear patterns and relationships between predictor variables and outcomes within large data sets. Although ML holds great promise, several common pitfalls must be avoided for ML models to provide clinical utility. In the current study, we provide a rigorous and systematic evaluation of the utility of ML algorithms to predict disease in several representative data sets generated via multiplexed biomarker assays. We compare five major classes of classification algorithms, contrasting ML approaches with more traditional combinatorial analysis (logistic regression) to offer perspective against the current gold standard.

Methods
We obtained five data sets from the published literature that include measurements of multiplexed biomarkers in patients with obstructive sleep apnea, sepsis, and breast and colon cancers in comparison to healthy controls. These data sets were generated by bead-based and planar multiplexed immunoassays. We chose representative ML algorithms from the classes of support vector machines (SVM), random forests (RF), neural networks (NNet), and extreme gradient boosting (xgBoost), with logistic regression (LR) as the comparator. Data sets were randomly split into internal validation and external validation sets via 10 iterations of permutation analysis. We tuned and validated the models using 5-fold cross-validation, then tested the tuned models on the external validation subsets (see the sketch following this abstract). We report the average of several model classification metrics, and their associated variances, across all resampled data sets.

Results
For classification of sepsis using inflammatory cytokines measured on a Luminex bead-based immunoassay, RF outperformed LR in external validation sets by a 9.8% increase in AUROC. For classification of breast cancer patients, xgBoost and RF performed best, with 20.0% and 16.9% increases in AUROC, respectively, whereas for classification of colon cancer patients, NNet outperformed LR by 11.5%. Across all data sets, on average, all ML algorithms outperformed LR in the internal validation sets by a 7.9% (range 6.3%–9.9%) increase in AUROC. In the external validation sets, however, only xgBoost and SVM outperformed LR, whereas RF and NNet showed signs of overfitting.

Conclusions
Overall, certain ML algorithms show improvement over LR for diagnostic applications using multiplexed assays, with up to a 20% increase in AUROC. Interpretation of ML results must be performed rigorously to prevent overly optimistic and unrealistic conclusions. In the context of laboratory medicine, our results showcase the utility of ML in clinical applications while highlighting its potential disadvantages. This analysis underscores the need to establish rigorous guidelines supporting data analytics during the development of novel multiplexed assays, which is needed to advance precision medicine.
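To make the validation protocol in Methods concrete, the following is a minimal sketch in Python using scikit-learn. The synthetic data, the two-model comparison, and the RF hyperparameter grid are hypothetical stand-ins for illustration only; they are not the study's actual data sets, model configurations, or tuning grids.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a multiplexed biomarker panel (e.g., cytokine
# concentrations); the study itself uses published immunoassay data sets.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Two of the five model classes, with an illustrative tuning grid for RF.
models = {
    "LR": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)), {}),
    "RF": (RandomForestClassifier(random_state=0),
           {"n_estimators": [100, 500], "max_depth": [None, 5]}),
}

results = {name: [] for name in models}
for seed in range(10):  # 10 random permutations of the internal/external split
    X_int, X_ext, y_int, y_ext = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    for name, (estimator, grid) in models.items():
        # Tune and internally validate with 5-fold cross-validation on the
        # internal subset only, then score the held-out external subset.
        search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
        search.fit(X_int, y_int)
        auroc = roc_auc_score(y_ext, search.predict_proba(X_ext)[:, 1])
        results[name].append(auroc)

# Report the mean AUROC and its variance across the 10 resampled splits.
for name, scores in results.items():
    print(f"{name}: mean AUROC {np.mean(scores):.3f} (var {np.var(scores):.4f})")
```

Nesting the tuning step inside each permutation keeps the external subset untouched during model selection; this separation is what exposes the gap between internal and external performance (i.e., the overfitting of RF and NNet) reported in Results.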