Abstract

We provide a pipeline for data preprocessing, biomarker selection, and classification of liquid chromatography–mass spectrometry (LC-MS) serum samples to generate a prospective diagnostic test for Lyme disease. We utilize tools of machine learning (ML), e.g., sparse support vector machines (SSVM), iterative feature removal (IFR), and k-fold feature ranking to select several biomarkers and build a discriminant model for Lyme disease. We report a 98.13% test balanced success rate (BSR) of our model based on a sequestered test set of LC-MS serum samples. The methodology employed is general and can be readily adapted to other LC-MS or metabolomics data sets.
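The balanced success rate (BSR) reported above is the mean of per-class recall, i.e., the average of sensitivity and specificity in the two-class case. A minimal sketch of the metric, with hypothetical labels (the function name and example data are illustrative, not from the study):

```python
import numpy as np

def balanced_success_rate(y_true, y_pred):
    """Balanced success rate (BSR): mean of per-class recall,
    i.e. (sensitivity + specificity) / 2 for a binary problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Hypothetical labels: 1 = Lyme disease, 0 = healthy control
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(balanced_success_rate(y_true, y_pred))  # sensitivity 0.75, specificity 0.75 -> 0.75
```

Unlike plain accuracy, BSR is insensitive to class imbalance between cases and controls, which is why it is the natural summary statistic for a diagnostic test.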

Highlights

  • We provide a pipeline for data preprocessing, biomarker selection, and classification of liquid chromatography–mass spectrometry (LC-MS) serum samples to generate a prospective diagnostic test for Lyme disease

  • After the untargeted selection in XCMS, we checked the data for missingness to identify features with missing values in more than 80% of training samples; none of the features met this criterion

  • Relative to the 44 LC-MS biomarkers discovered and the LASSO diagnostic developed in Molins et al. [5], our sparse support vector machines (SSVM) diagnostic shows an 8.35% increase in test sensitivity and a 5.00% increase in test specificity
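The missingness screen in the second highlight can be sketched as follows. This is a minimal sketch, assuming missing peak areas are encoded as NaN in a samples-by-features matrix; the function name, threshold argument, and example data are illustrative:

```python
import numpy as np

def drop_sparse_features(X, max_missing_frac=0.8):
    """Drop features (columns) whose fraction of missing values in the
    training matrix exceeds max_missing_frac. X: samples x features,
    with missing peak areas encoded as NaN."""
    missing_frac = np.mean(np.isnan(X), axis=0)
    keep = missing_frac <= max_missing_frac
    return X[:, keep], keep

# Toy matrix: 3 samples x 3 features; feature 1 is missing in all samples
X = np.array([[1.0, np.nan, 2.0],
              [np.nan, np.nan, 1.5],
              [3.0, np.nan, np.nan]])
X_kept, mask = drop_sparse_features(X)
print(mask)  # feature 1 exceeds the 80% missingness threshold and is dropped
```

In the study itself no feature crossed the 80% threshold, so this filter would have passed the training matrix through unchanged.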


Introduction

We provide a pipeline for data preprocessing, biomarker selection, and classification of liquid chromatography–mass spectrometry (LC-MS) serum samples to generate a prospective diagnostic test for Lyme disease. We begin with the hypothesis that feature vectors, i.e., the vectors of metabolite peak areas, for patients with Lyme disease and their healthy counterparts are separated in space when restricted to some reduced set of discriminatory biomarkers. This is the base assumption of sparse, or minimal-feature, models for feature selection. Multivariate models in statistics and ML, such as partial least squares-discriminant analysis (PLS-DA), kernel support vector machines, deep learning networks, and decision trees, can over-fit when training on data sets with many features and relatively few samples [12,13,14]. This may be mitigated through hyperparameter tuning: controlling the balance between training and validation accuracy in a cross-validation experiment. Using a sparsity-inducing penalty in the SSVM optimization problem reduces the number of parameters available to the model and serves to prevent over-fitting by regularizing the high-dimensional model.

