Abstract

Small-molecule metabolites are principal actors in myriad phenomena across biochemistry and serve as an important source of biomarkers and drug candidates. Given a sample of unknown composition, identifying the metabolites present is difficult because of the large number of small molecules, both known and yet to be discovered. Even for biofluids such as human blood, building reliable ways of identifying biomarkers is challenging. A workhorse method for characterizing individual molecules in such untargeted metabolomics studies is tandem mass spectrometry (MS/MS). MS/MS spectra provide rich information about chemical composition. However, structural characterization from spectra corresponding to unknown molecules remains a bottleneck in metabolomics. Current methods often rely on matching to pre-existing databases in one form or another. Here we develop a preprocessing scheme and a supervised topic modeling approach based on labeled latent Dirichlet allocation (LLDA). The approach identifies modular groups of spectrum fragments and neutral losses that correspond to chemical substructures, mapping spectrum features to known chemical structures. These structures can then be predicted when they appear in new, unknown spectra. We find that LLDA is an interpretable and reliable method for structure prediction from MS/MS spectra. Specifically, the LLDA approach has the following advantages: (a) molecular topics are interpretable; (b) a practitioner can select any set of chemical structure labels relevant to their problem; (c) LLDA performs well and can exceed the performance of other methods in predicting substructures in novel contexts.
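To make the idea of fragments and neutral losses acting as the "words" of a topic model more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the preprocessing scheme developed in this work: the function name `spectrum_to_tokens`, the `frag_`/`loss_` token format, the rounding precision, and the intensity threshold are all hypothetical choices.

```python
from collections import Counter


def spectrum_to_tokens(peaks, precursor_mz, decimals=2, min_rel_intensity=0.01):
    """Convert one MS/MS spectrum into a bag of fragment and neutral-loss tokens.

    peaks: list of (mz, intensity) pairs.
    precursor_mz: mass-to-charge ratio of the parent ion.
    All parameter defaults are illustrative assumptions.
    """
    if not peaks:
        return Counter()

    max_intensity = max(intensity for _, intensity in peaks)
    tokens = Counter()

    for mz, intensity in peaks:
        rel = intensity / max_intensity
        if rel < min_rel_intensity:
            continue  # drop low-intensity peaks as likely noise

        # Scale relative intensity to an integer "word count" (1..100).
        weight = max(1, round(100 * rel))

        # Fragment token: the peak's m/z rounded to a fixed precision.
        tokens[f"frag_{mz:.{decimals}f}"] += weight

        # Neutral-loss token: mass difference between precursor and fragment.
        loss = precursor_mz - mz
        if loss > 0:
            tokens[f"loss_{loss:.{decimals}f}"] += weight

    return tokens


# Toy spectrum with made-up peaks, for illustration only.
example = spectrum_to_tokens(
    [(91.054, 100.0), (119.049, 50.0), (50.0, 0.5)],
    precursor_mz=137.060,
)
print(sorted(example.items()))
```

Each spectrum then becomes a "document" of weighted tokens. A labeled topic model such as LLDA additionally needs a set of substructure labels per training spectrum, which this sketch does not cover.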

Highlights

  • Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is a powerful experimental method for identifying the small-molecule metabolites in a sample of unknown composition

  • Using cosine-distance k-nearest neighbors (k-NN) spectral library matching as a point of comparison, we find that the relative performance of labeled latent Dirichlet allocation (LLDA) improves as the test set becomes more chemically distinct from the training set and as the substructures being predicted appear with different frequencies between the two sets

  • Improved computational methods for identifying chemical structure from tandem mass spectrometry (MS/MS) spectra are needed for this promise to become a reality
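As a rough illustration of the cosine-distance k-NN spectral library matching mentioned in the highlights, the following Python sketch predicts substructure labels for a query spectrum by voting among its nearest library spectra. It is a simplified stand-in, not the implementation evaluated in this work: the sparse-dictionary feature format, the function names, the voting rule, and the toy feature keys and labels are all assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict_labels(query, library, k=3, min_votes=2):
    """Predict substructure labels by voting among the k most similar spectra.

    library: list of (features, labels) pairs, where features is a dict and
    labels is a set of substructure names. The min_votes rule is a
    hypothetical choice made for this sketch.
    """
    ranked = sorted(
        library,
        key=lambda entry: cosine_similarity(query, entry[0]),
        reverse=True,  # highest similarity = smallest cosine distance
    )
    votes = Counter()
    for _, labels in ranked[:k]:
        votes.update(labels)
    return {label for label, count in votes.items() if count >= min_votes}


# Toy library with made-up features and labels, for illustration only.
library = [
    ({"a": 1.0, "b": 1.0}, {"phenyl", "carboxyl"}),
    ({"a": 1.0}, {"phenyl"}),
    ({"c": 1.0}, {"amine"}),
]
print(knn_predict_labels({"a": 1.0, "b": 0.5}, library, k=2, min_votes=2))
```

Because this kind of matching can only return labels carried by spectra already in the library, its predictions depend on how closely the library resembles the query, which is the setting the highlight above contrasts with LLDA.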


Summary

Introduction

Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is a powerful experimental method for identifying the small-molecule metabolites in a sample of unknown composition. It provides detailed structural information about a given molecule, with the only prerequisite knowledge being the parent molecule’s mass-to-charge ratio. This is especially important since a vast portion of naturally occurring small molecules are believed to remain unidentified[1]. Identifying the structure of a molecule given its MS/MS spectrum remains challenging[2]. The repetitive nature of this task lends itself well to a computational approach[3].
