Abstract

Database search algorithms have historically been the de facto standard for inferring peptides from mass spectrometry (MS) data. These algorithms deduce peptides by transforming theoretical peptides into theoretical spectra and matching them against the experimental spectra using heuristic similarity-scoring functions. However, the heuristic nature of the scoring functions and the simple transformation of peptides into theoretical spectra, together with noisy mass spectra for less abundant peptides, can introduce a cascade of inaccuracies. In this paper, we design and implement a deep cross-modal similarity network called SpeCollate, which overcomes these inaccuracies by learning the similarity function between experimental spectra and peptides directly from labeled MS data. SpeCollate transforms spectra and peptides into a shared Euclidean subspace by learning fixed-size embeddings for both. The network trains on sextuplets of positive and negative examples coupled with our custom-designed SNAP-loss function. Online hardest negative mining is used to select the negative examples for optimal training performance. We train the network on 4.8 million sextuplets obtained from the NIST and MassIVE peptide libraries and demonstrate that, for closed search, SpeCollate outperforms Crux and MSFragger in the number of peptide-spectrum matches (PSMs) and unique peptides identified under 1% FDR on real-world data. SpeCollate also identifies a large number of peptides not reported by either Crux or MSFragger. To the best of our knowledge, SpeCollate is the first deep-learning network that can determine the cross-modal similarity between peptides and mass spectra for MS-based proteomics. We believe SpeCollate is significant progress toward machine-learning solutions for MS-based omics data analysis.
SpeCollate is available at https://deepspecs.github.io/.
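
To make the conventional database search described above concrete, the following is a minimal sketch, not the scoring used by Crux or MSFragger. It assumes singly charged b- and y-ions only, standard monoisotopic residue masses for a handful of amino acids, and a plain shared-peak count as a stand-in for a heuristic similarity score. The function names and the 0.02 Da tolerance are illustrative assumptions.

```python
# Monoisotopic residue masses (Da) for a few amino acids (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.010565   # mass of H2O (Da)
PROTON = 1.007276   # mass of a proton (Da)


def theoretical_spectrum(peptide):
    """Return sorted singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b-ion: prefix
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y-ion: suffix
    return sorted(peaks)


def shared_peak_count(experimental_mz, theoretical_mz, tol=0.02):
    """Count theoretical peaks that have an experimental peak within tol Da."""
    return sum(
        any(abs(t - e) <= tol for e in experimental_mz)
        for t in theoretical_mz
    )


def best_match(experimental_mz, candidates, tol=0.02):
    """Return the candidate peptide with the highest shared-peak count."""
    return max(
        candidates,
        key=lambda p: shared_peak_count(
            experimental_mz, theoretical_spectrum(p), tol
        ),
    )


# Illustrative use: a synthetic "experimental" spectrum built from PEPTK
# plus two made-up noise peaks.
observed = theoretical_spectrum("PEPTK") + [150.0, 300.5]
match = best_match(observed, ["GASVL", "PEPTK", "KTPEP"])
```

Both the spectrum simulation and the score here are deliberately simple, which is the kind of limitation the abstract attributes to conventional pipelines.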

Highlights

  • To date, mass spectrometry (MS) proteomics data is identified using database search algorithms purely based on numerical techniques (Fig 1)

  • As SpeCollate generates charge-independent peptide embeddings, we demonstrate that it performs better than Crux and MSFragger in terms of the number of peptide-spectrum matches (PSMs) and peptides identified under 1% FDR

  • Proteomics has entered the realm of Big-Data, and the number of available labeled and annotated spectra is increasing rapidly, enabling sophisticated models to be trained


Summary

Introduction

To date, mass spectrometry (MS) proteomics data is identified using database search algorithms based purely on numerical techniques (Fig 1). We discuss the shortcomings of these techniques, including the bounded performance of spectral simulators and unoptimized scoring heuristics, as well as the opportunities opened up by large data repositories of labeled spectra. Peptides and their corresponding MS/MS spectra lie in vastly distinct spaces. We argue that a more flexible technique that learns intermediate embeddings for both spectra and peptides could improve database search quality.
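
Once spectra and peptides are embedded in a shared Euclidean space, matching reduces to a nearest-neighbor search. The sketch below illustrates only that final step under stated assumptions: the three-dimensional embeddings are hand-written placeholders, not outputs of SpeCollate, and `nearest_peptides` is a hypothetical helper name.

```python
import math


def nearest_peptides(spectrum_embedding, peptide_embeddings, k=1):
    """Rank peptides by Euclidean distance to a spectrum embedding.

    Returns the k closest (peptide, distance) pairs, closest first.
    """
    scored = [
        (peptide, math.dist(spectrum_embedding, embedding))
        for peptide, embedding in peptide_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Placeholder embeddings (assumed values, for illustration only).
peptide_embeddings = {
    "PEPTK": [0.9, 0.1, 0.0],
    "GASVL": [0.0, 1.0, 0.2],
    "KTPEP": [0.1, 0.0, 1.0],
}
spectrum_embedding = [0.8, 0.2, 0.1]

top = nearest_peptides(spectrum_embedding, peptide_embeddings, k=2)
```

In a learned system the quality of this search depends entirely on the embeddings; the distance computation itself stays this simple.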

