Abstract

Background: Tandem mass spectrometry allows biologists to identify and quantify protein samples in the form of digested peptide sequences. When performing peptide identification, spectral library search is more sensitive than traditional database search but is limited to peptides that have been previously identified. An accurate tandem mass spectrum prediction tool is thus crucial in expanding the peptide space and increasing the coverage of spectral library search.

Results: We propose MS2CNN, a non-linear regression model based on deep convolutional neural networks. The features for our model are amino acid composition, predicted secondary structure, and physical-chemical features such as isoelectric point, aromaticity, helicity, hydrophobicity, and basicity. MS2CNN was trained with five-fold cross validation on a three-way data split on the large-scale human HCD MS2 dataset of Orbitrap LC-MS/MS downloaded from the National Institute of Standards and Technology. It was then evaluated on a publicly available independent test dataset of human HeLa cell lysate from LC-MS experiments. On average, our model shows better cosine similarity and Pearson correlation coefficient (0.690 and 0.632) than MS2PIP (0.647 and 0.601) and is comparable with pDeep (0.692 and 0.642). Notably, for the more complex MS2 spectra of 3+ peptides, MS2CNN is significantly better than both MS2PIP and pDeep.

Conclusions: We showed that MS2CNN outperforms MS2PIP for 2+ and 3+ peptides and pDeep for 3+ peptides. This implies that MS2CNN, the proposed convolutional neural network model, generates highly accurate MS2 spectra for LC-MS/MS experiments using Orbitrap machines, which can be of great help in protein and peptide identification. The results suggest that incorporating more training data into the deep learning model may improve performance.

Highlights

  • Tandem mass spectrometry allows biologists to identify and quantify protein samples in the form of digested peptide sequences

  • MS2CNN achieved a cosine similarity (COS) in the range of 0.57–0.79 and 0.59–0.74 for peptides of charge 2+ and charge 3+, respectively. These results suggest that MS2CNN significantly outperforms MS2PIP, especially for shorter peptide sequences for which abundant training data is available

  • Five-fold cross validation is used to determine the number of convolutional layers. Because there are significantly more charge 2+ than charge 3+ peptide sequences, the best layer number of MS2CNN is determined on charge 2+, after which the value is directly applied to charge 3+
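
The two evaluation metrics quoted above, cosine similarity (COS) and the Pearson correlation coefficient, can be illustrated with a minimal sketch. It assumes that a predicted and an observed spectrum have already been reduced to intensity vectors of equal length over the same fragment ions; the function names and the example vectors are illustrative and are not taken from MS2CNN.

```python
import math


def cosine_similarity(pred, obs):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(p * o for p, o in zip(pred, obs))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(o * o for o in obs))
    # An all-zero vector has no direction; report 0.0 rather than divide by zero.
    return dot / norm if norm else 0.0


def pearson_correlation(pred, obs):
    """Pearson correlation coefficient between two equal-length intensity vectors."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    # A constant vector has zero variance; report 0.0 in that case.
    return cov / (sd_p * sd_o) if sd_p and sd_o else 0.0


# Hypothetical fragment-ion intensities (not real data).
predicted = [0.10, 0.80, 0.30, 0.00, 0.55]
observed = [0.15, 0.70, 0.35, 0.05, 0.50]
print(cosine_similarity(predicted, observed))
print(pearson_correlation(predicted, observed))
```

Cosine similarity ignores overall intensity scale, while the Pearson coefficient additionally removes the mean, so the two scores can differ for the same pair of spectra.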


Summary

Introduction

Tandem mass spectrometry allows biologists to identify and quantify protein samples in the form of digested peptide sequences. There are two common approaches for protein identification: database search and spectral library search. The former uses a scoring function to search each tandem mass spectrum (or MS2 spectrum) acquired from experiments against theoretical spectra generated from all possible digested peptides (with trypsin in most cases) in the human proteome. The latter searches an MS2 spectrum against a spectral library, a collection of high-quality spectra of all identified peptides from previous experiments [2]. Because a spectral library covers only peptides that have been previously identified, it is necessary to develop methods for computational prediction or simulation of MS2 spectra from amino acid sequences to expand the size of a spectral library.
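
As a rough illustration of spectral library search, the sketch below matches a query spectrum against library entries whose precursor m/z lies within a tolerance and keeps the best-scoring one. Everything specific here is an assumption made for illustration: the dictionary layout, the `precursor_tol` value, the normalized dot product as the scoring function, the pre-binned intensity vectors, and the peptide names. Real search tools use more elaborate preprocessing and scoring.

```python
import math


def normalized_dot(a, b):
    """Normalized dot product of two equal-length binned intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_library(query, library, precursor_tol=0.02):
    """Return (peptide, score) of the best library match, or None if no candidate.

    Only entries whose precursor m/z is within precursor_tol of the query
    are scored; this tolerance is a hypothetical value.
    """
    best = None
    for entry in library:
        if abs(entry["precursor_mz"] - query["precursor_mz"]) > precursor_tol:
            continue
        score = normalized_dot(query["intensities"], entry["intensities"])
        if best is None or score > best[1]:
            best = (entry["peptide"], score)
    return best


# Hypothetical library; spectra are assumed to be binned on a shared m/z grid.
library = [
    {"peptide": "PEPTIDEK", "precursor_mz": 464.73, "intensities": [0.9, 0.1, 0.4, 0.0]},
    {"peptide": "PEPTLDEK", "precursor_mz": 464.74, "intensities": [0.1, 0.8, 0.0, 0.5]},
    {"peptide": "SAMPLER", "precursor_mz": 402.21, "intensities": [0.9, 0.1, 0.4, 0.0]},
]
query = {"precursor_mz": 464.73, "intensities": [0.8, 0.2, 0.5, 0.0]}
print(search_library(query, library))
```

In this picture, a predicted spectrum for a peptide that was never observed experimentally would simply be added as one more library entry, which is how a prediction tool extends library coverage.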

