Evaluation of open search methods based on theoretical mass spectra comparison

Dominique Tessier,Guillaume Fertin,Géraldine Jean,Albane Lysiak

doi:10.1186/s12859-021-03963-6

Abstract

Background: Mass spectrometry remains the privileged method to characterize proteins. Nevertheless, most of the spectra generated by an experiment remain unidentified after their analysis, mostly because of the modifications they carry. Open Modification Search (OMS) methods offer a promising answer to this problem. However, assessing the quality of OMS identifications remains a difficult task.

Methods: Aiming at better understanding the relationship between (1) similarity of pairs of spectra provided by OMS methods and (2) relevance of their corresponding peptide sequences, we used a dataset composed of theoretical spectra only, on which we applied two OMS strategies. We also introduced two appropriately defined measures for evaluating the above-mentioned spectra/sequence relevance in this context: one is a color classification representing the level of difficulty to retrieve the proper sequence of the peptide that generated the identified spectrum; the other, called LIPR, is the proportion of common masses, in a given Peptide Spectrum Match (PSM), that represent dissimilar sequences. These two measures were also considered in conjunction with the False Discovery Rate (FDR).

Results: According to our measures, the strategy that selects the best candidate by taking the mass difference between two spectra into account yields better quality results. Besides, although the FDR remains an interesting indicator in OMS methods (as shown by LIPR), it is questionable: indeed, our color classification shows that a non-negligible proportion of relevant spectra/sequence interpretations corresponds to PSMs coming from the decoy database.

Conclusions: The three above-mentioned measures allowed us to clearly determine which of the two studied OMS strategies outperformed the other, both in terms of number of identifications and of accuracy of these identifications. Even though quality evaluation of PSMs in OMS methods remains challenging, the study of theoretical spectra is a favorable framework for going further in this direction.

Highlights

Mass spectrometry remains the privileged method to characterize proteins
We successively implemented Strategy1 and Strategy2 to compare all the theoretical spectra generated from the human proteome (572,063 spectra) against a database merging the target and decoy human proteins (1,148,608 spectra)
To denote unambiguously the role that each theoretical spectrum can alternately play, we call it bait when it plays the role of an experimental spectrum and hit when it represents the theoretical spectrum modeled from the protein database
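
To make the bait/hit vocabulary concrete, here is a minimal sketch of how one theoretical spectrum can be compared against another. It is an illustration under stated assumptions, not the implementation used in the study: the function names `theoretical_spectrum` and `shared_peaks`, the restriction to singly charged b and y ions, the 0.02 Da tolerance, and the example peptides are all hypothetical choices made here. Only the monoisotopic residue masses are standard values.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # mass of H2O, carried by y ions


def theoretical_spectrum(peptide):
    """Return the sorted singly charged b- and y-ion m/z values of a peptide.

    Assumption: an unmodified peptide and only b/y fragments; a real
    search engine may model more ion types and charge states.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    prefix = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1): N-terminal prefixes
        prefix += m
        peaks.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal suffixes
        suffix += m
        peaks.append(suffix + WATER + PROTON)
    return sorted(peaks)


def shared_peaks(bait, hit, tol=0.02):
    """Count peaks of `bait` matched by a peak of `hit` within `tol` Da.

    Both inputs must be sorted; each peak is matched at most once
    (greedy two-pointer scan).
    """
    i = j = count = 0
    while i < len(bait) and j < len(hit):
        diff = bait[i] - hit[j]
        if abs(diff) <= tol:
            count += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    return count


# Hypothetical example: the bait plays the role of an experimental spectrum,
# the hit is a database spectrum that differs by one C-terminal residue.
bait = theoretical_spectrum("PEPTIDEK")
hit = theoretical_spectrum("PEPTIDER")
print(shared_peaks(bait, bait), shared_peaks(bait, hit))
```

In this toy case the bait shares all 14 of its peaks with itself but only the 7 b ions with the hit, because every y ion contains the substituted residue and is shifted by its mass difference. That loss of shared peaks is the kind of situation an OMS strategy has to cope with when a spectrum carries a modification.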

Summary

Introduction

Mass spectrometry remains the privileged method to characterize proteins. It remains frustrating to observe that, in spite of an abundant literature on the subject, most of the spectra generated by this analytical technique (namely, tens of thousands of spectra per hour of analysis) are left unidentified after their analysis by dedicated software. This low identification rate is likely due to the large proportion of spectra generated from proteins carrying modifications [3]. Some known modifications can be included in the modeling of reference spectra, but their number must remain low to circumscribe the search space.

Methods

Results

Discussion

Conclusion