Tandem mass spectrometry is an indispensable technology for identifying proteins in complex mixtures. Accurate and sensitive analysis of large volumes of mass spectral data is a principal challenge in proteomics. Conventional deep learning-based peptide identification models usually adopt an encoder-decoder framework and generate the target sequence from left to right, without fully exploiting global information. A few recent approaches employ two-pass decoding, yet they remain limited when the spectra are noisy. In this paper, we propose a new paradigm for improved peptide identification, which first retrieves a similar mass spectrum from the database as a reference and then revises the matched sequence according to the difference between the reference spectrum and the current context. The design is motivated by the observation that the retrieved peptide-spectrum pair provides a good starting point and indirect access to both past and future information, so that each revised amino acid can be produced with better noise perception and global understanding. Moreover, a disturb-based optimization process is introduced, which uses reinforcement learning to sharpen the attention over the difference vector before it is fed to the decoder. Experimental results on several public datasets demonstrate that the proposed method yields a substantial performance improvement. Remarkably, we achieve new state-of-the-art identification results on these datasets.
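The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed-width binning, the cosine similarity measure, the function names (`bin_spectrum`, `cosine`, `retrieve_reference`), and the toy peak lists are all assumptions made for illustration. The abstract does not specify how similarity or the difference information is computed.

```python
import math

def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Convert (m/z, intensity) peaks into a fixed-length intensity vector.
    Binning parameters are illustrative assumptions."""
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_reference(query_peaks, database):
    """Return (peptide, similarity, difference vector) for the most similar
    database entry. `database` is a list of (peptide, peaks) pairs."""
    query_vec = bin_spectrum(query_peaks)
    peptide, ref_peaks = max(
        database, key=lambda entry: cosine(query_vec, bin_spectrum(entry[1]))
    )
    ref_vec = bin_spectrum(ref_peaks)
    # Element-wise difference between the query and the retrieved reference;
    # a stand-in for the "difference information" a decoder could attend to.
    diff = [q - r for q, r in zip(query_vec, ref_vec)]
    return peptide, cosine(query_vec, ref_vec), diff

# Toy data (hypothetical peptides and peaks, not real spectra).
database = [
    ("PEPTIDE", [(100.2, 1.0), (250.7, 0.5)]),
    ("SAMPLER", [(300.1, 1.0), (410.9, 0.8)]),
]
query = [(100.4, 0.9), (250.3, 0.6), (700.0, 0.1)]  # last peak mimics noise
peptide, similarity, diff = retrieve_reference(query, database)
```

In this toy case the query shares two bins with the first entry, so that entry is retrieved, and the extra peak near m/z 700 survives only in the difference vector. A revision model would then need to decide whether such residual peaks are noise or evidence for changing the reference sequence.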