Abstract
Bottom-up and top-down are two complementary approaches to proteoform identification. Inferring proteoforms relies on searching mass spectra against an accurate proteoform sequence database. A customized protein sequence database derived from RNA-Seq data can improve identification of the proteoforms present in the species under study. However, the quality of the sequences in a customized database depends on the strategy used to construct it, and this affects the performance of mass spectrometry (MS) identification. In addition, the identification performance of bottom-up and top-down approaches with customized databases needs to be evaluated. Three customized databases were constructed, each with a different strategy. Two were based on translating assembled transcripts, with or without genomic annotation, and the third was a variant-extending protein database. When tested with bottom-up and top-down MS data separately, the variant-extending protein database identified not only the largest number of spectra but also alleles expressed simultaneously in diploid cells. An assembled database identified spectra missed by the reference database as well as amino acid (AA) alterations present in the studied species. The experimental results demonstrated that the proteoform sequences in an annotated database are better suited to identifying AA alterations and peptide sequences missing from the reference database. An unannotated database, used in place of a reference proteome database, achieves sufficiently high sensitivity in identifying mass spectra. The variant-extending reference database is the most sensitive for identifying mass spectra and single AA variants. Supplementary data are available at Bioinformatics online.