Abstract

Background: Tandem mass spectrometry-based database searching has become an important technology for peptide and protein identification. One of the key challenges in database searching is the remarkable increase in computational demand, brought about by the expansion of protein databases, semi- or non-specific enzymatic digestion, post-translational modifications and other factors. Some software tools use peptide indexing to accelerate processing. However, constructing a peptide index requires a large amount of time and space, especially for non-specific digestion, and the index is inflexible to use.

Results: We developed an algorithm based on the longest common prefix (ABLCP) to efficiently organize a protein sequence database. The longest common prefix is a data structure that is always coupled to the suffix array. ABLCP uses it to eliminate redundant candidate peptides in databases and to reduce the corresponding number of peptide-spectrum matching operations, thereby decreasing the identification time. The algorithm relies on a property of the longest common prefix. Enzymatic digestion poses a challenge to this property, but the algorithm can be adjusted to ensure that no candidate peptides are omitted. Compared with peptide indexing, ABLCP requires much less time and space for construction and is subject to fewer restrictions.

Conclusions: The ABLCP algorithm can help to improve data analysis efficiency. A software tool implementing this algorithm is available at http://pfind.ict.ac.cn/pfind2dot5/index.htm
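The idea of using the longest common prefix to skip redundant candidates can be illustrated with a small sketch. This is not the authors' ABLCP implementation: it assumes non-specific digestion only (every substring within a length range is a candidate), a `#` character as a hypothetical protein separator, and a naive suffix-array construction that is only suitable for toy inputs. The function names `suffix_array`, `lcp_array` and `distinct_peptides` are illustrative.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; fine for a toy database,
    # far too slow for a real protein database.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai's algorithm: lcp[r] is the length of the longest common prefix
    # of the suffixes at sa[r - 1] and sa[r]; lcp[0] is 0.
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def distinct_peptides(db, min_len, max_len, sep="#"):
    # Enumerate each candidate substring exactly once.
    # Prefixes of a suffix no longer than lcp[r] were already produced
    # from an earlier suffix in sorted order, so they are skipped.
    sa = suffix_array(db)
    lcp = lcp_array(db, sa)
    peptides = []
    for r, start in enumerate(sa):
        first_new = max(min_len, lcp[r] + 1)
        for length in range(first_new, max_len + 1):
            end = start + length
            if end > len(db) or sep in db[start:end]:
                break  # ran past the database or crossed a protein boundary
            peptides.append(db[start:end])
    return peptides


if __name__ == "__main__":
    toy_db = "MKPEPTIDEK#AAPEPTIDEK#MKPEPR"  # three made-up proteins
    candidates = distinct_peptides(toy_db, 2, 5)
    print(len(candidates), "distinct candidates, no duplicates:",
          len(candidates) == len(set(candidates)))
```

In this sketch, shared regions such as `PEPTIDEK` are scored once rather than once per protein. Handling enzymatic digestion would need the additional adjustments that the abstract mentions, which are not shown here.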

Highlights

  • Tandem mass spectrometry-based database searching has become an important technology for peptide and protein identification

  • We propose ABLCP, an algorithm based on the longest common prefix, to organize the database efficiently, retaining the speed advantage of peptide indexing while avoiding its high construction cost in time and space

  • Because ABLCP uses online digestion, it is subject to fewer restrictions


Introduction

Tandem mass spectrometry-based database searching has become an important technology for peptide and protein identification. One of the key challenges in database searching is the remarkable increase in computational demand, brought about by the expansion of protein databases, semi- or non-specific enzymatic digestion, post-translational modifications and other factors. Existing tools are not fast enough, for the following reasons. First, protein databases are growing significantly, which yields many more candidate peptides. Semi- or non-specific digestion generates 10 to 100 times more peptides than fully specific digestion. For example, in the IPI-Human V3.65 database [11], fully specific digestion with up to two missed cleavage sites generates 3,549,956 non-redundant peptides, and non-specific digestion increases this more than 170-fold to 626,871,441.
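The gap between specific and non-specific digestion can be seen even on a single short sequence. The following sketch is illustrative only: it assumes a simplified trypsin-like rule (cleave after K or R, with no proline exception), a made-up five-residue protein, and hypothetical function names. It does not reproduce the IPI-Human counts quoted above.

```python
def cleavage_sites(protein, residues="KR"):
    # Positions where the sequence may be cut, including both termini.
    # Simplified rule (an assumption): cut after every K or R.
    sites = [0]
    for i, aa in enumerate(protein):
        if aa in residues and i + 1 < len(protein):
            sites.append(i + 1)
    sites.append(len(protein))
    return sites


def specific_peptides(protein, missed=2, min_len=1, max_len=50):
    # Fully specific digestion: both ends fall on cleavage sites,
    # with at most `missed` internal sites left uncut.
    sites = cleavage_sites(protein)
    peptides = set()
    for a in range(len(sites) - 1):
        last = min(a + missed + 1, len(sites) - 1)
        for b in range(a + 1, last + 1):
            pep = protein[sites[a]:sites[b]]
            if min_len <= len(pep) <= max_len:
                peptides.add(pep)
    return peptides


def non_specific_peptides(protein, min_len=1, max_len=50):
    # Non-specific digestion: every substring in the length range.
    peptides = set()
    for start in range(len(protein)):
        for length in range(min_len, max_len + 1):
            if start + length > len(protein):
                break
            peptides.add(protein[start:start + length])
    return peptides


if __name__ == "__main__":
    toy = "AKRGK"  # made-up sequence
    print(len(specific_peptides(toy)), "specific peptides")
    print(len(non_specific_peptides(toy)), "non-specific peptides")
```

Every specific peptide is also a non-specific one, so the non-specific candidate set is always at least as large, and the difference grows quickly with sequence length.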
