Abstract
In proteomics, peptide information within mass spectrometry (MS) data from a specific organism sample is routinely matched against the protein sequence database that best represents that organism. However, if the species or strain in the sample is unknown or genetically poorly characterized, it becomes challenging to determine which database can represent the sample. Building customized protein sequence databases that merge multiple strains of a given species has become a strategy to overcome such restrictions. However, as more genetic information becomes publicly available and interesting genetic features such as the existence of pan- and core genes within a species are revealed, we questioned how efficient such merging strategies are at reporting relevant information. To test this, we constructed databases containing conserved and unique sequences for 10 different species. Features that are relevant for probability-based protein identification by proteomics were then monitored. As expected, the increase in database complexity correlates with pangenomic complexity. However, Mycobacterium tuberculosis and Bordetella pertussis generated very complex databases despite having low pangenomic complexity. We further tested database performance using MS data from eight clinical strains of M. tuberculosis and from two published datasets from Staphylococcus aureus. We show that by using an approach where database size is controlled by removing repeated identical tryptic sequences across strains/species, computational time can be reduced drastically as database complexity increases.
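The idea of controlling database size by removing repeated identical tryptic sequences can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `min_len` cutoff, and the toy sequences are hypothetical, and the digestion rule assumed here is the common one (cleave after K or R unless the next residue is P, with no missed cleavages).

```python
def tryptic_peptides(sequence, min_len=7):
    """In silico tryptic digest: cleave after K or R, except before P.

    Assumes no missed cleavages; min_len is an illustrative cutoff.
    """
    peptides = []
    start = 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_len]


def nonredundant_peptides(proteomes, min_len=7):
    """Collect each distinct tryptic peptide once across all strains.

    proteomes maps a strain name to a list of protein sequences
    (a hypothetical input layout chosen for this sketch).
    """
    seen = set()
    unique = []
    for proteins in proteomes.values():
        for protein in proteins:
            for peptide in tryptic_peptides(protein, min_len):
                if peptide not in seen:
                    seen.add(peptide)
                    unique.append(peptide)
    return unique


if __name__ == "__main__":
    toy = {"strain_1": ["AKRPGKC"], "strain_2": ["AKDDK"]}
    print(nonredundant_peptides(toy, min_len=1))
```

In this toy input the peptide shared by both strains is stored only once, so the search space grows with the number of distinct peptides rather than with the number of strains merged.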
Highlights
In the past decade, the development of Next-Generation Sequencing methods has made genome sequencing more affordable and more accessible.
The use of accurate protein sequence databases is a key step in most proteomic approaches, and this is critical for studies involving samples containing strains with little to no genetic information or samples containing multiple strains (Tanca et al, 2013).
This has been widely employed in proteogenomics, where “novel” sequences are inserted in the customized database and, if identified, are further used to validate and confirm proposed gene models and other genetic polymorphisms.
Summary
The development of Next-Generation Sequencing methods has made genome sequencing more affordable and more accessible. The use of accurate protein sequence databases is a key step in most proteomic approaches, and this is critical for studies involving samples containing strains with little to no genetic information or samples containing multiple strains (metaproteomics) (Tanca et al, 2013). In such cases, when it is difficult to establish a gold-standard annotation that can represent the sample under investigation, a viable alternative is to construct customized protein sequence databases which are searched against peptide sequence data collected by MS (Nesvizhskii, 2014). Database customization is often achieved using one of two strategies: (i) through a 6-frame translation of the genome of the strain (Fermin et al, 2006; Baerenfaller et al, 2008); or (ii) by constructing a database that merges ab initio gene predictions from related strains of the same species, taking into consideration variations caused by SNPs, indels, and divergent TSS choice, among others (de Souza et al, 2010; Omasits et al, 2017). These approaches are not mutually exclusive, as gene annotation from related strains can be used to further optimize 6-frame translation approaches (Castellana et al, 2008).
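Strategy (i) can be sketched in a few lines. This is an illustrative example rather than the method used in the cited studies: the function names and the frame labels (`+1` to `-3`) are hypothetical, and the standard genetic code is assumed (the bacterial table uses the same codon-to-amino-acid assignments and differs only in permitted start codons). Stop codons are written as `*` and incomplete or ambiguous codons as `X`.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three forward and three reverse-complement frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGAAATGA").items():
        print(label, protein)
```

A real database build would additionally split each frame at stop codons and keep only stretches above a length threshold. That step is omitted here to keep the sketch short.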