Abstract
Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. We describe an optimized proteogenomic workflow that uses ribosome profiling (ribo-seq) and proteomic data for Salmonella enterica serovar Typhimurium to identify unannotated proteins or alternative protein forms. The data analysis encompasses searching for cofragmenting peptides and postprocessing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. Applying this strategy yields greater proteome depth as well as greater confidence in unannotated peptide hits. We demonstrate the general applicability of our pipeline by reanalyzing public Deinococcus radiodurans data sets. Taken together, our results show that systematic reanalysis of available prokaryotic (proteome) data sets holds great promise to assist in experimentally based genome annotation.
IMPORTANCE Delineation of open reading frames (ORFs) causes persistent inconsistencies in prokaryote genome annotation. We demonstrate that advanced (re)analysis of omics data achieves higher proteome coverage and sensitive detection of unannotated ORFs. Both can be exploited for conditional bacterial genome (re)annotation, which is especially relevant given the wealth of prokaryotic genomes sequenced in recent years.
Highlights
Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity
Since we were striving to search the full complement of possible genomic open reading frames (ORFs), all theoretical ORFs with a minimal length of 30 nucleotides and initiated from canonical ATG or near-cognate GTG and TTG start codons of the S
We applied an optimized proteogenomics workflow for bacteria to identify unannotated protein-coding ORFs and correct existing annotations
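The ORF search space mentioned in the highlights above (all theoretical ORFs of at least 30 nucleotides that start at ATG, GTG, or TTG) can be illustrated with a small six-frame scan. This is a minimal sketch, not the authors' implementation. It assumes a linear sequence, the standard stop codons TAA, TAG, and TGA, and a length that counts the stop codon; none of these details are stated in the text. The function names `find_orfs` and `reverse_complement` are hypothetical.

```python
STARTS = {"ATG", "GTG", "TTG"}   # canonical and near-cognate start codons
STOPS = {"TAA", "TAG", "TGA"}    # assumed standard stop codons
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(genome, min_len=30):
    """List every start-to-stop ORF in all six reading frames.

    Returns tuples (strand, frame, start, end, sequence). Coordinates are
    0-based, end-exclusive, and relative to the scanned strand. The length
    includes the stop codon (an assumption of this sketch).
    """
    genome = genome.upper()
    orfs = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            open_starts = []  # start codons seen since the last in-frame stop
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if codon in STARTS:
                    open_starts.append(pos)
                elif codon in STOPS:
                    end = pos + 3
                    # Each upstream start yields its own candidate ORF, so
                    # alternative start sites of one gene are all reported.
                    for start in open_starts:
                        if end - start >= min_len:
                            orfs.append((strand, frame, start, end, seq[start:end]))
                    open_starts = []
    return orfs
```

Reporting every in-frame start upstream of a stop, rather than only the longest ORF, keeps alternative protein forms in the search space. The cost is a much larger candidate database.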
Summary
Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. In the widely adopted NCBI prokaryotic genome annotation pipeline, protein start site annotation, sequencing errors giving rise to interrupted genes, and the delineation of open reading frames (ORFs) based on homology or ab initio predictions remain persistent problems [5]. We describe an optimized proteogenomic workflow that uses ribosome profiling (ribo-seq) and proteomic data for Salmonella enterica serovar Typhimurium to identify unannotated proteins or alternative protein forms. The data analysis encompasses searching for cofragmenting peptides and postprocessing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. Comparing these predicted fragment ion intensities with those of the matched fragment ions helps machine learning models discriminate correct peptide-to-spectrum matches (PSMs) [20,21,22]. As such, these fragment intensity correlation features are especially useful for attaining higher confidence in novel (i.e., database unannotated) peptide identifications [23, 24]. We further elaborate how (ribo)proteogenomics is instrumental in reannotating ORFs, discovering novel ORFs across bacteria, and improving genome annotation in general.
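The fragment intensity comparison described above can be made concrete with a single correlation feature per PSM. This is a minimal sketch under stated assumptions: the text does not say which similarity measure is used, so a Pearson correlation is chosen here purely for illustration. The function name `intensity_correlation` and the ion labels such as "b2" and "y3" are hypothetical, and predicted ions with no matched peak are assumed to have an observed intensity of zero.

```python
import math


def intensity_correlation(predicted, observed):
    """Pearson correlation between predicted and observed fragment intensities.

    Both arguments map an ion label (e.g. "y3") to an intensity. Predicted
    ions missing from the observed spectrum count as intensity 0.0.
    Returns 0.0 when the correlation is undefined.
    """
    ions = sorted(predicted)
    if len(ions) < 2:
        return 0.0
    x = [predicted[ion] for ion in ions]
    y = [observed.get(ion, 0.0) for ion in ions]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat intensity profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical intensities for one candidate PSM.
predicted = {"b2": 0.2, "y3": 1.0, "y4": 0.5}
observed = {"b2": 0.4, "y3": 2.0, "y4": 1.0}
score = intensity_correlation(predicted, observed)
```

A value like `score` would be added to the other PSM quality features before machine-learning-based rescoring. Because the correlation ignores absolute scale, no intensity normalization is needed for this particular feature.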