Abstract

We present GenoSuite, an integrated proteogenomic pipeline to validate, refine and discover protein coding genes using high-throughput mass spectrometry (MS) data from prokaryotes. To demonstrate the effectiveness of GenoSuite, we analyzed proteomics data of Bradyrhizobium japonicum (USDA110), a model organism for studying the agriculturally important rhizobium-legume symbiosis. Our analysis confirmed 31% of known genes, refined the translation initiation site (TIS) of 49 gene models and discovered 59 novel protein coding genes. Notably, we discovered a novel protein that redefines the boundary of a crucial cytochrome P450 system-related operon known to be highly expressed in anaerobic symbiotic bacteroids. A focused analysis of N-terminally acetylated peptides indicated a downstream TIS for gene blr0594. Finally, ortho-proteogenomic analysis revealed three novel genes in the recently sequenced B. japonicum USDA6(T) genome. The discovery of a large number of missing genes and the correction of gene models have expanded the proteomic landscape of B. japonicum and demonstrate the utility of proteogenomic analyses and the versatility of GenoSuite for annotating prokaryotic genomes, including those of pathogens.

Highlights

  • We developed an automated pipeline, GenoSuite, to carry out genome translation, database searches using multiple search engines, result integration based on the statistical significance of peptide-spectrum matches (PSMs), False Discovery Rate (FDR) calculations, coordinate mapping, and discovery of completely novel genes

  • To check the consistency of the Combined FDRScore calculation implemented in GenoSuite, we compared the Combined FDRScores calculated by GenoSuite and FDRapp (37) from OMSSA and X!Tandem search results

  • Peptides from FDR-filtered PSMs are mapped onto the genome to report novel protein coding regions (NPCRs) and gene model changes
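
The genome translation and coordinate mapping steps named above can be illustrated with a minimal sketch. This is not GenoSuite's code: the function names (`reverse_complement`, `translate`, `map_peptide`) and the toy sequence are hypothetical. It assumes the standard codon assignments, exact string matching of the peptide, and 0-based half-open coordinates reported on the forward strand.

```python
# Illustrative sketch only; names and conventions are assumptions, not GenoSuite's API.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon assignments, built in TCAG order.
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def map_peptide(genome, peptide):
    """Find a peptide in all six reading frames.

    Returns (strand, start, end) tuples with 0-based, half-open
    coordinates expressed on the forward strand.
    """
    hits = []
    n = len(genome)
    for strand, dna in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(dna[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets back to forward coordinates.
                    start, end = n - end, n - start
                hits.append((strand, start, end))
                pos = protein.find(peptide, pos + 1)
    return hits


# Toy example: the peptide MAKG is encoded at positions 2..14 of this sequence.
toy_genome = "CCATGGCTAAAGGTTAAG"
forward_hits = map_peptide(toy_genome, "MAKG")
reverse_hits = map_peptide(reverse_complement(toy_genome), "MAKG")
```

In a real analysis, the mapped coordinates would then be compared against annotated gene boundaries: a peptide falling outside every annotated gene suggests a novel coding region, while one upstream of an annotated start in the same frame suggests a gene model change.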


Summary

Introduction

The method of harnessing mass spectrometry proteomic data to annotate genomes is generally referred to as proteogenomics (8). This approach has been successfully applied to re-annotate several genomes (1, 9–12) and to improve annotations of taxonomic groups larger than a single bacterium (1, 13). Multialgorithmic search approaches have been shown to increase sensitivity and specificity in large-scale proteomic studies (15, 16) but are difficult to carry out in a proteogenomic context. This is because of the lack of automated software for proteogenomic analyses that incorporates multiple search engines without compromising the statistical robustness of individual algorithms. The primary host of B. japonicum is soybean, an economically important crop and a model system for studying rhizobia-legume symbiosis. This bacterium has a 9.1 Mb genome, one of the largest among bacteria, with 64.1% GC content (17). We identified 59 novel protein coding regions (NPCRs) and corrected annotations for 49 genes.

