Abstract

In the past, short protein-coding genes were often disregarded by genome annotation pipelines. Transcriptome sequencing (RNAseq) signals outside of annotated genes have usually been interpreted to indicate either ncRNA or pervasive transcription. Therefore, in addition to the transcriptome, the translatome (RIBOseq) of the enteric pathogen Escherichia coli O157:H7 strain Sakai was determined under two optimal growth conditions and a severe stress condition combining low temperature and high osmotic pressure. All intergenic open reading frames potentially encoding a protein of ≥ 30 amino acids were investigated with regard to coverage by transcription and translation signals and their translatability, expressed by the ribosomal coverage value. This led to the discovery of 465 unique, putative novel genes not yet annotated in this E. coli strain, which are evenly distributed over both DNA strands of the genome. For 255 of the novel genes, annotated homologs in other bacteria were found, and a machine-learning algorithm, trained on small protein-coding E. coli genes, predicted that 89% of these translated open reading frames represent bona fide genes. The remaining 210 putative novel genes without annotated homologs were compared to the 255 novel genes with homologs and to 250 short annotated genes of this E. coli strain. All three groups turned out to be similar with respect to their translatability distribution, fractions of differentially regulated genes, secondary structure composition, and the distribution of evolutionary constraint, suggesting that both novel groups represent legitimate genes. However, the machine-learning algorithm recognized only a small fraction of the 210 genes without annotated homologs. It is possible that these genes represent a novel group of genes with unusual features that differ from those of the genes in the machine-learning training set.
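The open reading frame screen described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the set of start codons, the choice of the first start codon after the preceding stop, and the names `find_orfs` and `ribosomal_coverage_value` are assumptions made for illustration only. The ribosomal coverage value is assumed here to be the ratio of normalized RIBOseq coverage to normalized RNAseq coverage for the same ORF; the abstract itself does not give the formula. Restricting candidates to intergenic regions would additionally require the existing annotation and is left out.

```python
from typing import Iterator, NamedTuple, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
MIN_CODONS = 30                       # threshold from the abstract (>= 30 amino acids)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    strand: str    # "+" or "-"
    start: int     # 0-based offset on the scanned strand
    end: int       # exclusive offset, stop codon included
    n_codons: int  # coding codons, stop codon excluded


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = MIN_CODONS) -> Iterator[Orf]:
    """Yield ORFs of at least min_codons codons in all six reading frames."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # first start codon after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield Orf(strand, start, i + 3, n_codons)
                    start = None


def ribosomal_coverage_value(ribo_cov: float, rna_cov: float) -> Optional[float]:
    """Assumed definition: translation coverage divided by transcription coverage."""
    if rna_cov <= 0:
        return None  # no transcription signal, ratio undefined
    return ribo_cov / rna_cov
```

Coordinates for the minus strand are reported relative to the reverse-complemented sequence, so they would need to be mapped back to genome coordinates before comparison with an annotation.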

Highlights

  • The pathogenic E. coli strain O157:H7 Sakai (EHEC) was first isolated in 1996 from an outbreak in Japan [1]

  • The novel putative genes were consecutively numbered in the order they appear in the EHEC genome (XECs001 to XECs465)

  • 130 novel genes were detected in Salmonella [40] and 72 novel genes were detected in EHEC strain EDL933 [11]



Introduction

The pathogenic E. coli strain O157:H7 Sakai (EHEC) was first isolated in 1996 from an outbreak in Japan [1]. In addition to humans [3] and contaminated food, EHEC persists in many environments, such as soil [4], plants [5], invertebrates [6], and cattle [7]. These environments pose different challenges, each requiring expression of a different set of bacterial genes [8]. After a bacterial genome has been sequenced, bioinformatics tools such as GLIMMER [12] or RAST [13] are used for gene prediction and annotation. Although small proteins have recently come more into focus [18, 19], the majority of them still belong to the ‘dark proteome’, lacking known folds or domains, which renders putative functional assignments using bioinformatics tools impossible [20, 21].
