Abstract

Emerging evidence places small proteins (≤50 amino acids) more centrally in physiological processes. Yet their functional identification and the systematic genome annotation of their cognate small open-reading frames (smORFs) remain challenging both experimentally and computationally. Ribosome profiling, or Ribo-Seq (deep sequencing of ribosome-protected fragments), enables detection of actively translated open-reading frames (ORFs) and empirical annotation of coding sequences (CDSs) using the in-register translation pattern that is characteristic of genuinely translating ribosomes. Multiple ORF identifiers that use the 3-nt periodicity in Ribo-Seq data sets have been successful in eukaryotic smORF annotation. However, they have difficulty evaluating prokaryotic genomes because of their unique architecture (e.g. polycistronic messages, overlapping ORFs, leaderless translation and non-canonical initiation). Here, we present a new algorithm, smORFer, which detects putative smORFs in prokaryotic organisms with high accuracy. The unique feature of smORFer is its integrated approach: it considers structural features of the genetic sequence along with in-frame translation and uses a Fourier transform to convert these parameters into a measurable score to faithfully select smORFs. The algorithm is executed in a modular way, and depending on the data available for a particular organism, different modules can be selected for the smORF search.
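To illustrate how a Fourier transform can turn 3-nt periodicity into a single number, the sketch below computes the share of spectral power that falls at a period of exactly three nucleotides in a per-position read-count profile. It is a minimal illustration only: the function name `periodicity_score`, the one-sided power ratio and the numerical tolerance are assumptions made for this example, and the text above does not specify how smORFer itself computes its score.

```python
import cmath
import math


def periodicity_score(counts):
    """Share of one-sided, non-DC spectral power at a period of 3 nt.

    `counts` is a hypothetical per-nucleotide read-count profile of one ORF.
    This is an illustrative score, not the published smORFer formula.
    """
    # Trim to a multiple of 3 so that frequency 1/3 lands on an exact DFT bin.
    n = len(counts) - len(counts) % 3
    if n < 6:
        raise ValueError("need at least 6 positions")
    signal = counts[:n]

    def power(k):
        # Squared magnitude of the k-th discrete Fourier coefficient.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        return abs(coeff) ** 2

    # Bins 1..n//2 cover every distinct non-zero frequency of a real signal.
    spectrum = {k: power(k) for k in range(1, n // 2 + 1)}
    total = sum(spectrum.values())

    # Treat a flat (or empty) profile as non-periodic; the tolerance guards
    # against floating-point noise being mistaken for signal.
    energy = sum(x * x for x in signal)
    if total <= 1e-12 * n * energy:
        return 0.0
    return spectrum[n // 3] / total
```

Under these assumptions, a profile that repeats every three positions, such as `[10, 1, 1] * 10`, scores close to 1, whereas a profile that alternates every two positions scores close to 0.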

Highlights

  • Next-generation sequencing (NGS) technologies enable rapid and easy detection of genomic information of new species

  • Because the availability of sequencing data (DNASeq, Ribo-Seq, TIS-Ribo-Seq) varies widely between organisms, we sought to develop an algorithm, smORFer, with a modular design that uses various data sets to detect putative small open reading frames (smORFs). smORFer combines three modules which utilize different inputs and can be used independently or in combination to increase the confidence in smORF annotation (Figure 1)

  • Designed for de novo smORF annotation from various data sets, smORFer efficiently predicts smORFs that have a high probability of being expressed

Introduction

Next-generation sequencing (NGS) technologies enable rapid and easy detection of genomic information of new species. After the pioneering effort of Fickett to unify concepts on how to define protein-coding sequences [1], further criteria have been added to increase the confidence in de novo identifications. These include intrinsic signals involved in gene specification (e.g. start and stop codons, splice sites), conservation patterns in related genomes with conservation weighted by evolutionary distance, and verification against known ORFs or protein sequences [2,3]. These rules in genome annotation protocols perform well only on larger ORFs that span at least 100 codons [4,5]; small ORFs (smORFs) shorter than 100 codons are systematically underrepresented and cannot be identified by common algorithms [6]. Systematic identification of functional small proteins or microproteins (also called micropeptides) remains challenging both experimentally and computationally.
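The intrinsic signals mentioned above (start codon, in-frame stop codon, length below 100 codons) can be sketched as a simple sequence scan. The example below is an assumption-laden illustration and not the smORFer implementation: the function name `find_smorf_candidates`, the forward-strand-only search, the set of start codons (ATG, GTG and TTG) and the minimum length of 3 codons are all choices made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_smorf_candidates(seq, starts=("ATG", "GTG", "TTG"),
                          min_codons=3, max_codons=100):
    """Return (start, end, frame) tuples for forward-strand candidate smORFs.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    Length is counted in codons, including the start and excluding the stop.
    Every start codon is reported, so nested candidates that share one stop
    codon appear as separate entries.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in starts:
            continue
        # Walk downstream codon by codon until the first in-frame stop.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                n_codons = (j - i) // 3
                if min_codons <= n_codons < max_codons:
                    hits.append((i, j + 3, i % 3))
                break
    return hits
```

A scan of this kind only enumerates candidates from sequence alone; it says nothing about whether a candidate is translated, which is where experimental evidence such as Ribo-Seq comes in.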
