Abstract

BG7 is a new system for de novo annotation of bacterial, archaeal and viral genomes, based on an approach specifically designed for genomes sequenced with next generation sequencing (NGS) technologies. The system is versatile and can annotate genes even at the preliminary assembly stage. It is especially efficient at detecting unexpected genes horizontally acquired from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating the ORF prediction and functional annotation phases in a single step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation frequently found in genomes that are not fully assembled. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally on any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated during the 2011 EHEC outbreak in Germany, in which BG7 was used to obtain the first annotations the day after the first enterohemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been demonstrated for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Moreover, thanks to its plasticity, our system could easily be adapted to work with new technologies in the future.

Introduction

The massive production of bacterial genome sequences using NGS technologies demands new automated systems capable of producing an accurate annotation of a complete genome in a short time. Classical ORF prediction methods are totally dependent on the detection of start and stop codons. This is an efficient strategy whenever the sequencing technology produces minimal sequence errors, since only predicted ORFs have to be annotated, which avoids a large number of unnecessary comparisons. However, all NGS technologies generate sequences with substitution, deletion and insertion errors. Each technology is prone to different types of errors: 454 technology generates deletions and insertions in homopolymeric regions [1], whilst SOLiD and Illumina technologies [2] generate substitutions, especially when coverage is not sufficient to determine the exact base at each position in the final consensus. The read lengths of thousands of bases provided by third-generation technologies such as PacBio open up new strategies for assembly.
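
To make the dependence on start and stop codons concrete, the following minimal Python sketch implements a naive forward-strand ORF scanner and applies it to a toy contig. It is an illustration only, not BG7 code and not any published gene finder: the function `find_orfs`, the `min_codons` parameter, and the toy sequence are all assumptions introduced here. The sketch shows how a single-base deletion in a homopolymer run (the kind of error attributed above to 454 reads) or a single substitution in the stop codon can make a start/stop-based method lose a gene entirely.

```python
# Illustrative sketch only: a naive start/stop-codon ORF scanner.
# All names and the toy sequence are hypothetical, not taken from BG7.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs in all three frames.

    An ORF here is the first ATG in a frame followed by an in-frame stop
    codon; `end` is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Toy gene: ATG GCT AAA AAA GGT CTG CCG TAA, embedded in short flanks.
gene = "ATGGCTAAAAAAGGTCTGCCGTAA"
contig = "CC" + gene + "GCGC"

# Error-free consensus: the ORF is found.
print(find_orfs(contig))       # [(2, 26)]

# Deletion of one A inside the homopolymer run: the frameshift
# moves the stop codon out of frame and the ORF is lost.
deleted = contig[:8] + contig[9:]
print(find_orfs(deleted))      # []

# Substitution in the stop codon (TAA -> CAA): the ORF never closes.
substituted = contig[:23] + "C" + contig[24:]
print(find_orfs(substituted))  # []
```

In both error cases most of the coding sequence is still present in the contig, yet the scanner reports nothing, because it relies on an intact reading frame and an intact stop codon. A similarity-driven approach, as described in the abstract, would not rest on that assumption.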
