Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation

Kevin R Ramkissoon, Amit K Galande, Martha G Bomar, Douglas S Watson, Sunil Ojha, Jennifer K Miller, Alexander G Shearer, John Parkinson

doi:10.1371/journal.pone.0084508

Abstract

The power of genome sequencing depends on the ability to understand what the sequenced genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately, this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology: “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass spectrometry-based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation.

Highlights

The advent of high-throughput genome sequencing technologies has resulted in an unprecedented increase in the rate of microbial genome sequencing.
To identify candidate orphans for addressing the question “Can we shortcut the most time-consuming parts of the work?”, we used a list of 1,123 putative orphan enzyme activities and their associated data (Shearer et al., manuscript in preparation).
We searched the published literature (1986–2011) using PubMed and Google Scholar to identify recent and continuing experimental work that might suggest the availability of protein samples that could be readily obtained for mass spectrometry analysis.

Summary

Introduction

The advent of high-throughput genome sequencing technologies has resulted in an unprecedented increase in the rate of microbial genome sequencing. Over 500 newly completed and annotated genomes were released via the NCBI site in 2011 alone, about 1.37 genomes per day. As genomes are sequenced, automated methods are used to identify open reading frames, translate protein sequences, and assign function by transfer from a homolog using simple pairwise sequence comparisons [1]. These automated functional annotations have been shown to have high error rates, ranging from 30% to as high as 80% in some superfamilies [2,3]. A lack of high-quality annotations can have wide-ranging impacts, from gene or protein identification limitations in large-scale genetic and proteomic studies to failures in modeling the biology of novel organisms.
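To make the idea of annotation by transfer from a homolog concrete, the following is a minimal sketch and not the pipeline used by any particular annotation service. It assumes a toy reference library of invented sequences and function labels, and it uses Python's `difflib.SequenceMatcher` as a crude stand-in for a real pairwise alignment score; the names `similarity`, `transfer_annotation`, `REFERENCE`, and the 0.4 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy library: id -> (sequence, function label). Invented for illustration.
REFERENCE = {
    "ref1": ("MKTAYIAKQRQISFVKSHFSRQ", "hypothetical kinase"),
    "ref2": ("MSDNELLKKAVEGLRRLGYTPE", "hypothetical dehydrogenase"),
}


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; an assumed stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def transfer_annotation(query, reference, min_similarity=0.4):
    """Return (function, ref_id, score) for the best-scoring reference entry,
    or None if the library is empty or the best score is below the threshold."""
    best = None
    for ref_id, (seq, function) in reference.items():
        score = similarity(query, seq)
        if best is None or score > best[2]:
            best = (function, ref_id, score)
    if best is None or best[2] < min_similarity:
        return None
    return best


if __name__ == "__main__":
    # A query identical to ref1 inherits ref1's label.
    print(transfer_annotation("MKTAYIAKQRQISFVKSHFSRQ", REFERENCE))
    # A query with no usable hit in the library stays unannotated.
    print(transfer_annotation("WWWWWWWW", REFERENCE))
```

In this sketch the query can only ever receive a label that already exists in the reference library, which is why the size and accuracy of that library limit what automated annotation can achieve.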

Objectives

Methods

Results

Conclusion