Abstract
Predicted open reading frames (ORFs) that lack detectable homology to known proteins are termed ORFans. Despite their prevalence in metagenomes, the extent to which ORFans encode real proteins, the degree to which they can be annotated, and their functional contributions remain unclear. To gain insights into these questions, we applied sensitive remote-homology detection methods to functionally analyze ORFans from soil, marine, and human gut metagenome collections. ORFans were identified, clustered into sequence families, and annotated through profile-profile comparison to proteins of known structure. We found that a considerable number of metagenomic ORFans (73,896 of 484,121, 15.3%) exhibit significant remote homology to structurally characterized proteins, providing a means for ORFan functional profiling. The extent of detected remote homology far exceeds that obtained for artificial protein families (1.4%). As expected for real genes, the predicted functions of ORFans are significantly similar to the functions of their gene neighbors (p < 0.001). Compared to the functional profiles predicted through standard homology searches, ORFans show biologically intriguing differences. Many ORFan-enriched functions are virus-related and tend to reflect biological processes associated with extreme sequence diversity. Each environment also possesses a large number of unique ORFan families and functions, including some known to play important community roles such as gut microbial polysaccharide digestion. Lastly, ORFans are a valuable resource for finding novel enzymes of interest, as we demonstrate through the identification of hundreds of novel ORFan metalloproteases that all possess a signature catalytic motif despite a general lack of similarity to known proteins. Our ORFan functional predictions are a valuable resource for discovering novel protein families and exploring the boundaries of protein sequence space.
All remote homology predictions are available at http://doxey.uwaterloo.ca/ORFans.
Highlights
Metagenomes are a rich resource of novel genes (Godzik, 2011) from which the metabolic and physiological activities of entire microbial communities can potentially be inferred (Handelsman, 2004).
Potential ORFans were identified as coding sequences (CDSs) whose products lacked detectable homology to known protein domain families (Pfam and Conserved Domain Database (CDD)) or proteins in the NCBI database.
ORFans Are Shorter but Compositionally Similar to Real Proteins from their Environments. We examined whether the detected ORFans share compositional characteristics with homology-annotatable CDSs from their environments.
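The ORFan definition and the compositional comparison in the highlights above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `find_orfans` and `aa_composition`, the toy sequences, and the hit sets are all hypothetical, and real hit sets would come from searches against Pfam, CDD, and the NCBI protein database.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def find_orfans(cds_ids, *hit_sets):
    """Return the CDS ids that appear in none of the homology hit sets."""
    annotated = set().union(*hit_sets)
    return [cds for cds in cds_ids if cds not in annotated]


def aa_composition(sequences):
    """Pooled relative frequency of each standard amino acid."""
    counts = Counter()
    for seq in sequences:
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


# Toy data (hypothetical): three predicted CDSs and their search results.
cds = {"cds1": "MKTAYIAK", "cds2": "MKV", "cds3": "MAAAL"}
pfam_hits = {"cds1"}   # CDSs with a Pfam domain hit
cdd_hits = set()       # CDSs with a CDD hit
ncbi_hits = {"cds1"}   # CDSs with a hit to an NCBI protein

orfans = find_orfans(cds, pfam_hits, cdd_hits, ncbi_hits)
orfan_comp = aa_composition(cds[c] for c in orfans)
annotated_comp = aa_composition(s for c, s in cds.items() if c not in orfans)
```

The two composition dictionaries can then be compared residue by residue, or summarized with any distance measure, to ask whether ORFans resemble the annotatable CDSs from the same environment.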
Summary
Metagenomes are a rich resource of novel genes (Godzik, 2011) from which the metabolic and physiological activities of entire microbial communities can potentially be inferred (Handelsman, 2004). This difficult task relies largely on the accuracy of current methods for predicting function from sequence, which is challenging even for single microbial genomes (Wooley et al., 2010). Metagenome-derived open reading frames (ORFs) are searched using BLAST (Altschul et al., 1997), or related tools, against reference protein databases such as the NCBI non-redundant (nr) and Swiss-Prot databases. If functionally annotated hits in the databases are detected, functions are inherited from these hits.
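The annotation-transfer step described above can be sketched as a small parser for BLAST tabular output (the standard 12-column `-outfmt 6` layout, with the E-value in column 11 and the bit score in column 12). This is an illustrative sketch only: the function name `inherit_annotations`, the `1e-5` E-value cutoff, the accession ids, and the function labels are assumptions and are not taken from the paper.

```python
def inherit_annotations(blast_lines, subject_functions, max_evalue=1e-5):
    """Assign each query the function of its best-scoring annotated hit.

    blast_lines: iterable of tab-separated BLAST -outfmt 6 rows.
    subject_functions: dict mapping subject id -> function label.
    max_evalue: assumed significance cutoff (hypothetical default).
    """
    best = {}  # query id -> (bit score, function)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        function = subject_functions.get(subject)
        if function is None or evalue > max_evalue:
            continue  # unannotated subject or non-significant hit
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, function)
    return {query: function for query, (_, function) in best.items()}


# Toy input (hypothetical ids and labels).
rows = [
    "orf1\tP12345\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-40\t150.0",
    "orf1\tQ99999\t30.0\t180\t126\t3\t1\t180\t9\t188\t1e-10\t60.0",
    "orf2\tP12345\t25.0\t50\t37\t1\t1\t50\t3\t52\t2.0\t20.0",
]
functions = {"P12345": "zinc metalloprotease", "Q99999": "transporter"}

annotations = inherit_annotations(rows, functions)
unannotated = {"orf1", "orf2"} - set(annotations)
```

Queries left in `unannotated` have no significant annotated hit; in a real analysis these would be the starting pool from which candidate ORFans are drawn.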