Abstract

Background

The proportion of conserved DNA sequences with no clear function is steadily growing in bioinformatics databases. Studies of sequence and structural homology have indicated that many uncharacterized protein domain sequences are variants of functionally described domains. If these variants promote an organism's ecological fitness, they are likely to be conserved in the genome of its progeny and the population at large. The genetic composition of microbial communities in their native ecosystems is accessible through metagenomics. We hypothesize that the co-variation of protein domain sequences across metagenomes from similar ecosystems will provide insights into their potential roles and aid further investigation.

Methodology/Principal findings

We calculated the correlation of Pfam protein domain sequences across the Global Ocean Sampling metagenome collection, employing conservative detection and correlation thresholds to limit results to well-supported hits and associations. We then examined intercorrelations between domains of unknown function (DUFs) and domains involved in known metabolic pathways using network visualization and cluster-detection tools. We used a cautious “guilty-by-association” approach, referencing knowledge-level resources to identify and discuss associations that offer insight into DUF function. We observed numerous DUFs associated with photobiologically active domains and prevalent in the Cyanobacteria. Other clusters included DUFs associated with DNA maintenance and repair, inorganic nutrient metabolism, and sodium-translocating transport domains. We also observed a number of clusters reflecting known metabolic associations and cases that predicted functional reclassification of DUFs.

Conclusion/Significance

Critically examining domain covariation across metagenomic datasets can grant new perspectives on the roles and associations of DUFs in an ecological setting. Targeted attempts at DUF characterization in the laboratory or in silico may draw from these insights, and opportunities to discover new associations and corroborate existing ones will arise as more large-scale metagenomic datasets emerge.
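
The correlate-then-cluster approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the Pearson coefficient, the 0.9 threshold, the toy abundance values, and the domain labels (`DUF_A`, `photo_domain`, and so on) are all assumptions made for illustration, and connected components stand in for the unspecified cluster-detection tools.

```python
import numpy as np

def correlated_clusters(counts, names, min_r=0.9):
    """Group domains whose abundance profiles correlate at or above min_r.

    counts: 2-D array, one row per domain, one column per metagenome sample.
    names:  hypothetical domain labels, one per row of counts.
    min_r:  assumed correlation threshold (not taken from the paper).
    """
    r = np.corrcoef(counts)  # Pearson correlation between every pair of rows
    n = len(names)
    # Keep only strong positive correlations and ignore self-correlation.
    adjacent = (r >= min_r) & ~np.eye(n, dtype=bool)

    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component of the network.
        stack, members = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            members.append(names[i])
            for j in np.flatnonzero(adjacent[i]):
                j = int(j)
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if len(members) > 1:  # singletons carry no association
            clusters.append(sorted(members))
    return clusters

# Toy abundance table: 4 hypothetical domains across 5 hypothetical samples.
names = ["DUF_A", "photo_domain", "DUF_B", "transporter"]
counts = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 1.0, 4.0, 2.0, 3.0],
    [4.9, 1.2, 4.1, 2.0, 3.1],
])
clusters = correlated_clusters(counts, names)
print(clusters)
```

In this toy table each DUF ends up grouped with one characterized domain, which is the kind of association a "guilty-by-association" reading would then examine against curated resources. As the highlights caution, such a grouping is not a functional annotation.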

Highlights

  • In recent years, genomic sequencing projects have revealed a large number of novel genes across a wide range of organisms and environments

  • We observed that intercorrelation of protein domain sequences across intra-ecosystem metagenomic datasets can provide perspectives on the potential roles of domains of unknown function

  • Even strong correlation across metagenomic datasets cannot provide direct functional annotations, as numerous factors may account for domain covariance in natural systems

Summary

Introduction

Genomic sequencing projects have revealed a large number of novel genes across a wide range of organisms and environments. Many of these have poor sequence-level similarity to genes that have been characterized in a laboratory setting and have not been annotated with functional roles. The Pfam 24 database [1] stored some 11,912 protein domain families derived from conserved sequence data, with ~26% dubbed “domains of unknown function” (DUFs). This proportion is predicted to soon overtake that of functionally characterized domains [2], and calls for community action [3] and cross-disciplinary efforts [4] towards their identification have been made. We hypothesize that the co-variation of protein domain sequences across metagenomes from similar ecosystems will provide insights into their potential roles and aid further investigation.

