Abstract

Background: Biochemical and regulatory pathways have until recently been conceived and modelled within one cell type, one organism and one species. This vision is being dramatically changed by the advent of whole microbiome sequencing studies, which reveal the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape requires the reconstruction of biochemical and regulatory pathways at the community level in a given environment. To understand how environmental factors affect the genetic material and the dynamics of expression from one environment to another, we want to evaluate the quantity of gene protein sequences or transcripts associated with a given pathway by precisely estimating the abundance of protein domains, including their weak presence or absence, in environmental samples.

Results: MetaCLADE is a novel profile-based domain annotation pipeline built on a multi-source domain annotation strategy. It applies directly to reads and improves identification of the catalog of functions in microbiomes. MetaCLADE is applied to simulated data and to more than ten metagenomic and metatranscriptomic datasets from different environments, where it outperforms InterProScan in the number of annotated domains. It is compared to the state-of-the-art non-profile-based and profile-based methods, UProC and HMM-GRASPx, and shows predictions complementary to those of UProC. Combining MetaCLADE and UProC further improves the functional annotation of environmental samples.

Conclusions: Learning about the functional activity of environmental microbial communities is a crucial step towards understanding microbial interactions and large-scale environmental impact. MetaCLADE has been explicitly designed for metagenomic and metatranscriptomic data and, thanks to its multi-source strategy, allows the discovery of patterns in divergent sequences. MetaCLADE greatly improves on current domain annotation methods and reaches a fine degree of accuracy in the annotation of very different environments such as soil and marine ecosystems, ancient metagenomes and human tissues.

Highlights

  • Biochemical and regulatory pathways have until recently been conceived and modelled within one cell type, one organism and one species

  • With the introduction of MetaCLADE, we push this idea forward and show that clade-centered models (CCMs) can be used to successfully annotate fragmented coding sequences in metagenomic/metatranscriptomic (MG/MT) datasets, where domain divergence and species variability might be very large

  • MetaCLADE takes a dataset of reads as input and searches for domains using a library of more than two million probabilistic models (CCMs and sequence consensus models, SCMs) [43] associated with almost 15,000 Pfam domains

Summary

Introduction

Biochemical and regulatory pathways have until recently been conceived and modelled within one cell type, one organism and one species. Computational studies improving the detection of the functional preferences of environmental communities are important for gaining insight into ecosystem changes [10,11,12,13,14,15]. They should quantitatively relate genetic information to environmental factors in order to understand how these factors affect the genetic material and the dynamics of expression from one environment to another, from one community to another. A second difficulty is that environmental coding sequences are fragmented, and annotating partial information becomes harder because of the much reduced sequence length. In this respect, since protein-coding sequences might be too long compared to reads, environmental sequence classification can either perform a simultaneous alignment and assembly of reads using reference proteins or probabilistic protein sequence profiles [35,36,37], hoping to improve the sensitivity of detecting significant matches, or focus on annotating protein domains directly on sequencing reads [38]. With the production of ever larger MG/MT datasets and the exploration of new environments (possibly gathering many unknown species), contig reconstruction might become even more challenging if carried out without the help of domain annotation.
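
To make the idea of annotating domains directly on reads more concrete, the following is a minimal sketch of how per-read domain hits might be turned into relative domain abundances for one sample. It is not MetaCLADE's implementation: the function name `relative_domain_abundance`, the `(read_id, domain_id, score)` hit format and the rule of keeping a single best-scoring domain per read are all assumptions made for illustration.

```python
from collections import Counter


def relative_domain_abundance(hits):
    """Return the fraction of annotated reads assigned to each domain.

    `hits` is an iterable of (read_id, domain_id, score) tuples. This
    format is hypothetical and does not correspond to any tool's output.
    """
    best = {}
    for read_id, domain_id, score in hits:
        # Simplifying assumption: keep only the highest-scoring hit per read.
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (domain_id, score)

    # Count how many reads were assigned to each domain.
    counts = Counter(domain_id for domain_id, _ in best.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalise counts so that abundances sum to 1 within the sample.
    return {domain_id: n / total for domain_id, n in counts.items()}


# Toy input: read r1 has two candidate domains; only the best one is kept.
example_hits = [
    ("r1", "PF00001", 50.0),
    ("r1", "PF00002", 30.0),
    ("r2", "PF00001", 20.0),
    ("r3", "PF00003", 10.0),
]
abundance = relative_domain_abundance(example_hits)
```

Normalising within a sample is one simple way to make samples of different sequencing depth comparable. A real analysis would also need to handle reads carrying several domains and apply score thresholds, both of which this sketch leaves out.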

