Metagenome and Metatranscriptome Analyses Using Protein Family Profiles.

Cuncong Zhong, Jeffrey S McLean, Shibu Yooseph, Anna Edlund, Youngik Yang, Arne Elofsson

doi:10.1371/journal.pcbi.1004991

Abstract

Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hampers downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of probabilistic models for protein families that allow for accurate modeling of position-specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates the banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes. Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from http://sourceforge.net/projects/hmm-graspx.

Highlights

Metagenomics (MG) and Metatranscriptomics (MT) are culture-independent methodologies [1,2] empowered by next-generation sequencing (NGS) technologies [3,4], which respectively enable genome and transcriptome profiling (RNA-seq) of the microbes in a given environment.
The problem is further compounded by the current high-throughput sequencing technologies that generate short reads, which usually leads to short alignments that only contain partial information regarding the structural features of the protein.
GRASP was designed to address these limitations by evaluating the sequence similarity between the query sequence and the reconstructed protein contigs, which resulted in longer alignments and improved accuracy [24].

Summary

Introduction

Metagenomics (MG) and Metatranscriptomics (MT) are culture-independent methodologies [1,2] empowered by next-generation sequencing (NGS) technologies [3,4], which respectively enable genome and transcriptome profiling (RNA-seq) of the microbes in a given environment. Identification of differentially expressed (DE) genes is based on comparisons of mRNA abundances across different conditions. Both MG and MT approaches rely heavily on accurate estimation of the DNA/mRNA abundances in the sample. De novo assembly can be challenging due to uneven and/or low coverage of the constituent organisms, leading to fragmentary assembly for many data sets. These issues have been partly alleviated through the de novo short peptide assembly approach [20,21], which aims at reconstructing complete protein sequences and is not hampered by synonymous DNA mutations.
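
To make concrete the statement that DE detection rests on comparing abundances across conditions, the following is a minimal Python sketch. It is not the procedure used by HMM-GRASPx or by any DE package: the function names, the reads-per-million scaling, the pseudocount, and the toy counts are all assumptions chosen for illustration, and a real analysis would use replicates and a statistical test rather than a raw fold change.

```python
import math


def reads_per_million(counts):
    """Scale raw per-family read counts so the sample sums to one million.

    `counts` maps a protein family name to the number of reads assigned
    to it (hypothetical input; any read-assignment method could feed it).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return {family: 1e6 * n / total for family, n in counts.items()}


def log2_fold_change(cond_a, cond_b, pseudocount=1.0):
    """Return log2(abundance in B / abundance in A) for every family.

    Abundances are normalized per sample first, so that differences in
    sequencing depth do not masquerade as expression changes. The
    pseudocount (an assumption of this sketch) avoids division by zero
    for families observed in only one condition.
    """
    rpm_a = reads_per_million(cond_a)
    rpm_b = reads_per_million(cond_b)
    families = sorted(set(rpm_a) | set(rpm_b))
    return {
        f: math.log2(
            (rpm_b.get(f, 0.0) + pseudocount)
            / (rpm_a.get(f, 0.0) + pseudocount)
        )
        for f in families
    }


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    condition_a = {"famA": 100, "famB": 100}
    condition_b = {"famA": 300, "famB": 100}
    for family, lfc in log2_fold_change(condition_a, condition_b).items():
        print(f"{family}\t{lfc:+.3f}")
```

Because the fold change is computed from normalized abundances, any error in assigning reads to a family propagates directly into the comparison, which is why accurate abundance estimation matters for both MG and MT analyses.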

Methods

Results

Conclusion