Background: Current methods for comparing metagenomes derived from whole-genome sequencing reads include top-down metrics or parametric models, such as metagenome diversity, and bottom-up, non-parametric, model-free machine learning approaches, such as Naïve Bayes for k-mer profiling. However, both types are limited in their ability to comprehensively identify and catalogue unique or enriched metagenomic genes, a critical task in comparative metagenomics. The task is NP-hard: no polynomial-time exact algorithm is known, and the running time of known exact methods grows exponentially or faster with problem size, which makes exact solutions impractical even on the fastest supercomputers without heuristic approximation algorithms.

Method: In this study, we introduce a new framework, MC (Metagenome-Comparison), designed to exhaustively detect and catalogue unique or enriched metagenomic genes (MGs) and their derivatives, including metagenome functional gene clusters (MFGCs) or, more generally, operational metagenomic units (OMUs), which can be considered the counterpart of the operational taxonomic units (OTUs) obtained from amplicon sequencing reads. MC is essentially a heuristic search algorithm guided by pairs of new metrics (MG-specificity or OMU-specificity, and MG-specificity diversity or OMU-specificity diversity). The search is further constrained by statistical significance (P-value), implemented as a pair of statistical tests.

Results: We evaluated MC using large metagenomic datasets related to obesity, diabetes, and inflammatory bowel disease (IBD). The proportions of unique and enriched metagenomic genes ranged from 0.001% to 0.08% and from 0.08% to 0.82%, respectively, and were less than 10% for MFGCs.

Conclusion: MC provides a robust method for comparing metagenomes at various scales, from baseline MGs to various function/pathway clusters of metagenomes, collectively termed OMUs.