Measuring Cluster Stability in a Large Scale Phylogenetic Analysis of Functional Genes in Metagenomes Using pplacer.

Tyler A Land, Robin B Kodner, Perry Fizzano

doi:10.1109/tcbb.2015.2446470

Abstract

Analysis of metagenomic sequence data requires a multi-stage workflow. The results of each intermediate step possess an inherent uncertainty and potentially impact the as-yet-unmeasured statistical significance of downstream analyses. Here, we describe our phylogenetic analysis pipeline, which uses the pplacer program to place many shotgun sequences corresponding to a single functional gene onto a fixed phylogenetic tree. We then use the squash clustering method to compare multiple samples with respect to that gene. We approximate the statistical significance of each gene's clustering result by measuring its cluster stability: the consistency of that clustering result, as measured by the adjusted Rand Index, when the probabilistic placements made by pplacer are systematically reassigned and then clustered again. We find that among the genes investigated, the majority of analyses are stable, based on the average adjusted Rand Index. We investigated properties of each gene that may explain less stable results. When examined in aggregate, these genes tended to have less convex reference trees, fewer total reads recruited to the gene, and a greater Expected Distance between Placement Locations as given by pplacer. However, for an individual functional gene, these measures alone do not predict cluster stability.
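As an illustration of the stability measure described above, the following minimal Python sketch computes the adjusted Rand Index between two clusterings of the same samples and averages it over a set of re-clustered results. It is a sketch under stated assumptions, not the authors' implementation: the function names `adjusted_rand_index` and `mean_stability` are hypothetical, each clustering is assumed to be a flat list of cluster labels (one per sample), and the placement reassignment and squash clustering steps that would produce those labels are not shown.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the clusterings trivially agree

    # Pairs of items grouped together in both clusterings (contingency cells).
    together_in_both = sum(
        comb(count, 2) for count in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs grouped together within each clustering on its own.
    together_in_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_in_b = sum(comb(count, 2) for count in Counter(labels_b).values())

    expected = together_in_a * together_in_b / total_pairs
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in its own cluster
    return (together_in_both - expected) / (maximum - expected)


def mean_stability(baseline, perturbed_clusterings):
    """Average ARI of a baseline clustering against re-clustered results.

    `perturbed_clusterings` is assumed to hold one label list per
    reassignment-and-recluster round (hypothetical input format).
    """
    scores = [adjusted_rand_index(baseline, p) for p in perturbed_clusterings]
    return sum(scores) / len(scores)
```

A value near 1 indicates that the re-clustered results reproduce the baseline grouping of samples, while values near 0 indicate agreement no better than chance. The index depends only on which samples are grouped together, so relabeling the clusters does not change it.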

Full Text