Abstract

Background: The exponential decrease in molecular sequencing cost generates unprecedented amounts of data. Hence, scalable methods to analyze these data are required. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular approach is used to scrutinize metagenomic samples from environments such as water, soil, or the human gut.

Novel methods: Here, we present novel and, more importantly, highly scalable methods for analyzing phylogenetic placements of metagenomic samples. More specifically, we introduce methods for (a) visualizing differences between samples and their correlation with associated meta-data on the reference phylogeny, (b) clustering similar samples using a variant of the k-means method, and (c) finding phylogenetic factors using an adaptation of the Phylofactorization method. These methods make it possible to interpret metagenomic data in a phylogenetic context, to find patterns in the data, and to identify the branches of the phylogeny that drive these patterns.

Results: To demonstrate the scalability and utility of our methods, and to provide exemplary interpretations of their output, we applied them to three publicly available datasets comprising 9782 samples with a total of approximately 168 million sequences. The results indicate that new biological insights can be attained via our methods.

Highlights

  • The availability of high-throughput DNA sequencing technologies has revolutionized biology by transforming it into an ever more data-driven and compute-intense discipline [1]

  • We present an adaptation of the Phylogenetic Isometric Log-Ratio (PhILR) transformation and balances [30] to phylogenetic placement data

  • To interpret what the axes of these principal components mean, we can again visualize the Principal Components Analysis (PCA) eigenvectors on the reference tree, as done in Edge PCA [29]; cf. S5 Fig. The results for the PCA of the balances are shown in S12 Fig. As with Edge PCA, the principal components correspond to the Lactobacillus clade, with the first component mostly separating Lactobacillus from the rest of the tree, and the second component further distinguishing between Lactobacillus crispatus and Lactobacillus iners.
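As a rough illustration of what a balance is, the following minimal sketch computes the standard isometric log-ratio balance between two groups of strictly positive values, for example the placement masses on the two sides of a split in the reference tree. It is not the authors' implementation: the function name `balance` and the example masses are hypothetical, the form shown is unweighted and omits any weighting an adaptation such as PhILR may apply, and how zero masses are handled is not specified here.

```python
import math


def balance(numerator_masses, denominator_masses):
    """Unweighted isometric log-ratio balance between two groups of parts.

    Hypothetical helper for illustration only. Both arguments are
    sequences of strictly positive numbers (e.g. per-branch masses).
    """
    r = len(numerator_masses)
    s = len(denominator_masses)
    if r == 0 or s == 0:
        raise ValueError("both groups need at least one part")
    if any(m <= 0 for m in list(numerator_masses) + list(denominator_masses)):
        raise ValueError("all masses must be strictly positive")

    # Logarithms of the geometric means of the two groups.
    log_gmean_num = sum(math.log(m) for m in numerator_masses) / r
    log_gmean_den = sum(math.log(m) for m in denominator_masses) / s

    # The scaling factor sqrt(r*s / (r+s)) makes the log-ratio an
    # orthonormal coordinate.
    return math.sqrt(r * s / (r + s)) * (log_gmean_num - log_gmean_den)


# Made-up example: more mass on one side of the split than on the other.
example = balance([0.4, 0.4], [0.1, 0.1])
```

A positive value indicates that the first group carries more mass (in the geometric-mean sense) than the second, a negative value the opposite, and zero means the two groups are in balance. Because each balance is a single real number per split, a set of them can be fed into standard tools such as PCA.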


Summary

Introduction

The availability of high-throughput DNA sequencing technologies has revolutionized biology by transforming it into an ever more data-driven and compute-intense discipline [1]. Next Generation Sequencing (NGS) [2], as well as later generations [3,4,5,6], have given rise to novel methods for studying microbial environments [7,8,9,10]. These technologies are often used in metagenomic studies to sequence organisms in water [11,12,13] or soil [14,15] samples, in the human microbiome [16,17,18], and in a plethora of other environments. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular approach is used to scrutinize metagenomic samples from environments such as water, soil, or the human gut. The methods presented here make it possible to interpret metagenomic data in a phylogenetic context, to find patterns in the data, and to identify the branches of the phylogeny that drive these patterns.

