Fizzy: feature subset selection for metagenomics.

Gregory Ditzler, Gail L Rosen, Yemin Lan, J Calvin Morrison

doi:10.1186/s12859-015-0793-8

Open Access

https://doi.org/10.1186/s12859-015-0793-8

Journal: BMC Bioinformatics	Publication Date: Nov 4, 2015
Citations: 66	License type: CC BY 4.0

Affiliation: University of Arizona, Drexel University

Abstract

Background: Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- and β-diversity. Feature subset selection, a sub-field of machine learning, can also provide a unique insight into the differences between metagenomic or 16S phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have a high level of influence on the condition being studied. For example, in a previous study we used information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between age groups in the human gut microbiome.

Results: We have developed a new Python command line tool for microbial ecologists, compatible with the widely adopted BIOM format, that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets.

Conclusions: We have made the software implementation of Fizzy available to the public under the GNU GPL license. The standalone implementation can be found at http://github.com/EESI/Fizzy.

Highlights

Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- and β-diversity.
Microbial ecologists represent the composition in the form of an abundance matrix, which usually holds counts of operational taxonomic units (OTUs), but can hold counts of gene or metabolic pathway occurrences if the data are collected from whole genome shotgun (WGS) sequencing.
Feature subset selection provides an avenue for rapid insight into the taxonomic or functional differences that can be found between different metagenomic or 16S phenotypes in an environmental study.

Summary

Introduction

Some of the current software tools for comparative metagenomics provide ecologists with the ability to investigate and explore bacterial communities using α- and β-diversity. Sequences from bacterial communities are collected from whole genome shotgun (WGS) or amplicon sequencing runs, and the analysis of such data allows researchers to study the functional or taxonomic composition of a sample. Microbial ecologists represent the composition in the form of an abundance matrix, which usually holds counts of operational taxonomic units (OTUs), but can hold counts of gene or metabolic pathway occurrences if the data are collected from WGS. Subset selection can produce a feature subset that excludes irrelevant features (i.e., features that do not carry information about the phenotype) and redundant features (i.e., features that carry the same information as one another). This process of reducing the feature set offers rapid insight into the differences between multiple populations in a metagenomic study and can be performed as a complementary analysis to β-diversity methods, such as PCoA. Feature selection has been performed previously with tools such as Random Forests [8] and LEfSe [9], but is usually tied to a classification type or effect size.
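The distinction between irrelevant and redundant features can be made concrete with a small sketch. The code below is not Fizzy's implementation or command-line interface; it is a minimal illustration, assuming a discretized (here presence/absence) abundance matrix and a greedy relevance-minus-redundancy criterion in the style of mRMR. The function names `mutual_information` and `select_features`, and the toy data, are hypothetical and introduced only for this example.

```python
import numpy as np


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits for two discrete 1-D arrays."""
    # Map arbitrary discrete values to contiguous indices 0..k-1.
    _, xi = np.unique(np.asarray(x), return_inverse=True)
    _, yi = np.unique(np.asarray(y), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    # Build the joint probability table from co-occurrence counts.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of X, shape (kx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, ky)
    nz = joint > 0  # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))


def select_features(X, y, k):
    """Greedily pick k columns of X (samples x features).

    Each step picks the feature with the highest relevance I(f; y) minus
    its mean redundancy with the features already selected.
    """
    n_features = X.shape[1]
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    redundancy = np.zeros(n_features)  # running sum of I(f; s) over selected s
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        scores = relevance[remaining]
        if selected:
            scores = scores - redundancy[remaining] / len(selected)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        for j in remaining:
            redundancy[j] += mutual_information(X[:, j], X[:, best])
    return selected


# Toy data (made up): 8 samples, 4 presence/absence features, 2 phenotypes.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 0: informative about y
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 1: exact copy of feature 0 (redundant)
    [1, 0, 0, 0, 1, 1, 1, 0],  # feature 2: weakly informative, not a copy
    [0, 1, 0, 1, 0, 1, 0, 1],  # feature 3: carries no information about y
])

selected = select_features(X, y, k=2)
print(selected)  # [0, 2]
```

In this toy example the duplicated feature 1 is as relevant as feature 0 but is passed over because it adds nothing new, and feature 3 is passed over because it carries no information about the phenotype. Real abundance counts would first need to be discretized, and the choice of binning is itself an assumption that can affect the selected subset.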

Methods

Results

Conclusion