MLgsc: A Maximum-Likelihood General Sequence Classifier.

Thomas Junier, Pilar Junier, Vincent Hervé, Tina Wunderlin, I King Jordan

doi:10.1371/journal.pone.0129384

Open Access

https://doi.org/10.1371/journal.pone.0129384

Abstract

We present a software package for classifying protein or nucleotide sequences to user-specified sets of reference sequences. The software trains a model using a multiple sequence alignment and a phylogenetic tree, both supplied by the user. The latter is used to guide model construction and as a decision tree to speed up the classification process. The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. On this dataset, the software was shown to achieve an error rate of around 1% at genus level. Examples of applications based on the nitrogenase subunit NifH gene and a protein-coding gene found in endospore-forming Firmicutes are also presented. The programs in the package have a simple, straightforward command-line interface for the Unix shell, and are free and open-source. The package has minimal dependencies and thus can be easily integrated in command-line based classification pipelines.

Highlights

Reconstructing environmental communities of microorganisms often involves identifying lineages from a nucleotide or protein sequence
The Ribosomal Database Project (RDP) classifier [6] as well as SCIMM [7] and TACOA [8] belong to this category
MLgsc constructs a tree of position-specific weight matrices (PWMs) using a multiple alignment of sequences from the classifying region and a phylogenetic tree of the reference taxa (Fig 1)

Summary

Introduction

Reconstructing environmental communities of microorganisms often involves identifying lineages from a nucleotide or protein sequence. Phylogeny-based methods classify by placing the query in a phylogenetic tree along with references and examining its relatives. To this class belong, among others, EPA [9] and pplacer [10]. One alternative is to use a similarity-based method such as BLAST on a customized database of references. Another is to train a gene-specific classifier on the gene of interest. MLgsc has a simple interface: training the model and using it to classify sequences are each performed as a single shell command involving at most a few arguments and options. This makes it straightforward to include in shell-based classification (or other) pipelines. The MLgsc package consists of three programs: mlgsc_xval, mlgsc_train, and mlgsc, which perform cross-validation, training, and classification, respectively (see Table 1).
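The general idea of training on an alignment plus a tree and then classifying by walking that tree can be illustrated with a toy sketch. This is not MLgsc's own code: the names `build_pwm`, `score`, `Node`, and `classify`, the DNA-plus-gap alphabet, the pseudocount of 1, and the tiny six-column alignment are all assumptions made for illustration. Each node holds a position-specific weight matrix (PWM) built from the aligned reference sequences beneath it, and a query descends from the root by choosing, at each level, the child whose PWM gives it the highest log-likelihood.

```python
import math
from collections import Counter

ALPHABET = "ACGT-"  # assumed residue alphabet for this toy example


def build_pwm(seqs, pseudocount=1.0):
    """Return per-column log-probabilities for equal-length aligned sequences."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pwm.append({r: math.log((counts.get(r, 0) + pseudocount) / total)
                    for r in ALPHABET})
    return pwm


def score(pwm, query):
    """Log-likelihood of an aligned query under a PWM."""
    return sum(col[res] for col, res in zip(pwm, query))


class Node:
    """Tree node whose PWM is trained on all reference sequences below it."""

    def __init__(self, name, children=(), seqs=()):
        self.name = name
        self.children = list(children)
        self.seqs = list(seqs) + [s for c in self.children for s in c.seqs]
        self.pwm = build_pwm(self.seqs)


def classify(root, query):
    """Greedy descent: at each node, follow the best-scoring child."""
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda c: score(c.pwm, query))
        path.append(node.name)
    return path


# Hypothetical reference taxa with made-up aligned sequences.
genus_a = Node("GenusA", seqs=["ACGTAC", "ACGTAA"])
genus_b = Node("GenusB", seqs=["ACGGTC", "ACGGTT"])
genus_c = Node("GenusC", seqs=["TTGCAC", "TTGCAA"])
family1 = Node("Family1", children=[genus_a, genus_b])
family2 = Node("Family2", children=[genus_c])
root = Node("root", children=[family1, family2])

print(classify(root, "ACGTAC"))
```

Because only the children of the current node are scored at each step, the number of PWM evaluations grows with the depth and branching of the tree rather than with the total number of reference taxa, which is the sense in which a tree can act as a decision tree. A real classifier would also need to align the query to the reference alignment first; the sketch assumes the query is already aligned.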

Procedure

Method

Findings

Discussion

Conclusion