HIPPI: highly accurate protein family classification with ensembles of HMMs.

Nam-Phuong Nguyen, Michael Nute, Tandy Warnow, Siavash Mirarab

doi:10.1186/s12864-016-3097-0

Abstract

Background: Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics.

Results: We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy.

Conclusion: HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile hidden Markov models can better represent multiple sequence alignments than a single profile hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3097-0) contains supplementary material, which is available to authorized users.

Highlights

Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others.
Techniques for protein family classification and gene binning operate in two basic steps: first, the query sequence is compared to each family in a published database and the probability of membership in the family is assessed; second, the family with the highest probability is returned for that query sequence, provided the probability is above a required minimum threshold.
We present a comparison between Hierarchical profile HMMs for protein family identification (HIPPI), HMMER, blastp, and the HHblits+HHsearch pipeline for the problem of protein family identification using the Pfam-A database of protein families [13].
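
The two-step scheme described in the highlights (score the query against every family, then return the best family only if it clears a minimum threshold) can be sketched as follows. This is a minimal illustration, not HIPPI's implementation: the scoring function `kmer_overlap`, the family names, the toy sequences, and the threshold of 0.5 are all assumptions made for the example. A real pipeline would obtain the membership score from a tool such as HMMER or blastp.

```python
from typing import Callable, Dict, List, Optional


def kmer_overlap(query: str, members: List[str], k: int = 3) -> float:
    """Hypothetical stand-in score: fraction of the query's k-mers
    that occur in at least one member sequence of the family."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    if not query_kmers:  # query shorter than k
        return 0.0
    family_kmers = {m[i:i + k] for m in members for i in range(len(m) - k + 1)}
    return len(query_kmers & family_kmers) / len(query_kmers)


def classify(
    query: str,
    families: Dict[str, List[str]],
    score: Callable[[str, List[str]], float] = kmer_overlap,
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the best-scoring family, or None if no family clears the threshold."""
    # Step 1: compare the query to every family and assess membership.
    scores = {name: score(query, members) for name, members in families.items()}
    if not scores:
        return None
    # Step 2: keep the top family only if its score is above the minimum.
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


# Toy reference database (illustrative sequences, not taken from Pfam).
FAMILIES = {
    "famA": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "famB": ["GGSGGSGGSG"],
}

print(classify("KTAYIAK", FAMILIES))  # fragment that matches famA
print(classify("WWWWWW", FAMILIES))   # no family clears the threshold
```

Returning `None` when no family clears the threshold keeps "unclassified" distinct from a weak best match, which matters for fragmentary or distantly related queries.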

Summary

Introduction

The assignment of newly obtained molecular sequences to gene families or protein families and superfamilies is a fundamental step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. One of the simplest methods for homology detection is BLAST [9], including variations designed for proteins, such as blastp and PSI-BLAST [10]. While these sequence similarity-based approaches have good accuracy in many conditions, they can have poor accuracy when classifying query sequences that have low sequence similarity to all the sequences in the reference database [11].

Objectives

Methods

Results

Conclusion