Application of nonnegative matrix factorization to improve profile-profile alignment features for fold recognition and remote homolog detection

Soo-Young Lee,Inkyung Jung,Jaehyung Lee,Dongsup Kim

doi:10.1186/1471-2105-9-298

Abstract

Background: Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features. This unique ability makes NMF a potentially promising method for biological sequence analysis. Here, we apply NMF to fold recognition and remote homolog detection problems. Recent studies have shown that combining support vector machines (SVM) with profile-profile alignments improves performance of fold recognition and remote homolog detection remarkably. However, it is not clear which parts of sequences are essential for the performance improvement.

Results: The performance of fold recognition and remote homolog detection using NMF features is compared to that of the unmodified profile-profile alignment (PPA) features by estimating Receiver Operating Characteristic (ROC) scores. The overall performance is noticeably improved. For fold recognition at the fold level, SVM with NMF features recognizes 30% of homologous proteins at ROC scores > 0.99, while the original PPA features, HHsearch, and PSI-BLAST recognize almost none. For detecting remote homologs that are related at the superfamily level, NMF features also achieve higher performance than the original PPA features. At ROC50 scores > 0.90, 25% of proteins correctly detect remotely related proteins with NMF features, whereas only 1% do so with the original PPA features. In addition, we investigate the effect of the number of positive training examples and the number of basis vectors on performance improvement. We also analyze the ability of NMF to extract essential features by comparing NMF basis vectors with functionally important sites and structurally conserved regions of proteins. The results show that NMF basis vectors have significant overlap with functional sites from PROSITE and with structurally conserved regions from the multiple structural alignments generated by MUSTANG. The correlation between NMF basis vectors and biologically essential parts of proteins supports our conjecture that NMF basis vectors can explicitly represent important sites of proteins.

Conclusion: The present work demonstrates that applying NMF to profile-profile alignments can reveal essential features of proteins and that these features significantly improve the performance of fold recognition and remote homolog detection.

Highlights

Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features.
Performance comparison for fold recognition at the fold level: we describe the fold recognition performance of support vector machines (SVM) with NMF features compared to that of profile-profile alignment (PPA) features, along with HHsearch and PSI-BLAST results.
NMF improves the performance roughly fifty-fold at a Receiver Operating Characteristic (ROC) score of > 0.90. These results indicate that NMF removes noise that may have originated from poor alignments or improper features in the original PPA method, thereby enhancing fold recognition performance.

Summary

Introduction

Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features. This unique ability makes NMF a potentially promising method for biological sequence analysis. Recent studies have shown that combining support vector machines (SVM) with profile-profile alignments improves performance of fold recognition and remote homolog detection remarkably. However, it is not clear which parts of sequences are essential for the performance improvement. Due to the non-negativity constraint, the parts produced by NMF can be interpreted as subsets of elements that tend to occur together in a sub-portion of the dataset [2]. In this way, NMF can be applied to multidimensional datasets in order to discover patterns and to help the interpretation of large biological datasets. It can provide valuable information about the functional role and structure of unknown proteins.
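To make the part-based decomposition concrete, the following is a minimal sketch of NMF using the standard multiplicative update rules for the squared reconstruction error. It is an illustration under stated assumptions, not the procedure used in this work: the function name `nmf`, the toy matrix `V`, its orientation (rows as samples, columns as features), the rank, and the iteration count are all hypothetical choices.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (n x m) into W (n x rank) and H (rank x m).

    Uses multiplicative updates for the squared (Frobenius) error.
    All parameter names and defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Strictly positive random initialization keeps every update well defined.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data (hypothetical): 6 samples x 8 features generated from 2 nonnegative parts.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 8))

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, rel_err)
```

With this orientation, each row of `H` plays the role of a basis vector over the original features, and each row of `W` gives the nonnegative weights with which a sample combines those parts. Because no entry can be negative, parts can only add up and never cancel, which is what makes the basis vectors readable as co-occurring subsets of features.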

Methods

Results

Conclusion