A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

Bin Liu, Lei Lin, Xiaolong Wang, Qiwen Dong, Xuan Wang

doi:10.1186/1471-2105-9-510

Open Access

https://doi.org/10.1186/1471-2105-9-510

Journal: BMC Bioinformatics
Publication Date: Dec 1, 2008
Citations: 182
License type: CC BY 2.0

Affiliation: Harbin Institute of Technology

Abstract

Background: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. A key step in improving the performance of SVM-based methods is to find a suitable representation of protein sequences.

Results: In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from protein sequence frequency profiles. The frequency profiles are calculated from the multiple sequence alignments output by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each Top-n-gram. The training vectors are used to train SVM classifiers, which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results than related methods.

Conclusion: The method based on Top-n-grams significantly outperforms methods based on many other building blocks, including N-grams, patterns, motifs and binary profiles. Top-n-grams are therefore a good building block of protein sequences and can be widely used in many tasks in computational biology, such as sequence alignment, the prediction of domain boundaries, the design of knowledge-based potentials and the prediction of protein binding sites.
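The conversion from frequency profile to feature vector described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that a Top-n-gram is formed, at each profile position, by joining the n most frequent amino acids in descending order of frequency, and that the vocabulary is every ordered selection of n distinct residues. The names `top_n_gram`, `profile_to_top_n_grams` and `feature_vector`, the dict-per-column profile layout and the alphabetical tie-breaking are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import permutations

# The 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_gram(column, n):
    """Join the n most frequent amino acids of one profile column.

    `column` maps one-letter amino acid codes to frequencies; residues
    missing from the dict count as frequency 0. Ties are broken
    alphabetically (an assumption of this sketch).
    """
    ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
    return "".join(ranked[:n])


def profile_to_top_n_grams(profile, n):
    """Convert a profile (a list of columns) into a list of Top-n-grams."""
    return [top_n_gram(column, n) for column in profile]


def feature_vector(profile, n):
    """Count each Top-n-gram over a fixed vocabulary.

    The vector length depends only on n, not on the protein length,
    which is what makes the representation usable as SVM input.
    """
    vocabulary = ["".join(p) for p in permutations(AMINO_ACIDS, n)]
    counts = Counter(profile_to_top_n_grams(profile, n))
    return [counts[gram] for gram in vocabulary]


# Toy three-position profile (made-up frequencies, not real PSI-BLAST output).
example_profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"A": 0.7, "S": 0.2, "G": 0.1},
]
print(profile_to_top_n_grams(example_profile, 2))
```

With n = 1 the vector has 20 entries, one per amino acid. With n = 2 this sketch yields 380 entries because it excludes repeated residues within a gram; a vocabulary that allowed repeats would have 400.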

Highlights

Protein remote homology detection and fold recognition are central problems in bioinformatics
In this study we present a novel building block of proteins called Top-n-grams to use the evolutionary information of the protein sequence frequency profiles and apply this novel building block to remote homology detection and fold recognition
We present a novel representation of protein sequences based on Top-n-grams and apply the latent semantic analysis to improve the prediction performance of both protein remote homology detection and fold recognition

Summary

Introduction

Protein remote homology detection and fold recognition are central problems in bioinformatics. Some heuristic algorithms, such as BLAST [3] and FASTA [4], trade reduced accuracy for improved efficiency. These methods do not perform well for remote homology detection, because the alignment score falls into a twilight zone when the sequence similarity between proteins is below 35% at the amino acid level [5]. Methods such as the profile hidden Markov model (HMM) [7] can be trained iteratively in a semi-supervised manner using both positively labeled and unlabeled samples of a particular family, by pulling in close homologs and adding them to the positive set [8]. Discriminative algorithms such as support vector machines (SVMs) [9] provide state-of-the-art performance. Another approach is the feature-space-based kernel, which chooses a proper feature space, represents each sequence as a vector in that space, and takes the inner product (or a function derived from it) between these vector-space representations as the kernel for the sequences [10].
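The feature-space-based kernel mentioned above can be sketched in a few lines. This is a minimal illustration, assuming each sequence has already been mapped to a fixed-length numeric vector. The function names `linear_kernel`, `normalized_kernel` and `gram_matrix` are hypothetical, and the cosine normalization is only one example of "a function derived from" the inner product, not necessarily the one used in the paper.

```python
import math


def linear_kernel(x, y):
    """Inner product of two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    return sum(a * b for a, b in zip(x, y))


def normalized_kernel(x, y):
    """Cosine-normalized inner product.

    Returns 0.0 when either vector is all zeros, so the division is
    always defined.
    """
    denominator = math.sqrt(linear_kernel(x, x) * linear_kernel(y, y))
    return linear_kernel(x, y) / denominator if denominator else 0.0


def gram_matrix(vectors, kernel=linear_kernel):
    """Pairwise kernel values for a list of feature vectors."""
    return [[kernel(u, v) for v in vectors] for u in vectors]


# Toy count vectors for three sequences (made-up values).
vectors = [[2, 0, 1], [1, 3, 1], [0, 0, 4]]
for row in gram_matrix(vectors, normalized_kernel):
    print([round(value, 3) for value in row])
```

A precomputed matrix of this kind is the form in which many SVM packages accept a custom kernel, which is why the sketch returns the full pairwise table rather than single values.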



Similar Papers

Protein Remote Homology Detection and Fold Recognition based on Features Extracted from Frequency Profiles
Lei Lin ... Buzhou Tang
Journal of Computers | VOL. 6
02 Jan 2011

Application of latent semantic analysis to protein remote homology detection
Qi-Wen Dong ... Xiao-Long Wang
Bioinformatics | VOL. 22
29 Nov 2005

Protein Fold Recognition and Remote Homology Detection Based on Profile-Level Building Blocks
Lei Lin ... Yi Shen
01 Apr 2010

Remote protein homology detection and fold recognition using two-layer support vector machine classifiers
Hilmi M Muda ... Razib M Othman
Computers in Biology and Medicine | VOL. 41
25 Jun 2011
