Abstract

Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods based on the Support Vector Machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. In this paper, we focus on feature extraction and efficient representation of protein vectors for SVM protein classification. The experiment uses the protein database from Structural Classification of Proteins (SCOP) version 1.53 together with a latent topic extraction technique, the Latent Dirichlet Allocation (LDA) model, which is an efficient feature extraction technique from natural language processing. The basic building blocks of our model are word documents generated from protein sequences by N-gram segmentation and filtered by the TF-IDF method. The LDA phase is then applied to these documents for latent topic extraction, while the SVM acts as a classifier over the latent topics. In our experiment, the LDA-SVM model outperforms the LSA-SVM model from previous research.
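The document-building step described above can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names `ngram_words` and `tfidf_filter`, the window length `n=3`, the `keep` cutoff, and the toy sequences are all assumptions made for illustration, since the abstract does not state the N-gram length or the TF-IDF threshold. The LDA and SVM stages are left out.

```python
import math
from collections import Counter


def ngram_words(sequence, n=3):
    """Split a protein sequence into overlapping n-gram 'words' (assumed n=3)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def tfidf_filter(documents, keep=4):
    """Keep, per document, only the `keep` distinct words with the highest TF-IDF."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for doc in documents for word in set(doc))
    filtered = []
    for doc in documents:
        tf = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        top = set(sorted(scores, key=lambda w: (-scores[w], w))[:keep])
        filtered.append([word for word in doc if word in top])
    return filtered


# Toy sequences invented for illustration; these are not SCOP entries.
sequences = ["MKVLAAGIVG", "MKVLSAGQRT", "GSHMKVLAAG"]
documents = [ngram_words(seq, n=3) for seq in sequences]
filtered = tfidf_filter(documents, keep=4)
```

Words that occur in every document, such as `MKV` in the toy data, receive a TF-IDF score of zero and are dropped first, so the filtered documents that would be passed to a topic model contain only the more discriminative N-grams.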
