Abstract

In protein sequence classification, a variable-length protein sequence is commonly converted into a fixed-length numerical vector using descriptors such as k-mer composition. Such position-independent descriptors are useful because they apply to sequences of any length; however, they discard the positional information of subsequences, even though that information may contribute substantially to classification performance. To address this problem, we divided the original sequence into several segments and calculated the numerical features for each segment. This partially introduces positional information (for instance, the composition of serine in the anterior and posterior segments of a sequence). Through comprehensive experiments on the number of segments and the length of the overlapping region, we found that our classification approach, which combines sequence segmentation with feature selection, is effective in improving performance. We evaluated our approach on three protein classification problems and achieved significant improvement in every case where the sequences in the dataset contained a sufficient number of amino acids. This result shows the potential of using additional segments in protein sequence classification, and suggests the approach may help with other sequence problems in bioinformatics.
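The segmentation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equal-width boundaries, and the way `overlap` extends each segment into its neighbours are all assumptions made for illustration, since the abstract does not specify the exact scheme. The feature selection step is not shown.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; other symbols are ignored in the feature list.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_segments(seq, n_segments, overlap=0):
    """Split seq into n_segments near-equal parts.

    Assumed scheme (hypothetical): each part is extended by `overlap`
    residues on both sides, clipped at the ends of the sequence.
    """
    if n_segments < 1 or overlap < 0:
        raise ValueError("n_segments must be >= 1 and overlap >= 0")
    n = len(seq)
    segments = []
    for i in range(n_segments):
        start = max(0, i * n // n_segments - overlap)
        end = min(n, (i + 1) * n // n_segments + overlap)
        segments.append(seq[start:end])
    return segments


def kmer_composition(segment, k=1):
    """Return the relative frequency of every possible k-mer in segment."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    windows = [segment[i:i + k] for i in range(len(segment) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    # A segment shorter than k has no windows and yields an all-zero vector.
    return [counts[m] / total if total else 0.0 for m in kmers]


def segmented_composition(seq, n_segments, k=1, overlap=0):
    """Concatenate the k-mer composition of each segment into one vector."""
    features = []
    for seg in split_segments(seq, n_segments, overlap):
        features.extend(kmer_composition(seg, k))
    return features


if __name__ == "__main__":
    toy = "SSSSAAAA"  # toy sequence, not real data
    vec = segmented_composition(toy, n_segments=2, k=1)
    print(len(vec))  # one block of 20 values per segment
```

With two segments and `k=1`, the vector holds one serine fraction for the anterior half and another for the posterior half, which is the kind of positional signal a whole-sequence composition cannot express. The vector length is `n_segments * 20**k` regardless of sequence length, so the descriptor remains applicable to sequences of any length.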

Highlights

  • The protein sequence is an essential asset in protein classification research

  • To validate our proposed approach, which improves existing alignment-free protein descriptors for protein sequence classification, we ran experiments on datasets from UniProt, Swiss-Prot, and NucleaRDB

  • Bhasin and Gajendra [2] classified nuclear receptors with a support vector machine (SVM) on the basis of the amino acid composition and dipeptide composition of each sequence. They trained and tested on a non-redundant dataset of 282 proteins obtained from the NucleaRDB database

Summary

Introduction

Converting a protein sequence into a numerical feature vector is called feature extraction. It is a critical step, because the choice of an effective and appropriate feature extraction method profoundly affects classification performance. Xiao et al. [1] grouped the commonly used descriptors into eight groups: Amino Acid Composition, Autocorrelation, CTD, Conjoint Triad, Quasi-Sequence-Order, Pseudo-Amino Acid Composition, Proteochemometric descriptors, and PSSM. Together, these groups contain 22 descriptor types that have been actively used in research.
