Abstract

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
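To make the hashing step concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the function name hash_features, the choice of MD5 as the hash function, and the toy 3-gram counts are illustrative assumptions (a practical implementation may also use a signed hash to reduce collision bias).

```python
import hashlib
from collections import Counter

def hash_features(kgram_counts, hash_bits=16):
    """Hash a bag of k-gram counts into a 2**hash_bits-dimensional vector.

    Multiple k-grams may collide on the same index; their counts are
    simply aggregated, as in the feature-hashing ("hashing trick") idea.
    """
    m = 1 << hash_bits                      # hash size (the reduced dimension)
    vec = [0] * m
    for kgram, count in kgram_counts.items():
        # Stable hash of the k-gram string, mapped into [0, m)
        idx = int(hashlib.md5(kgram.encode("utf-8")).hexdigest(), 16) % m
        vec[idx] += count
    return vec

# Toy bag of 3-grams with their counts
bag = Counter({"MKT": 2, "KTA": 1, "TAY": 1})
hashed = hash_features(bag, hash_bits=8)    # 256-dimensional hashed representation
```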

Highlights

  • Many problems in computational biology, e.g., protein function prediction, subcellular localization prediction, etc., can be formulated as sequence classification tasks [1], where the amino acid sequence of a protein is used to classify the protein into functional and localization classes. Protein sequence data contain intrinsic dependencies between their constituent elements

  • We study the applicability of feature hashing to protein sequence classification and address the following main questions: (i) How effective is feature hashing on prohibitively high-dimensional k-gram representations?; (ii) What is the influence of the hash size (i.e., the reduced dimension) on the performance of protein sequence classifiers that use hashed features, and what is the hash size at which the performance starts degrading due to hash collisions?; and (iii) How does the performance of feature hashing compare to that of the “bag of k-grams” approach? The results of our experiments on three protein subcellular localization data sets show that feature hashing is effective at reducing dimensionality on protein sequence classification tasks

  • The performance of Support Vector Machines (SVMs) trained on fixed-length k-gram representations is expected to be worse than that of their counterparts trained on variable-length k-gram representations, as protein sequence motifs usually have variable length (a pipeline for training an SVM on hashed fixed-length k-grams is sketched below)
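As a usage illustration only, and not the experimental setup of the paper, the sketch below shows how a linear SVM could be trained on hashed fixed-length k-gram features using scikit-learn's HashingVectorizer; the toy sequences, labels, and parameter choices (k = 3, 2^16 hash size) are assumptions made for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# Fixed-length k-grams (k = 3) hashed into a 2**16-dimensional space.
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(3, 3),
                               n_features=2 ** 16, lowercase=False,
                               alternate_sign=False)

# Toy protein sequences and subcellular-localization labels (illustrative only).
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
             "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI"]
labels = ["cytoplasmic", "extracellular"]

X = vectorizer.transform(sequences)     # sparse hashed k-gram features
clf = LinearSVC().fit(X, labels)
prediction = clf.predict(vectorizer.transform(["MKTAYIAKQR"]))
```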


Introduction

Many problems in computational biology, e.g., protein function prediction, subcellular localization prediction, etc., can be formulated as sequence classification tasks [1], where the amino acid sequence of a protein is used to classify the protein into functional and localization classes. Protein sequence data contain intrinsic dependencies between their constituent elements. Given a protein sequence x = x_0, ..., x_{n-1} over the amino acid alphabet, the dependencies between neighboring elements can be modeled by generating all the contiguous (potentially overlapping) sub-sequences of a certain length k, x_{i-k}, ..., x_{i-1}, for i = k, ..., n, called k-grams, or sequence motifs.

A less expensive approach to dimensionality reduction is feature selection [5,6], which reduces the number of features by selecting a subset of the available features based on some chosen criteria. A new approach to dimensionality reduction, called feature hashing (or random clustering), has been introduced for text classification [8,9,10,11]. Feature hashing offers a very inexpensive, yet effective, approach to reducing the number of features provided as input to a learning algorithm, by allowing random collisions into the latent factors. While it is very effective for reducing the number of features from very high dimensions (e.g., 2^22) to midsize dimensions (e.g., 2^16), feature hashing can result in significant loss of information, especially when hash collisions occur between highly frequent features with significantly different class distributions
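For illustration, here is a minimal sketch of extracting the overlapping k-grams defined above; the helper name kgrams and the toy sequence are assumptions, not code from the paper.

```python
from collections import Counter

def kgrams(sequence, k):
    """Bag of all contiguous, overlapping k-grams x_{i-k}, ..., x_{i-1}, i = k, ..., n."""
    return Counter(sequence[i - k:i] for i in range(k, len(sequence) + 1))

bag = kgrams("MKTAYIAKQR", k=3)
# Over the 20-letter amino acid alphabet there are 20**k possible k-grams
# (20**3 = 8,000; 20**5 = 3,200,000), which is what makes hashing them into a
# much smaller space of 2**b dimensions attractive.
```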

