Abstract

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
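To make the hashing step concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the function name hash_features, the choice of MD5 as the hash function, and the toy 3-gram counts are illustrative assumptions (a practical implementation may also use a signed hash to reduce collision bias).

```python
import hashlib
from collections import Counter

def hash_features(kgram_counts, hash_bits=16):
    """Hash a bag of k-gram counts into a 2**hash_bits-dimensional vector.

    Multiple k-grams may collide on the same index; their counts are
    simply aggregated, as in the feature-hashing ("hashing trick") idea.
    """
    m = 1 << hash_bits                      # hash size (the reduced dimension)
    vec = [0] * m
    for kgram, count in kgram_counts.items():
        # Stable hash of the k-gram string, mapped into [0, m)
        idx = int(hashlib.md5(kgram.encode("utf-8")).hexdigest(), 16) % m
        vec[idx] += count
    return vec

# Toy bag of 3-grams with their counts
bag = Counter({"MKT": 2, "KTA": 1, "TAY": 1})
hashed = hash_features(bag, hash_bits=8)    # 256-dimensional hashed representation
```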

Highlights

  • Many problems in computational biology, e.g., protein function prediction, subcellular localization prediction, etc., can be formulated as sequence classification tasks [1], where the amino acid sequence of a protein is used to classify the protein into functional and localization classes. Protein sequence data contain intrinsic dependencies between their constituent elements

  • We study the applicability of feature hashing to protein sequence classification and address the following main questions: (i) How effective is feature hashing on prohibitively high-dimensional k-gram representations?; (ii) What is the influence of the hash size (i.e., the reduced dimension) on the performance of protein sequence classifiers that use hashed features, and what is the hash size at which the performance starts degrading due to hash collisions?; and (iii) How does the performance of feature hashing compare to that of the “bag of k-grams” approach? The results of our experiments on three protein subcellular localization data sets show that feature hashing is effective at reducing dimensionality on protein sequence classification tasks

  • The performance of Support Vector Machines (SVMs) trained on fixed-length k-gram representations is expected to be worse than that of their counterparts trained on variable-length k-gram representations, as protein sequence motifs usually have variable length (a pipeline for training an SVM on hashed fixed-length k-grams is sketched below)
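As a usage illustration only, and not the experimental setup of the paper, the sketch below shows how a linear SVM could be trained on hashed fixed-length k-gram features using scikit-learn's HashingVectorizer; the toy sequences, labels, and parameter choices (k = 3, 2^16 hash size) are assumptions made for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# Fixed-length k-grams (k = 3) hashed into a 2**16-dimensional space.
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(3, 3),
                               n_features=2 ** 16, lowercase=False,
                               alternate_sign=False)

# Toy protein sequences and subcellular-localization labels (illustrative only).
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
             "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI"]
labels = ["cytoplasmic", "extracellular"]

X = vectorizer.transform(sequences)     # sparse hashed k-gram features
clf = LinearSVC().fit(X, labels)
prediction = clf.predict(vectorizer.transform(["MKTAYIAKQR"]))
```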


Introduction

Many problems in computational biology, e.g., protein function prediction, subcellular localization prediction, etc., can be formulated as sequence classification tasks [1], where the amino acid sequence of a protein is used to classify the protein into functional and localization classes. Protein sequence data contain intrinsic dependencies between their constituent elements. Given a protein sequence x = x_0, ..., x_{n-1} over the amino acid alphabet, the dependencies between neighboring elements can be modeled by generating all the contiguous (potentially overlapping) sub-sequences of a certain length k, x_{i-k}, ..., x_{i-1}, for i = k, ..., n, called k-grams, or sequence motifs.

A less expensive approach to dimensionality reduction is feature selection [5,6], which reduces the number of features by selecting a subset of the available features based on some chosen criteria. A new approach to dimensionality reduction, called feature hashing (or random clustering), has been introduced for text classification [8,9,10,11]. Feature hashing offers a very inexpensive, yet effective, approach to reducing the number of features provided as input to a learning algorithm, by allowing random collisions into the latent factors. While it is very effective for reducing the number of features from very high dimensions (e.g., 2^22) to midsize dimensions (e.g., 2^16), feature hashing can result in significant loss of information, especially when hash collisions occur between highly frequent features with significantly different class distributions
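For illustration, here is a minimal sketch of extracting the overlapping k-grams defined above; the helper name kgrams and the toy sequence are assumptions, not code from the paper.

```python
from collections import Counter

def kgrams(sequence, k):
    """Bag of all contiguous, overlapping k-grams x_{i-k}, ..., x_{i-1}, i = k, ..., n."""
    return Counter(sequence[i - k:i] for i in range(k, len(sequence) + 1))

bag = kgrams("MKTAYIAKQR", k=3)
# Over the 20-letter amino acid alphabet there are 20**k possible k-grams
# (20**3 = 8,000; 20**5 = 3,200,000), which is what makes hashing them into a
# much smaller space of 2**b dimensions attractive.
```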

