Abstract
Gapped k-mers frequency vectors (gkm-fv) has been presented for extracting sequence features. Coupled with support vector machine (gkm-SVM), gkm-fvs have been used to achieve effective sequence-based predictions. However, the huge computation of a large kernel matrix prevents it from using large amount of data. It is unclear how to combine gkm-fvs with other data sources in the context of string kernel. On the other hand, the high dimensionality, colinearity, and sparsity of gkm-fvs hinder the use of many traditional machine learning methods without a kernel trick. Therefore, we proposed a flexible and scalable framework gkm-DNN to achieve feature representation from high-dimensional gkm-fvs using deep neural networks (DNN). We first proposed a more concise version of gkm-fvs, which significantly reduce the dimension of gkm-fvs. Then, we implemented an efficient method to calculate the gkm-fv of a given sequence at the first time. Finally, we adopted a DNN model with gkm-fvs as inputs to achieve efficient feature representation and a prediction task. Here, we took the transcription factor binding site prediction as an illustrative application and applied gkm-DNN onto 467 small and 69 big human ENCODE ChIP-seq datasets to demonstrate its performance and compared it with the state-of-the-art method gkm-SVM.
Submitted Version (Free)
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have