Towards simultaneous clustering and motif-modeling for a large number of protein family

Young Joon Yoo, Tushar Sandhan, Sun Kim, Jinyoung Choi

doi:10.1109/bibm.2013.6732605

Abstract

In this paper, we propose a novel clustering and motif-modeling framework for analyzing a large number of protein families using k-mers. Our approach utilizes both the occurrence frequency and the position information of k-mers, which are essential for classification yet were not fully used in previous methods. We found that the resulting classification structure is closely related to the motifs of each protein family and hence describes the important biological features, or motifs, of each family well. The classification/clustering procedure is executed in an incremental manner, which was difficult for previous algorithms, and is modeled using a bipartite model. Furthermore, the method can be implemented efficiently using parallel computing and hashing. Experimental results on the entire COG family database show that our approach can model a large number of protein families without sacrificing accuracy. In addition, the classification structure, that is, the path through the graph for protein sequences, explains the characteristic subsequences or motifs of each family quite well. Thus, the proposed method has the potential to model both protein families and motifs, even for a large number of families.
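
As a rough illustration of the two k-mer features named in the abstract, occurrence frequency and position, the following minimal Python sketch builds a hash-based index for a single sequence. It is not the authors' implementation: the function names `kmer_index` and `kmer_frequencies`, the choice of k = 3, and the toy sequence are all assumptions made for this example, and the bipartite model and incremental clustering are not shown.

```python
from collections import defaultdict


def kmer_index(sequence, k):
    """Map each k-mer to the list of 0-based start positions where it occurs.

    Hypothetical helper for illustration; a Python dict serves as the hash table.
    """
    index = defaultdict(list)
    # Slide a window of width k over the sequence; no window fits if len < k.
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


def kmer_frequencies(index):
    """Derive the occurrence count of each k-mer from its position list."""
    return {kmer: len(positions) for kmer, positions in index.items()}


# Toy amino-acid string (assumed, not taken from the COG database).
toy_sequence = "MKVLMKVA"
index = kmer_index(toy_sequence, 3)
freqs = kmer_frequencies(index)
```

Here `index["MKV"]` holds the two start positions of the repeated 3-mer and `freqs["MKV"]` its count. Keeping positions and counts in one hash lookup is only one possible design; each sequence can be indexed independently, which is one way such a step could be parallelized.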

Full Text