Abstract

In this paper, we propose a novel clustering and motif modeling framework for analyzing a large number of protein families using k-mers. Our approach uses both the occurrence frequency and the position information of k-mers, which are essential for classification yet were not fully exploited by previous methods. We found that the resulting classification structure is closely related to the motifs of each protein family and therefore describes the important biological features, or motifs, of each family well. The classification and clustering procedure is executed incrementally, which was difficult for previous algorithms, and is modeled using a bipartite model. Furthermore, the method can be implemented efficiently using parallel computing and hashing. Experimental results on the entire COG family database show that our model can represent a large number of protein families without sacrificing accuracy. In addition, the classification structure, that is, the paths through the graph for protein sequences, explains the characteristic subsequences or motifs of each family quite well. Thus, the proposed method has the potential to model both protein families and motifs, even for a large number of families.
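
As a purely illustrative aid, the sketch below shows one simple way to record both the frequency and the positions of k-mers in a protein sequence using a hash table. It is not the authors' implementation: the function names `kmer_index` and `kmer_frequencies`, the toy sequence, and the choice of k = 3 are all assumptions made for this example, and the bipartite model and incremental clustering described above are not covered.

```python
from collections import defaultdict


def kmer_index(sequence, k):
    """Map each k-mer to the list of 0-based start positions where it occurs.

    Hypothetical helper for illustration; a Python dict serves as the hash table.
    """
    index = defaultdict(list)
    # Slide a window of width k over the sequence, one residue at a time.
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


def kmer_frequencies(index):
    """Derive occurrence counts from a position index built by kmer_index."""
    return {kmer: len(positions) for kmer, positions in index.items()}


# Toy sequence (an assumption, not data from the paper).
example = kmer_index("MKVLMKV", 3)
```

Here `example["MKV"]` holds the two start positions of the repeated 3-mer, and `kmer_frequencies(example)` gives its count, so a single pass over the sequence yields both kinds of information the abstract refers to.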
