The Parallel Maximal Cliques Algorithm for Protein Sequence Clustering

Nur'Aini Abdul Ras, Khalid Jaber, Rosni Abdullah

doi:10.3844/ajassp.2009.1368.1372

Abstract

Problem statement: Protein sequence clustering is a method for discovering relations between proteins by grouping them according to their common features. It is a core process in protein sequence classification. Graph theory has been used in protein sequence clustering as a means of partitioning the data into groups, where each group constitutes a cluster. Mohseni-Zadeh introduced a maximal cliques algorithm for protein clustering. Approach: In this study we adapted the maximal cliques algorithm of Mohseni-Zadeh to find cliques in protein sequences, and we then parallelized the algorithm to improve computation times and allow large protein databases to be processed. We used the N-Gram Hirschberg approach proposed by Abdul Rashid to calculate the distance between protein sequences. The task-farming parallel programming model was used to parallelize the enhanced cliques algorithm. Results: Our parallel maximal cliques algorithm was implemented on the Stealth cluster using the C programming language and a hybrid approach that combines the Message Passing Interface (MPI) library and POSIX threads (PThread) to accelerate protein sequence clustering. Conclusion: Our results showed a good speedup over the sequential algorithm for finding cliques in protein sequences.

Highlights

One of the basic applications of protein sequence comparison is in protein sequence clustering
We extend this study to find multiple maximal cliques, and we apply the Parallel Maximal Cliques Algorithm (PMCA) to protein sequences taken from various protein databases
Experimental environment: The Parallel Maximal Cliques Algorithm is implemented on the Stealth cluster using the C programming language and a hybrid of the Message Passing Interface (MPI) library and POSIX threads (PThread)

Summary

Introduction

One of the basic applications of protein sequence comparison is protein sequence clustering, which is itself an element of protein sequence analysis. Protein sequence clustering has two basic steps: calculating the distances among the protein sequences, and grouping the sequences into sets of similar sequences based on these distances. We used a clustering algorithm based on maximal cliques proposed by Mohseni-Zadeh et al.[1]. Relationships between protein sequences can readily be shown on a graph. Nodes or vertices in the graph represent protein sequences, while each edge represents a relation between two vertices. Maximal cliques are used to find clusters in such a protein sequence graph, and we adapted the algorithm to find cliques of different sizes. A subset of vertices forms a clique when every vertex in it is connected to every other vertex in it, so the out-degree of each vertex within the subset is (n-1), where n is the number of vertices in the subset.
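The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' C/MPI implementation, and it does not reproduce the algorithm of Mohseni-Zadeh et al.: clique enumeration here uses the textbook Bron-Kerbosch procedure as a stand-in. The distance matrix is toy data rather than N-Gram Hirschberg output, and the distance threshold used to decide when two sequences are related, along with the function names, is an assumption made for illustration.

```python
from itertools import combinations


def build_graph(distances, threshold):
    """Connect two sequences when their distance is at most `threshold`.

    The threshold rule is an illustrative assumption, not taken from the paper.
    """
    n = len(distances)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if distances[i][j] <= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def is_clique(adj, subset):
    """True when every vertex in `subset` has (n - 1) neighbours inside it."""
    members = set(subset)
    n = len(members)
    return all(len(adj[v] & members) == n - 1 for v in members)


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already explored
        if not p and not x:
            found.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(found)


# Toy symmetric distance matrix for five hypothetical sequences.
DIST = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.3, 0.9, 0.9],
    [0.2, 0.3, 0.0, 0.4, 0.9],
    [0.9, 0.9, 0.4, 0.0, 0.2],
    [0.8, 0.9, 0.9, 0.2, 0.0],
]

graph = build_graph(DIST, threshold=0.5)
cliques = maximal_cliques(graph)
print(cliques)  # [[0, 1, 2], [2, 3], [3, 4]]
```

Each reported clique is one candidate cluster. Sequence 2 appears in two cliques, which shows that maximal cliques may overlap, so a clustering method built on them needs some rule for assigning shared vertices.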

Methods

Results

Conclusion