Abstract

Conserved amino acids, which can be discovered as patterns across or along sequences, reveal functional domains within proteins. Conversely, less conserved amino acid sequences reveal areas of evolutionary divergence. Traditional protein classification learns patterns from predefined class labels (i.e., information about the input sequences, such as gene name or family) in order to predict the class of novel sequences. However, these supervised algorithms may be inherently biased by their dependence on class labels. We have therefore created an unsupervised algorithm that is not affected by errors or class-balance biases in the class labels. Our algorithm first discovers statistically significant sequence patterns, then aligns and clusters them into Aligned Pattern Clusters (APCs), which represent conserved amino acid sequences. APCs reveal sequence patterns (horizontal regions of amino acid homology), regions of conservation (vertical regions of amino acid homology), and regions of divergence (areas of vertical amino acid variation) within families of proteins. Finally, the algorithm verifies the results using two measures, class entropy and class information gain, both of which incorporate the class labels. The advantage of our method is that it does not require any a priori knowledge of a protein's structure or function. We applied our unsupervised algorithm to the class A Scavenger Receptor (cA-SR) protein family, which consists of two distinct but related proteins, MARCO and SRAI. Using MARCO and SRAI as the class labels, we then applied the two class measures. Class entropy revealed patterns and amino acids that are conserved between sequences from all classes. Class information gain indicated which of these amino acids are distinct to the MARCO class or to the SRAI class, which allowed us to make important predictions about the differing biological functions of these proteins.
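The abstract names the two verification measures but does not give their formulas. As a rough illustration only, the sketch below assumes the standard Shannon entropy and the standard information gain (class entropy minus class entropy conditioned on the residue) computed for a single aligned column; the function names `entropy` and `information_gain` and the four-sequence toy columns are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels (assumed definition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # p * log2(1/p) summed over the distinct labels
    return sum((c / n) * log2(n / c) for c in counts.values())


def information_gain(residues, classes):
    """Reduction in class entropy once the residue in one aligned column is known."""
    n = len(residues)
    conditional = 0.0
    for aa in set(residues):
        # class labels of the sequences that carry this residue in the column
        subset = [c for r, c in zip(residues, classes) if r == aa]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(classes) - conditional


# Toy data (made up for illustration): one residue per sequence for each column.
classes = ["MARCO", "MARCO", "SRAI", "SRAI"]
conserved_column = ["C", "C", "C", "C"]   # same residue in every class
class_specific = ["R", "R", "K", "K"]     # residue tracks the class label
uninformative = ["R", "K", "R", "K"]      # varies, but not by class

for name, column in [("conserved", conserved_column),
                     ("class-specific", class_specific),
                     ("uninformative", uninformative)]:
    print(name, entropy(column), information_gain(column, classes))
```

Under these assumed definitions, a column with zero residue entropy is conserved across all classes, while a column with high information gain separates the MARCO sequences from the SRAI sequences. A variable column whose residues do not follow the class labels has high residue entropy but zero gain.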
