The rapid increase in available protein structure datasets requires new techniques for fast, yet, effective analysis of protein 3D structures. In this work, we propose a structure-based signature for protein families, suitable for rapid analysis of multidomain protein structures. Our method is alignment-free, using protein strings as the basic representation. A key novelty is the two-stage approach, whereby an initial list of candidate protein superfamilies are rapidly identified using the protein family signature, and then information retrieval methods are applied only to the members of the candidate superfamilies. This approach is the key to both improved speed, and improved structure retrieval accuracy. Experimental results, including comparative results with state-of-the-art methods, demonstrate the performance of the proposed protein family signature on queries with multidomain protein structures.
Read full abstract