Classification of protein quaternary structure with support vector machine.

Shao-Wu Zhang, Hong-Cai Zhang, Hai-Yu Wang, Quan Pan, Yun-Long Zhang

doi:10.1093/bioinformatics/btg331

Abstract

As the gap between the rapidly growing number of known sequences and the slowly accumulating number of known structures widens, automatic classification based on primary sequences and known three-dimensional structures becomes indispensable. Classifying protein quaternary structure from the primary sequence can provide useful information to biologists, so a fully automatic and reliable classification system is needed. This work seeks effective methods for extracting attributes from primary sequences, and algorithms for classifying quaternary structure from those attributes. Both the support vector machine (SVM) and the covariant discriminant algorithm are introduced here for the first time to predict quaternary structure properties from protein primary sequences. The algorithms take into account the amino acid composition and the auto-correlation functions based on the amino acid index profile of the primary sequence. We analyzed 472 amino acid indices and selected, as examples, the four with the best performance. Five attribute parameter data sets (COMP, FASG, NISK, WOLS and KYTJ) were thus established from the protein primary sequences. The COMP data set consists of the amino acid composition alone, whereas the FASG, NISK, WOLS and KYTJ data sets consist of the amino acid composition together with the auto-correlation functions of the corresponding amino acid index. In the jackknife test, the overall accuracies of SVM are 78.5, 87.5, 83.2, 81.7 and 81.9% for the COMP, FASG, NISK, WOLS and KYTJ data sets, respectively, which are 19.6, 7.8, 15.5, 13.1 and 15.8% higher, respectively, than those of the covariant discriminant algorithm in the same test. The results show that SVM may be applied to discriminate between the primary sequences of homodimers and non-homodimers, and that the two protein sequence descriptors can reflect quaternary structure information.
Compared with Robert Garian's previous investigation, the performance of SVM is almost equal to that of the decision tree models, and the methods of extracting feature vectors from the primary sequences are superior to Garian's binning function method. Programs are available on request from the authors.
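The two sequence descriptors named in the abstract, amino acid composition and auto-correlation functions of an amino acid index profile, can be illustrated with a minimal sketch. It is not the authors' program. Several points are assumptions: the auto-correlation is taken in one common form, r_n = (1/(L-n)) * sum of h_i * h_(i+n) over the sequence; the Kyte-Doolittle hydropathy scale stands in for the KYTJ index; and the function names and the `max_lag` parameter are hypothetical. The abstract does not state the exact formula, the index normalization, or the number of lags used.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order for the composition vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used here as an illustrative amino acid
# index (assumed stand-in for the KYTJ index named in the abstract).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return the 20 amino acid frequencies of a sequence."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def autocorrelation(seq, index, max_lag):
    """Return auto-correlation values r_1..r_max_lag of the index profile.

    Assumed form: r_n = (1 / (L - n)) * sum_{i=1}^{L-n} h_i * h_{i+n},
    where h_i is the index value of the residue at position i.
    """
    if max_lag >= len(seq):
        raise ValueError("max_lag must be smaller than the sequence length")
    profile = [index[aa] for aa in seq]
    length = len(profile)
    return [
        sum(profile[i] * profile[i + lag] for i in range(length - lag))
        / (length - lag)
        for lag in range(1, max_lag + 1)
    ]


def feature_vector(seq, index=KYTE_DOOLITTLE, max_lag=5):
    """Concatenate composition and auto-correlation into one feature vector."""
    return composition(seq) + autocorrelation(seq, index, max_lag)
```

Calling `feature_vector("MKVLAAGIVGLLLAQ")` returns a list of 25 numbers: 20 composition values followed by 5 auto-correlation values. Vectors of this kind, one per protein, would then be passed to a classifier such as an SVM. A COMP-style vector corresponds to `composition` alone.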
