Abstract
Protein phosphorylation is one of the important processes in cell signaling pathways. A variety of protein kinase families are involved in this process, and each kinase family phosphorylates different kinds of substrate proteins. Many methods have been developed to predict kinase-specific phosphorylation sites or different types of phosphorylated residues (Serine/Threonine or Tyrosine). We employed support vector machines (SVMs) to predict protein kinase-specific phosphorylation sites. Ten protein kinase families (including PKA, PKC, CK2, CDK, CaM-KII, PKB, MAPK, and EGFR) were considered in this study. We defined the 9 residues around a phosphorylated residue as the instance from which a protein kinase determines whether to act on the site. The corresponding subsets of the PSI-BLAST profile were converted into numerical vectors representing positive or negative instances. Because the training sets were unbalanced, we trained multiple SVMs. Representative negative instances were drawn multiple times, and each draw was combined with the same positive instances from the original training set to form a new training set. At test time, the final decision was made by the votes of these multiple SVMs. The RBF kernel was generally used for the SVMs, and several parameters such as gamma and the cost factor were tested. Our approach achieved more than 90% specificity across the protein kinase families, with sensitivity averaging 60%.
Corresponding author: Dongsup Kim (Tel: +82-42-869-4317, Fax: +82-42-869-4310, Email: kds@kaist.ac.kr)
This work is supported by the CHUNG Moon Soul Center for BioInformation and BioElectronics (CMSC).
Introduction
After synthesis, proteins are commonly phosphorylated on specific residues such as Serine, Threonine, and Tyrosine. Phosphorylation plays crucial roles in a variety of cellular processes, including transcription, translation, the cell cycle, and signal transduction.
If potential phosphorylation sites and the protein kinases acting on them could be identified, our knowledge of these cellular processes would be greatly extended. Known phosphorylation sites can be divided into sites for which the acting protein kinase is known and sites for which no such information is available.
Much research has been done on the prediction of phosphorylation sites. Some studies focused on specific substrate residues (prediction for Serine/Threonine or Tyrosine), while others approached the problem in terms of protein kinase family or group (prediction for the sites catalyzed by PKA or CDK). Generally, local sequence patterns (consensus sequences or motifs) and profiles were used. Sequence patterns are derived by aligning the local regions of proteins that contain phosphorylated residues. In the profile method, a pre-compiled profile (or weight matrix) is compared with a target protein sequence, and a similarity score is derived. The profile is constructed by aligning only phosphorylated sequences. Scansite (Yaffe et al., 2001) used this profile approach and correctly predicted ~70% of known phosphorylation sites in PhosphoBase. Machine learning techniques have also been applied. NetPhos (Blom et al., 1999) uses artificial neural networks (ANNs) together with a consensus-motif-based approach. Its improved version, NetPhosK, can also perform PK-specific predictions. A support vector machine (SVM)-based method was developed and implemented in PredPhospho (Kim et al., 2004). PredPhospho predicts phosphorylation sites for 8 protein kinase families and groups, and performs well in both specificity and sensitivity. It optimizes the system by searching for the SVM parameters (such as gamma and the penalty parameter), kernel type, and window size that maximize performance.
Here, we also attempted PK-specific phosphorylation site prediction with SVMs.
We used a subset of the PSI-BLAST profile as features in order to include evolutionary information. In addition, decisions were made by the votes of multiple SVMs that were trained with different negative instances. This tends to give higher specificity than sensitivity, because across the multiple SVMs the system is exposed to many different negative instances during training.
Bioinformatics and Biosystems 2007, Vol. 2, No. 1, pp. 30-34
PK Family    Num. of Positives    Num. of Negatives    Ratio (positives:negatives)
PKA_group    254                  14622                1:58
PKC_group    249                  10584                1:43
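The voting scheme over multiple classifiers trained on different negative draws can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several things in it are assumptions: negatives are drawn uniformly at random (the text does not say how "representative" negatives were chosen), each draw has the same size as the positive set, and the final call is a simple majority. To keep the example dependency-free, a hypothetical nearest-centroid classifier stands in for each RBF-kernel SVM, and toy 2-D vectors stand in for the PSI-BLAST profile windows. All function and class names are illustrative.

```python
import random


def draw_balanced_sets(positives, negatives, n_sets, seed=0):
    """Pair the full positive set with n_sets independent random draws of
    equally many negatives (assumption: uniform sampling, balanced 1:1)."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return [(list(positives), rng.sample(negatives, k)) for _ in range(n_sets)]


class CentroidClassifier:
    """Hypothetical stand-in for one RBF-kernel SVM: predicts the class
    whose training centroid is closer to the query vector."""

    @staticmethod
    def _mean(vectors):
        n = len(vectors)
        return [sum(col) / n for col in zip(*vectors)]

    @staticmethod
    def _sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def fit(self, pos, neg):
        self.pos_centroid = self._mean(pos)
        self.neg_centroid = self._mean(neg)
        return self

    def predict(self, x):
        closer_to_pos = (self._sqdist(x, self.pos_centroid)
                         < self._sqdist(x, self.neg_centroid))
        return 1 if closer_to_pos else 0


def train_ensemble(positives, negatives, n_sets=5, seed=0):
    """Train one classifier per balanced training set."""
    sets = draw_balanced_sets(positives, negatives, n_sets, seed)
    return [CentroidClassifier().fit(pos, neg) for pos, neg in sets]


def majority_vote(models, x):
    """Return 1 only if more than half of the classifiers vote positive
    (assumption: simple majority; the threshold used in the paper is not stated)."""
    votes = sum(m.predict(x) for m in models)
    return 1 if 2 * votes > len(models) else 0


# Toy data: 3 positives, 12 negatives (an unbalanced 1:4 set).
positives = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
negatives = [[0.05 * i, -0.05 * i] for i in range(12)]

models = train_ensemble(positives, negatives, n_sets=5)
print(majority_vote(models, [0.95, 1.0]))
print(majority_vote(models, [0.1, -0.1]))
```

Because every classifier sees all positives but only a small, different slice of the negatives, no single model is swamped by the majority class, while the ensemble as a whole is still exposed to a large share of the negative instances.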