A Protein Secondary Structure Predictor Based on a Continuously Updated Template Library

鹏杰 周

doi:10.12677/hjcb.2017.72002

Abstract

Protein secondary structure prediction is an important field of computational biology. Although the best existing machine learning methods exceed 80% prediction accuracy, they share a common limitation: they cannot promptly learn from newly determined protein structures, and they cannot continuously revise their models and parameters, so they fall short of what users need as structural data keep growing. Here we present SIPSS, a protein secondary structure predictor based on a continuously updated template library. The approach rests on the structural conservation of homologous sequences. First, we build a continuously updated template library: every month, newly determined protein structure data are downloaded automatically from the PDB and, after screening, the new sequence and structure information is added to the library. Then a query sequence is aligned against the template library by multiple homologous alignment with PSI-BLAST, which yields a new variable, the SPSSM. Finally, the SPSSM is used as the input variable of a conditional random field for modelling and prediction. Our tests show that SIPSS can learn new protein structure information online, and that its prediction accuracy on recently determined protein structures (80.6%) is clearly higher than that of existing predictors. SIPSS is freely available at http://cheminfo.tongji.edu.cn/SIPSS/.
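The alignment step described in the abstract can be illustrated with a small sketch. The abstract does not define how the SPSSM is computed, so everything below is an assumption made for illustration: we suppose that each alignment hit pairs query residues with template residues of known secondary structure, and that a per-position profile is formed by counting the aligned states (H, E, C) with a pseudocount. The function name `structure_profile`, the hit format, and the pseudocount are hypothetical, and the conditional random field step is not shown.

```python
from collections import Counter

STATES = "HEC"  # helix, strand, coil (assumed three-state alphabet)


def structure_profile(query_len, hits, pseudocount=1.0):
    """Build a per-position secondary-structure frequency profile.

    hits: list of (query_start, aligned_query, aligned_template_ss),
    where both aligned strings have equal length and use '-' for gaps.
    This hit format is hypothetical, not the SIPSS data format.
    """
    counts = [Counter() for _ in range(query_len)]
    for q_start, q_aln, ss_aln in hits:
        pos = q_start
        for q_char, ss_char in zip(q_aln, ss_aln):
            if q_char == "-":
                continue  # gap in the query: no query position is consumed
            if ss_char in STATES:
                counts[pos][ss_char] += 1  # template state observed here
            pos += 1  # advance along the query, even over template gaps

    profile = []
    for c in counts:
        total = sum(c[s] for s in STATES) + pseudocount * len(STATES)
        profile.append([(c[s] + pseudocount) / total for s in STATES])
    return profile


if __name__ == "__main__":
    # Two toy hits against a query of length 4.
    toy_hits = [(0, "AC-D", "HHCE"), (1, "CD", "HE")]
    for i, row in enumerate(structure_profile(4, toy_hits)):
        print(i, [round(p, 2) for p in row])
```

Under these assumptions, a position covered by no template falls back to a uniform distribution, and the library can grow without retraining the profile step, because the counts are recomputed from whatever templates are present at query time.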

Full Text