Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks.

Xiaogen Zhou, Wei Zheng, Yang Li, Yang Zhang, Rachel Kolodny, Chengxin Zhang, Dong-Jun Yu, Eric W Bell

doi:10.1371/journal.pcbi.1008865

Abstract

The topology of protein folds can be specified by inter-residue contact-maps, and accurate contact-map prediction can help ab initio structure folding. We developed TripletRes to deduce protein contact-maps from discretized distance profiles by end-to-end training of deep residual neural networks. Compared to previous approaches, the major advantage of TripletRes is its ability to learn from and directly fuse a triplet of coevolutionary matrices extracted from whole-genome and metagenome databases, thereby minimizing the information loss during contact model training. TripletRes was tested on a large set of 245 non-homologous proteins from the CASP 11&12 and CAMEO experiments and outperformed other top methods from CASP12 by at least 58.4% for the CASP 11&12 targets and 44.4% for the CAMEO targets in top-L long-range contact precision. On the 31 FM targets from the latest CASP13 challenge, TripletRes achieved the highest precision (71.6%) for the top-L/5 long-range contact predictions. It was also shown that a simple re-training of the TripletRes model with more proteins can lead to further improvement, with precisions comparable to state-of-the-art methods developed after CASP13. These results demonstrate a novel and efficient approach to extending the power of deep convolutional networks for high-accuracy medium- and long-range protein contact-map prediction starting from primary sequences, which is critical for constructing 3D structures of proteins that lack homologous templates in the PDB library.
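The top-L and top-L/5 long-range precisions quoted above can be illustrated with a minimal sketch. It assumes the common CASP convention that a long-range pair has sequence separation |i - j| >= 24 and that the native contact map is supplied as a boolean L x L matrix; the function name and parameters are hypothetical and are not taken from the TripletRes code.

```python
import numpy as np

def top_k_long_range_precision(pred, true_contacts, k_divisor=1, min_sep=24):
    """Precision of the top L/k_divisor predicted long-range contacts.

    pred          : (L, L) array of predicted contact probabilities
    true_contacts : (L, L) boolean array of native contacts
    k_divisor     : 1 for top-L, 5 for top-L/5 (assumed convention)
    min_sep       : minimum sequence separation for "long-range" (assumed 24)
    """
    pred = np.asarray(pred, dtype=float)
    true_contacts = np.asarray(true_contacts, dtype=bool)
    L = pred.shape[0]

    # Upper-triangle residue pairs (i, j) with j - i >= min_sep.
    i, j = np.triu_indices(L, k=min_sep)
    if len(i) == 0:
        return float("nan")  # chain too short to have long-range pairs

    # Keep the L/k_divisor highest-scoring pairs (at least one).
    n_top = min(max(1, L // k_divisor), len(i))
    order = np.argsort(-pred[i, j], kind="stable")[:n_top]

    # Fraction of the selected pairs that are native contacts.
    return float(true_contacts[i[order], j[order]].mean())
```

Calling the function with `k_divisor=5` gives the top-L/5 variant; short- and medium-range pairs are ignored regardless of their predicted score.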

Highlights

Protein structure prediction represents an important unsolved problem in computational biology, with the major challenge being distant-homology modeling [1,2].
We proposed a new deep learning architecture, TripletRes, built on a residual neural network protocol [29], to integrate a triplet of coevolutionary matrix features (the pseudolikelihood maximization of the Potts model, the precision matrix and the covariance matrix) for high-accuracy contact-map prediction (Fig 1).
This work presented a new deep learning method for high-accuracy contact prediction that learns from raw coevolutionary features extracted from deep multiple sequence alignments.

Summary

Introduction

Protein structure prediction represents an important unsolved problem in computational biology, with the major challenge being distant-homology modeling (or ab initio structure prediction) [1,2]. The idea of using sequence-based contact-map prediction to assist ab initio protein structure prediction is not new and can be traced back at least 25 years [7,8]. The methods for sequence-based protein contact-map prediction can be classified into two categories: coevolution analysis methods (CAMs) and machine learning methods (MLMs). Direct coupling analysis (DCA) models demonstrated a significant advantage over the local approaches and essentially re-stimulated the interest of the protein structure prediction field in contact-map prediction. The success of most DCA methods [11,12,13,14,15,16] is nevertheless limited for proteins with few sequence homologs, because a shallow multiple sequence alignment (MSA) significantly reduces the accuracy with which DCA derives the inherent correlated mutations. Moreover, DCA models capture only linear relationships between residues in the MSA data (S1 Text), whereas residue-residue relationships in proteins are inherently non-linear.
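To make the notion of raw coevolutionary features concrete, the following is a minimal sketch of how a covariance matrix and a ridge-regularized precision matrix could be computed from an MSA. It is an illustration under stated assumptions, not the TripletRes pipeline: sequences are weighted uniformly (no sequence reweighting), the 21-letter alphabet ordering is arbitrary, the `ridge` value is a hypothetical choice, and the pseudolikelihood-maximized Potts model feature is not covered.

```python
import numpy as np

ALPHABET = "ARNDCQEGHILKMFPSTWYV-"  # 20 amino acids plus gap (assumed ordering)
Q = len(ALPHABET)

def one_hot_msa(seqs):
    """Encode N aligned sequences of length L as an (N, L*Q) one-hot matrix."""
    index = {a: k for k, a in enumerate(ALPHABET)}
    n, length = len(seqs), len(seqs[0])
    x = np.zeros((n, length, Q))
    for s, seq in enumerate(seqs):
        for i, aa in enumerate(seq):
            x[s, i, index.get(aa, Q - 1)] = 1.0  # unknown letters count as gap
    return x.reshape(n, length * Q)

def covariance_and_precision(seqs, ridge=0.1):
    """Return (cov, prec), each shaped (L, L, Q, Q).

    cov[i, j, a, b] = f_ij(a, b) - f_i(a) * f_j(b), with uniform weights.
    prec is the inverse of (cov + ridge * I); the ridge term is an assumed,
    simple regularizer that keeps the matrix invertible.
    """
    x = one_hot_msa(seqs)
    n, length = x.shape[0], len(seqs[0])
    xc = x - x.mean(axis=0)                 # center each (position, letter) column
    cov = xc.T @ xc / n                     # (L*Q, L*Q) covariance
    prec = np.linalg.inv(cov + ridge * np.eye(length * Q))

    def to_blocks(m):
        # Rearrange the flat matrix into per-residue-pair Q x Q blocks.
        return m.reshape(length, Q, length, Q).transpose(0, 2, 1, 3)

    return to_blocks(cov), to_blocks(prec)
```

Each residue pair (i, j) is thereby described by a full 21 x 21 block rather than a single score, which is the kind of raw per-pair feature that a convolutional network can take as input channels.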

Methods

Results

Conclusion