Abstract
Motivation: Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated.
Results: We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns composed of multiple correlated contacts, yielding more accurate contact prediction than the local coevolution-based methods and a substantial increase in precision of 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed that deeper sequence alignments and the integration of domain-based contact prediction with full-length contact prediction improved the performance of contact prediction. Moreover, we demonstrated that domain-based contact prediction, based on a novel ab initio approach of parsing domains from MSAs alone without using known protein structures, was a simple, fast way to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances in multiple distance intervals could capture more structural information and improve binary contact prediction.
Availability and implementation: https://github.com/multicom-toolbox/DNCON2/
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
Evolutionary variation in protein sequences is constrained by protein function and structure
We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy
We demonstrated how the contact distance distribution prediction helped improve the performance of contact prediction and investigated how the number of effective sequences (Neff) in MSAs, MSA generation protocols and domain parsing method contributed to the contact prediction improvement
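As a rough illustration of how a predicted distance distribution relates to a binary contact call, the sketch below sums the probability mass of all distance intervals that lie within a contact threshold. The interval edges and the 8 Å threshold are assumptions made for this example (the text here does not list the intervals the authors used), and `contact_probability` is a hypothetical helper, not part of DNCON2.

```python
import numpy as np

# Hypothetical upper edges (in angstroms) of the distance intervals.
# The source does not specify the intervals; these are placeholders.
BIN_UPPER_EDGES = np.array([6.0, 8.0, 10.0, 13.0, np.inf])

# Assumed contact threshold: residue pairs closer than 8 angstroms.
CONTACT_THRESHOLD = 8.0


def contact_probability(dist_probs):
    """Collapse distance-interval probabilities into a contact probability.

    dist_probs has shape (..., n_bins), where the last axis holds the
    predicted probability of each distance interval for a residue pair.
    Returns an array of shape (...) with the summed probability of all
    intervals whose upper edge is at or below the contact threshold.
    """
    dist_probs = np.asarray(dist_probs, dtype=float)
    within_threshold = BIN_UPPER_EDGES <= CONTACT_THRESHOLD
    return dist_probs[..., within_threshold].sum(axis=-1)


# One residue pair with probabilities over the five assumed intervals.
example = np.array([0.1, 0.2, 0.3, 0.3, 0.1])
pair_contact_prob = contact_probability(example)
```

With these placeholder intervals, the first two bins count as contact, so the example pair's contact probability is the sum of its first two entries. A multi-interval output keeps information about pairs that are near but not in contact, which a single binary label discards.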
Summary
Evolutionary variation in protein sequences is constrained by protein function and structure. Correlated mutation patterns observed in the sequences of a protein family can indicate direct physical contact between residue pairs in the 3D structure (Altschuh et al., 1988), and can therefore be used for inter-residue contact prediction (Gobel et al., 1994). An approximate 3D protein structure can be built with good accuracy if a sufficient number of accurately predicted residue–residue contacts is available (Marks et al., 2011; Monastyrskyy et al., 2014). Owing to advances in DNA/RNA sequencing technology (Meyer et al., 2008; Wilke et al., 2016), a large number of sequences are available in public databases, making it possible to characterize correlations between residue pairs of many proteins more accurately for contact prediction than before. However, not every correlated mutation reflects a contact: some may reflect functional constraints without structural implication, and some may be accidental indirect correlations caused by transitive effects (Weigt et al., 2009).
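To make the idea of correlated mutations concrete, the following minimal sketch computes the mutual information between two columns of a toy alignment. Mutual information is used here only as a simple stand-in statistic; it is not the method described in the paper, and because it scores each column pair in isolation it cannot distinguish direct couplings from the indirect, transitive correlations mentioned above. The function name and the toy alignment are invented for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    msa is a list of equal-length aligned sequences. Frequencies are raw
    counts with no pseudocounts or sequence reweighting, which a real
    pipeline would normally add.
    """
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = col_i[a] / n
        p_b = col_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 1 always mutate together (A with K, G with R),
# while column 3 varies independently of column 0.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

coupled = mutual_information(toy_msa, 0, 1)
uncoupled = mutual_information(toy_msa, 0, 3)
```

In this toy case the co-varying pair of columns carries one bit of mutual information and the independent pair carries none. Global coevolution models and the deep learning predictors discussed here go further by considering all residue pairs jointly rather than one pair at a time.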