Abstract

Motivation
Modeling the sequence distribution of protein families from homologous sequence data has recently received considerable attention, in particular for structure and function prediction and for protein design. Direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family.

Results
We introduce two scoring functions, called Dot Product and Pattern Matching, that characterize how likely the secondary structure of a protein sequence is to match a reference structure. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that using these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments.

Availability and implementation
Data and code are available at https://github.com/CyrilMa/ssqa

Supplementary information
Supplementary data are available at Bioinformatics online.

Highlights

  • Considerable efforts were devoted over the past decade to the modelling of protein families from homologous sequence data, taking advantage of the tens of millions of available sequences in databases such as Uniprot (The UniProt Consortium (2019)) or PFAM (Bateman et al (2002)).

  • Direct Coupling Analysis (DCA) defines a likelihood over the sequence space, which can be used to predict the effects of mutations to a natural sequence in comparison to mutagenesis experiments (Figliuzzi et al (2015), Hopf et al (2017)), or can be sampled to design de novo synthetic proteins, whose viability can be assessed in vivo (Russ et al (2020)).

  • Russ et al (2020) designed new protein sequences from the DCA model learned from homologous sequences of the Chorismate Mutase enzyme (PF07736 in PFAM, alignment referred to as natural sequences (NAT)), and measured their fitnesses in vivo (see section 4.1).


Summary

Introduction

Considerable efforts were devoted over the past decade to the modelling of protein families from homologous sequence data, taking advantage of the tens of millions of available sequences in databases such as Uniprot (The UniProt Consortium (2019)) or PFAM (Bateman et al (2002)). Direct Coupling Analysis (DCA) defines a likelihood P(x) over the sequence space, which can be used to predict the effects of mutations to a natural sequence in comparison to mutagenesis experiments (Figliuzzi et al (2015), Hopf et al (2017)), or can be sampled to design de novo synthetic proteins, whose viability can be assessed in vivo (Russ et al (2020)). Despite these successes, it remains unclear which aspects of the structural, functional and evolutionary constraints acting on protein sequences are adequately captured by such sequence-based models, and which features are inappropriately accounted for. DCA was shown to be successful for extracting structural information about the 3D conformation of the protein and for designing new functional proteins through the sampling of P(x) (see Russ et al (2005) and Russ et al (2020)).
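To make the idea of a likelihood over sequence space concrete, here is a minimal sketch of how a DCA-style model can score a mutation. It assumes the standard Potts form, in which the log-probability of a sequence is, up to a normalization constant, a sum of single-site fields and pairwise couplings. The names `h`, `J`, `potts_log_weight` and `mutation_effect`, the alphabet size `q = 21` (20 amino acids plus a gap symbol), and the random parameters are illustrative assumptions and are not taken from the paper or its code repository; in practice the parameters would be inferred from a sequence alignment.

```python
import numpy as np


def potts_log_weight(x, h, J):
    """Unnormalized log-probability of sequence x under a Potts model:
    sum_i h[i, x_i] + sum_{i<j} J[i, j, x_i, x_j]."""
    L = len(x)
    idx = np.arange(L)
    fields = h[idx, x].sum()
    # pair[i, j] = J[i, j, x_i, x_j] for every pair of positions
    pair = J[idx[:, None], idx[None, :], x[:, None], x[None, :]]
    # keep each pair once (i < j)
    couplings = np.triu(pair, k=1).sum()
    return fields + couplings


def mutation_effect(x, i, a, h, J):
    """Log-probability ratio between the mutant (residue a at site i) and x.
    The normalization constant cancels in the difference."""
    mutant = x.copy()
    mutant[i] = a
    return potts_log_weight(mutant, h, J) - potts_log_weight(x, h, J)


# Toy parameters (assumption: random values standing in for inferred ones)
rng = np.random.default_rng(0)
L, q = 5, 21
h = rng.normal(size=(L, q))
J = rng.normal(scale=0.1, size=(L, L, q, q))
wild_type = rng.integers(0, q, size=L)

delta = mutation_effect(wild_type, 2, 7, h, J)
print(f"log-probability change for mutating site 2 to residue 7: {delta:.3f}")
```

Under these assumptions, a negative value means the model considers the mutant less likely than the reference sequence; scores of this kind are what is compared against mutagenesis measurements.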
