Abstract

The rise of transformers and large language models (LLMs) in chemistry and biology has opened new avenues for the design and understanding of therapeutics. Protein sequences can be modeled as a language and can therefore benefit from recent advances in LLMs, especially given the abundance of available protein sequence data sets. In this letter, we develop GPCR-BERT, a model for understanding the sequential design of G protein-coupled receptors (GPCRs). GPCRs are the target of over one-third of Food and Drug Administration-approved pharmaceuticals, yet the relationship among amino acid sequence, ligand selectivity, and conformational motifs (such as NPxxY, CWxP, and E/DRY) is not comprehensively understood. By fine-tuning a pretrained protein model (Prot-Bert) on tasks that predict variations in these motifs, we shed light on several relationships between residues in the binding pocket and some of the conserved motifs. To do so, we interpreted the attention weights and hidden states of the model to quantify how much each amino acid contributes to determining the identity of the masked residues. The fine-tuned models predicted hidden residues within the motifs with high accuracy. In addition, we analyzed the embeddings against 3D structures to elucidate higher-order interactions within the conformations of the receptors.

