
Learning to Read and Write in the Language of Proteins

Helen T. Hobbs and Chang C. Liu
Department of Biomedical Engineering, University of California, Irvine, Irvine, California, USA
GEN Biotechnology, Vol. 2, No. 2. Published online April 18, 2023. https://doi.org/10.1089/genbio.2023.29086.hth

Writing in Nature Biotechnology, Madani et al. describe a deep-learning language model that learns “grammatical” rules to write novel protein sequences with defined functions.

Like the words in this paragraph, the order of amino acids in a protein conveys a specific idea. In their recent study,1 Ali Madani and colleagues from Salesforce Research and the University of California, San Francisco, describe a deep-learning language model that can write novel protein sequences with defined functions using the “grammatical” rules learned from natural protein data sets.

The design of proteins with desired structures and functions is a long-standing goal in protein engineering. Unlocking the ability to design these cellular workhorses to achieve specific functions or folds has important implications in many fields, including medicine, biotechnology, and synthetic biology. Previous studies have applied a variety of strategies, for example, directed evolution,2,3 de novo protein design,4,5 and coevolutionary analysis,6,7 to generate new protein sequences with specific functional and structural properties. Although powerful, these strategies are not easily generalized, requiring significant experimental effort, computational power, or the generation of extensive multiple sequence alignments for each protein family of interest.

Inspiration for a more generalizable strategy has been found in the world of artificial intelligence. Natural language processing models combine linguistics and statistical learning to enable computers to understand the complete meaning of a text, including the author's intent.8,9 One such model, ChatGPT, has recently made headlines with its astounding ability to generate human-like answers to almost any prompt.10

The recognition that proteins are strings of nonrandom letters (just like the words in a paragraph) and that the order of those letters is intricately linked to the intent of the protein (i.e., its function) motivated the application of language models to protein sequences. When trained on a data set comprising natural proteins, such a model can learn the rules dictating protein sequence, just as it can learn the rules dictating language. However, to link sequence to intent, the training data set must include additional information.
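To make the language analogy concrete, the minimal sketch below (an illustration for this commentary, not code from the study; the vocabulary layout and special tokens are assumptions) shows what it means to treat a protein as a sentence: each of the 20 amino-acid letters becomes a token in a small fixed vocabulary, and a language model is then trained to predict the next token from the preceding ones, just as it would for words.

```python
# Illustrative only: a protein sequence handled like a sentence of tokens.
# The vocabulary layout and special markers are assumptions, not ProGen's scheme.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard residues
SPECIALS = ["<start>", "<end>"]           # hypothetical sequence-boundary tokens

# Map every token to an integer id, as a text model maps words to ids.
vocab = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def encode(sequence: str) -> list[int]:
    """Convert an amino-acid string into a list of token ids."""
    tokens = ["<start>"] + list(sequence.upper()) + ["<end>"]
    return [vocab[t] for t in tokens]

# A dummy lysozyme-like fragment encoded like a sentence.
ids = encode("KVFGRCELAAAMKRHGLDNYRG")
print(ids)  # integer ids the model would be trained to predict, one after another
```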
Inspired by the success of conditional language models in text generation, Madani et al. developed ProGen, a conditional language model for proteins capable of doing just that (Fig. 1).

In a conditional model, sequence generation is controlled by control tags summarizing the intent. In the case of language models, these control tags can include the specific style or topic of the text. In the case of ProGen, the control tags encompassed taxonomic identifiers and various keywords associating each sequence with biological properties such as molecular function, cellular location, and Pfam ID (protein family). The training data set for ProGen contained 280 million nonredundant protein sequences and their associated keywords. The sequences, taxonomic identifiers, and keywords were all curated from publicly available databases (e.g., UniProtKB).

FIG. 1. Application of a conditional language model to proteins. Madani et al. developed ProGen, a conditional language model that generates artificial protein sequences based on an input control tag.1 The model is trained on a large general data set of 280 million protein sequences, each linked to 1100 different control tags that associate the sequence with a specific property such as a protein family, cellular location, or molecular function. The model's performance can be further improved by additional training on a smaller curated data set of sequences associated with a single control tag, for example, their protein family. This fine-tuned model can then be asked to generate sequences predicted to have properties correlated with the input control tag(s). The diversity of the artificial sequences can also be controlled. (Created with BioRender.com.)

In testing their model, the authors demonstrated that, after initial training on the large data set of diverse protein families, ProGen's performance was further improved by additional training on a smaller, more focused data set, a process they refer to as fine-tuning. The authors note that training on the larger data set alone may produce sequences that are functional with respect to a specific keyword, but that the success rate would likely be much lower than when fine-tuned. Thus, the model first learns the general rules governing all proteins, and it then builds on those through fine-tuning to further specialize toward a specific keyword. This process could be thought of as first learning the basic rules of semantics and then learning what makes a text about a specific topic, such as politics or science.
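To show the shape of this workflow, the toy sketch below substitutes a first-order Markov chain over amino acids for ProGen's large transformer (a deliberate simplification, with hypothetical class, tag, and sequence names). The steps mirror the description above: train on control-tag-labeled sequences, fine-tune on a single family, and then sample new sequences conditioned on a chosen tag, with a temperature parameter controlling how diverse the output is.

```python
# Toy stand-in for conditional protein sequence generation; not the authors' code.
# A first-order Markov chain replaces the transformer, but the workflow is the same:
# condition on a control tag, pretrain broadly, fine-tune on one family, then sample.
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
START = "^"  # start-of-sequence marker (hypothetical)

class ToyConditionalModel:
    def __init__(self):
        # counts[(control_tag, previous_residue)][next_residue] -> observed count
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, tagged_sequences):
        """tagged_sequences: iterable of (control_tag, amino_acid_string) pairs."""
        for tag, seq in tagged_sequences:
            prev = START
            for residue in seq:
                self.counts[(tag, prev)][residue] += 1.0
                prev = residue

    # Fine-tuning is simply continued training on a smaller, family-specific set.
    finetune = train

    def generate(self, control_tag, length=40, temperature=1.0):
        """Sample a sequence conditioned on a control tag, one residue at a time."""
        prev, out = START, []
        for _ in range(length):
            table = self.counts.get((control_tag, prev))
            if not table:                      # unseen context: fall back to uniform
                nxt = random.choice(AMINO_ACIDS)
            else:
                residues = list(table)
                # Temperature > 1 flattens the distribution, giving more diverse output.
                weights = [table[r] ** (1.0 / temperature) for r in residues]
                nxt = random.choices(residues, weights=weights, k=1)[0]
            out.append(nxt)
            prev = nxt
        return "".join(out)

model = ToyConditionalModel()
# Pretraining on a broad tagged corpus (dummy data standing in for ~280M sequences).
model.train([("family_A", "MKVLLAGLLALSA"), ("family_B", "GSHMENLYFQGA")])
# Fine-tuning on a single family, analogous to the lysozyme fine-tuning step.
model.finetune([("family_A", "MKVILSGALLAVSA")])
print(model.generate("family_A", length=30, temperature=1.2))
```

In this toy, raising the temperature in the final call plays the role of asking the model for sequences that drift further from anything it was trained on, the analog of the highly diverged “twilight zone” sampling discussed below.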
Testing, testing

The authors chose five distinct families from the lysozyme group of enzymes as the first test of their model's performance. These families contain highly diverse sequences and multiple protein folds yet facilitate a similar biological function. A smaller data set of lysozyme sequences (∼55,000 sequences) was curated from all five families and used as the fine-tuning training data set for ProGen. The Pfam ID of each lysozyme family was used as an input control tag to generate 1 million artificial lysozyme sequences spanning the five natural lysozyme families.

Representative sequences from both the natural and artificial lysozymes were expressed and purified using cell-free protein synthesis and affinity chromatography, with the majority well expressed in this system. When assayed for activity toward a specific lysozyme substrate, many of the artificial enzymes (73%) had measurable activity comparable with that observed for natural lysozymes.

Excitingly, several of the ProGen-generated lysozymes had enzymatic activities similar to those of the most active natural lysozymes, suggesting ProGen can generate sequences as active as those that have been optimized over the course of natural evolution. Further characterization of a select set of artificial enzymes demonstrated that these retained the important biochemical and biophysical properties of the family, including the structural motifs related to substrate binding and the active site.

One might argue that the highly active artificial lysozymes did not diverge enough from natural sequences to constitute truly new sequences. For example, two highly active lysozymes, L056 and L070, were 69.6% and 89.2% identical to a natural sequence, respectively. To explore whether this argument holds, the authors took advantage of the fact that ProGen can be parameterized to generate sequences that are highly diverged from any natural sequence. This feature allowed for the generation of 95 sequences in the protein “twilight zone,” wherein structure and function cannot be inferred from sequence similarity.11 Of these, most were successfully expressed, with about a third of those being soluble.

Six of the soluble enzymes were purified, and all were determined to be active (although with less catalytic activity than many of the natural and other artificial lysozymes). The artificial lysozyme with the lowest percent identity to any natural lysozyme (31%) had a catalytic activity 200-fold less than that of the standard egg white lysozyme, still impressive considering its significant divergence from any sequence optimized over the course of natural evolution. These folded and active sequences with almost no homology to any natural proteins may be exciting starting points for directed and continuous evolution methods aimed at engineering novel proteins.
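How far a generated sequence sits from nature can be quantified as its highest percent identity to any natural sequence. The sketch below (a generic illustration, not the analysis pipeline used in the study; the scoring values and dummy sequences are assumptions) aligns an artificial sequence against a small set of natural ones with a simple global alignment and reports the best identity; a real analysis would use a proper substitution matrix such as BLOSUM62 and a full sequence database.

```python
# Rough sketch of a "nearest natural neighbor" identity check; illustrative only.

def global_align_identity(a: str, b: str, match=1, mismatch=-1, gap=-1) -> float:
    """Return percent identity of a best-scoring global (Needleman-Wunsch) alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best alignment score of a[:i] versus b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back to count identical aligned positions and the alignment length.
    i, j, identical, length = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * identical / length

def max_identity_to_natural(artificial: str, natural_sequences) -> float:
    """Highest percent identity of one artificial sequence to any natural sequence."""
    return max(global_align_identity(artificial, nat) for nat in natural_sequences)

# Dummy sequences standing in for ProGen outputs and a natural-sequence database.
naturals = ["KVFGRCELAAAMKRHGLDNYRG", "MKALIVLGLVLLSVTVQG"]
generated = "KVWGRCEFAAALKRHGLDQYRG"
print(f"max identity to a natural sequence: {max_identity_to_natural(generated, naturals):.1f}%")
```

Sequences whose best match falls as low as the 31% figure quoted above are the ones the text places in the “twilight zone.”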
But is ProGen generalizable across protein families beyond lysozymes? The answer appears to be yes. The authors evaluated the performance of ProGen in generating functional sequences from two other protein families, where other methods, a statistical model based on multiple sequence alignments of related proteins12 and a distinct nonlanguage-based machine learning model,13 had already been applied. In both cases, ProGen was better at predicting which sequences would be functional. Its generalizability and its ability to produce functional, highly diverged sequences across many protein families and functions make ProGen an exciting development for protein engineering.

For example, one could envision using ProGen libraries of functional proteins different from any found in nature as a tool for engineering anti-immunogenic protein-based therapeutics. In addition, given the identification of functional “twilight” proteins, the sequence libraries generated by ProGen could reveal the true functional sequence space accessible to proteins with defined functions, rather than the relatively small sequence space explored by natural evolution.

The key question that remains concerns the difference between generation and creation. Take the example of ChatGPT. It can generate never-before-seen paragraphs that have a clear message, but does the message constitute an original idea that is meaningful and sensible? And even if the idea is not novel, how new or imaginative is its expression? Similar questions can be asked about ProGen's abilities, where new ideas might mean new-to-nature functions, and imaginative expressions might mean sequences that escape the evolutionary patterns found in natural protein families.

One way forward may be to combine ideas, for example, by conditioning on multiple control tags not found together in nature: generate an enzyme that catalyzes reaction A and binds protein B. This would be not only a route to greater creativity but also immediately useful for applications requiring proteins with composite functions.

Another route may be fine-tuning on a small set of protein sequences that happen to have an unnatural function, perhaps discovered through screening for unnatural promiscuous activities found in natural protein families. Finally, the use of “twilight” proteins generated by ProGen as starting points for directed or continuous evolution efforts may be a way to experimentally break free from the natural evolutionary ancestry of proteins, yielding new ones unlike any found in nature and perhaps with the potential for novel functions.
