ProGen2: Exploring the boundaries of protein language models

Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, Ali Madani

doi:10.1016/j.cels.2023.10.002

Open Access

https://doi.org/10.1016/j.cels.2023.10.002

Abstract

Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.

Journal: Cell Systems
Publication Date: Oct 30, 2023
Citations: 119

Similar Papers

Efficient Exploration of Sequence Space by Sequence-Guided Protein Engineering and Design.
Ben E Clifton ... Dan Kozome
Biochemistry | VOL. 62
04 Mar 2022

Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning.
Hyebin Song ... Bennett J Bremer
Cell Systems | VOL. 12
18 Nov 2020

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction.
Yang Qu ... Taowa Zhao
International Journal of Molecular Sciences | VOL. 24
18 Nov 2023

Learning to Read and Write in the Language of Proteins
Helen T Hobbs ... Chang C Liu
GEN Biotechnology | VOL. 2
01 Apr 2023
