Evaluating large language models for annotating proteins.

Rosario Vitale, Leandro A Bugnon, Emilio Luis Fenoy, Diego H Milone, Georgina Stegmayer

doi:10.1093/bib/bbae177

Abstract

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has succeeded in automatically growing the Pfam annotations, but at a low rate compared with protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on large unannotated datasets, in order to obtain sequence embeddings. The embeddings can then be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
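
The two-stage idea described in the abstract (obtain a fixed embedding per sequence, then train a supervised model on a small annotated set) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: `embed` is a hypothetical stand-in that returns amino-acid frequencies instead of a real protein LLM embedding, the nearest-centroid classifier is an assumed simple substitute for the machine learning architectures evaluated in the paper, and the sequences and family names are made up.

```python
from collections import defaultdict
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Stand-in for a frozen protein LLM (ASSUMPTION, not the paper's model).

    Returns a fixed-length vector: the frequency of each of the 20 standard
    amino acids. A real pipeline would return the LLM's pooled embedding here.
    """
    length = max(len(sequence), 1)
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


def fit_centroids(embeddings, labels):
    """Supervised step: average the embeddings of each labeled family."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, embedding):
    """Assign the family whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], embedding))


# Toy, made-up sequences and hypothetical family names, for illustration only.
train = [
    ("AAAAGAAALAAA", "FAM_A"),
    ("AAGAAAALAAGA", "FAM_A"),
    ("KKKEKKRKKEKK", "FAM_B"),
    ("KRKKEKKKKRKE", "FAM_B"),
]
centroids = fit_centroids([embed(s) for s, _ in train], [y for _, y in train])
prediction = predict(centroids, embed("AAALAAGAAAAA"))
```

The point of the design is that the embedding function is never retrained: only the small supervised step sees the family labels, which is what makes the approach attractive for poorly populated families. The repository linked in the abstract should be consulted for the actual models and classifiers.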

Journal: Briefings in Bioinformatics
Publication Date: Mar 27, 2024
License type: cc-by