Improvements in viral gene annotation using large language models and soft alignments

William L Harrigan, Barbra D Ferrell, K Eric Wommack, Shawn W Polson, Zachary D Schreiber, Mahdi Belcaid

doi:10.1186/s12859-024-05779-6

Open Access

https://doi.org/10.1186/s12859-024-05779-6

Journal: BMC Bioinformatics
Publication Date: Apr 25, 2024
License type: CC BY 4.0

Affiliation: University of Delaware

Abstract

Background

The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings.

Results

Central to our contribution is the soft alignment algorithm, which draws on traditional protein alignment but uses embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method surpasses pooled embedding-based models not only in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect.

Conclusion

The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
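
The abstract does not spell out the soft alignment algorithm, so the following is only a minimal sketch of the general idea it describes: a traditional local alignment in which the substitution score comes from the similarity of per-residue embeddings rather than from a scoring matrix. The Smith-Waterman-style recursion, the linear gap penalty, the cosine similarity measure, the `offset` parameter, the function names, and the toy two-dimensional vectors are all assumptions made for illustration; they are not taken from the paper, and real embeddings would come from a protein language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Local alignment over per-residue embeddings (illustrative only).

    emb_a, emb_b: lists of per-residue embedding vectors.
    gap: linear gap penalty (assumed value).
    offset: subtracted from cosine similarity so that weakly similar
            residues score negatively (assumed value).
    Returns (best_score, aligned index pairs).
    """
    n, m = len(emb_a), len(emb_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # score matrix
    T = [[0] * (m + 1) for _ in range(n + 1)]    # 0 stop, 1 diag, 2 up, 3 left
    best, best_pos = 0.0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Embedding similarity replaces the substitution-matrix lookup.
            s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
            candidates = [
                (0.0, 0),                   # start a new local alignment
                (H[i - 1][j - 1] + s, 1),   # align residue i with residue j
                (H[i - 1][j] - gap, 2),     # gap in sequence b
                (H[i][j - 1] - gap, 3),     # gap in sequence a
            ]
            H[i][j], T[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell to recover the aligned residue pairs.
    pairs = []
    i, j = best_pos
    while T[i][j] != 0:
        if T[i][j] == 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif T[i][j] == 2:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy embeddings (hypothetical): residues 1-2 of A resemble residues 0-1 of B.
emb_a = [[1, 0], [0, 1], [1, 1], [1, 0]]
emb_b = [[0, 1], [1, 1], [-1, 0]]
score, pairs = soft_align(emb_a, emb_b)
print(pairs)  # [(1, 0), (2, 1)]
```

Because the traceback returns explicit residue index pairs, the result can be rendered position by position, which is the kind of BLAST-like traceability the abstract contrasts with pooled embeddings, where each sequence is collapsed to a single vector and no residue-level correspondence is available.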
