Abstract

INTRODUCTION: Our understanding of gene properties has advanced through representation learning approaches such as AlphaFold. Representation learning encodes the relationships between genes by embedding them into a numerical space. These embeddings, which capture complex genetic interactions and characteristics, can then be leveraged by machine learning models to predict various gene properties. Current embeddings derive from transcriptome or sequence data. Over the past 150 years, numerous experimental assays have uncovered gene functions and interactions, comprehensive knowledge that is documented in the literature but not always evident in transcriptome or sequence data. It has been posited, however, that leveraging this knowledge to create gene embeddings could result in machine learning models biased towards well-studied genes.

METHODS AND RESULTS: We tested this hypothesis by developing a novel knowledge-embedding framework, GeneLLM. During training, GeneLLM learns to comprehend summaries of every gene, a compressed form of published knowledge, using Large Language Models (LLMs) fine-tuned for downstream tasks that map cellular properties and biochemical processes. Despite the expected bias towards well-known genes, GeneLLM surprisingly showed high predictive power for an array of gene properties. Compared to baseline models, GeneLLM improved performance by 20.3% in correlation for gene conservation across species, and by 8.6% and 57.2% in prediction accuracy for subcellular localization and gene ontology, respectively. GeneLLM also showed competitive results on solubility prediction, with an accuracy of 0.91, and achieved a correlation of 0.71 for tissue-specific expression levels across 1001 cell lines. We also showed that the bias toward well-known genes could be mitigated by combining the GeneLLM representation with transcriptome- or sequence-based embeddings. The combined embeddings outperformed their individual components, suggesting that GeneLLM extracts views complementary to existing embedding methods.

CONCLUSION: The GeneLLM framework demonstrates the ability of LLMs to extract information from the rich knowledge available about the nexus of genes and their cellular traits. It also illustrates that knowledge-based representations, despite their bias toward well-studied genes, are complementary to transcriptome- and sequence-based information. GeneLLM's ability to advance our understanding of genes, their roles in cellular processes, and their impact on oncogenesis, as well as on response and resistance mechanisms, highlights its potential in cancer research.

Citation Format: Ala Jararweh, Kushal Virupakshappa, Oladimeji S. Macaulay, Aaron Segura, Olufunmilola M. Oyebamiji, Yue Hu, Avinash D. Sahu. GeneLLM: Unveiling gene functions through literature-driven transformer embeddings [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3534.
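The abstract describes GeneLLM's recipe only at a high level: encode each gene's literature summary with a pretrained language model, use the resulting embedding for downstream property prediction, and optionally combine it with a transcriptome- or sequence-based embedding. The sketch below is a hedged illustration of that general idea, not the authors' implementation; the encoder checkpoint (dmis-lab/biobert-v1.1), the toy gene summaries, the random stand-in for a sequence-based embedding, and the localization labels are all illustrative assumptions.

```python
# Minimal sketch (not the authors' pipeline): embed free-text gene summaries with a
# pretrained transformer, concatenate with a hypothetical sequence-based embedding,
# and fit a simple classifier on top. Model choice, summaries, and labels are toy assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "dmis-lab/biobert-v1.1"  # assumed encoder; any encoder-style LLM would do

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

def embed_summary(summary: str) -> np.ndarray:
    """Mean-pool the encoder's last hidden states into one vector per gene summary."""
    inputs = tokenizer(summary, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state        # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)           # (1, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # masked mean pooling
    return pooled.squeeze(0).numpy()

# Toy inputs: abbreviated RefSeq-style summaries and a placeholder sequence embedding.
summaries = {
    "TP53": "This gene encodes a tumor suppressor protein containing transcription "
            "activation, DNA binding, and oligomerization domains.",
    "GAPDH": "This gene encodes a member of the glyceraldehyde-3-phosphate "
             "dehydrogenase family involved in glycolysis.",
}
text_emb = np.stack([embed_summary(s) for s in summaries.values()])
seq_emb = np.random.default_rng(0).normal(size=(len(summaries), 128))  # stand-in

# Combining the two views, as the abstract describes, is plain concatenation here.
combined = np.concatenate([text_emb, seq_emb], axis=1)

labels = np.array([0, 1])  # e.g., nuclear vs. cytoplasmic localization (toy labels)
clf = LogisticRegression(max_iter=1000).fit(combined, labels)
print(clf.predict(combined))
```

In practice the encoder would be fine-tuned on the downstream tasks rather than frozen, and the classifier trained and evaluated on held-out genes; the sketch only shows how a literature-derived embedding can be produced and fused with another embedding view.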
