Abstract
Large language models (LLMs) are increasingly applied to biological prediction, where self-supervised learning yields feature representations of amino acid sequences. These models have set a new benchmark on downstream tasks such as subcellular localization. However, previous studies have focused primarily on model architecture or on fine-tuning strategies, and have largely overlooked the nature of the features derived from LLMs. In this research, we propose several strategies for extracting ESM2 representations that consider both the character type and the position within the ESM2 input sequence. Using dimensionality reduction, predictive analysis, and interpretability techniques, we identify potential associations between different feature types and specific subcellular localizations. In particular, predictions for the mitochondrion and the Golgi apparatus favor features from segments closer to the N-terminus, and phosphorylation site-based features can reflect phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Network models with varied feature inputs. This work offers novel insights into maximizing the utility of LLMs, understanding their mechanisms, and extracting biological domain knowledge. The code, feature extraction API, and all relevant materials are available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.
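The abstract does not specify how position-based and character-type-based features are pooled, so the following is only a minimal sketch of the general idea, under stated assumptions. A random array stands in for the per-residue ESM2 embeddings, the function names `segment_mean` and `residue_type_mean` are hypothetical, mean pooling is an assumed aggregation, and treating every S, T, and Y residue as a candidate phosphorylation site is a simplification rather than the authors' method.

```python
import numpy as np

def segment_mean(emb, start_frac, end_frac):
    """Mean-pool residue embeddings over a fractional window of the sequence."""
    n = emb.shape[0]
    lo = min(int(np.floor(start_frac * n)), n - 1)   # keep the window non-empty
    hi = max(lo + 1, int(np.ceil(end_frac * n)))
    return emb[lo:hi].mean(axis=0)

def residue_type_mean(emb, sequence, residues):
    """Mean-pool embeddings at positions whose amino acid is in `residues`."""
    mask = np.array([aa in residues for aa in sequence])
    if not mask.any():
        # No matching residue: fall back to a zero vector (an assumption).
        return np.zeros(emb.shape[1])
    return emb[mask].mean(axis=0)

# Toy sequence and a random stand-in for per-residue embeddings
# (a real ESM2 model would produce one vector per residue instead).
sequence = "MSTAYLKRSGT"
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(sequence), 8))

whole = emb.mean(axis=0)                              # whole-sequence feature
n_term = segment_mean(emb, 0.0, 0.25)                 # N-terminal segment feature
sty = residue_type_mean(emb, sequence, set("STY"))    # candidate phospho-residue feature
```

Each resulting vector has the same dimensionality as a single residue embedding, so the different feature types could be passed interchangeably to a downstream classifier such as a Random Forest for comparison.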