Predicting peptide amyloidogenesis is a challenging problem in the field of protein folding. Large language models such as ProtBERT have recently emerged as powerful tools for analyzing protein sequences in applications such as predicting protein structure and function. In this article, we describe the use of a semisupervised, fine-tuned ProtBERT model to predict peptide amyloidogenesis from sequence alone. Our approach, which we call AggBERT, achieved state-of-the-art performance, demonstrating the potential of large language models to improve the accuracy and speed of amyloid fibril prediction over simple heuristics or structure-based approaches. This work highlights the transformative potential of machine learning and large language models in chemical biology and biomedicine.