Abstract

The emergence of self-supervised deep language models has revolutionized natural language processing and has recently been extended to biological sequence analysis. Traditional language models, primarily based on Transformer architectures, are effective across a wide range of applications. However, they are inherently constrained by the quadratic computational complexity of the attention mechanism, which limits their efficiency and leads to high computational costs. To address these limitations, we introduce ProtHyena, a novel approach that applies the Hyena operator to protein language modeling. This methodology alternates subquadratic long convolutions with element-wise gating operations, circumventing the constraints imposed by attention and reducing computational complexity to subquadratic levels, which enables faster and more memory-efficient modeling of protein sequences. With only 1.6M parameters, ProtHyena achieves state-of-the-art or comparable performance on 8 downstream tasks, including protein engineering (protein fluorescence and stability prediction), protein property prediction (neuropeptide cleavage, signal peptide, solubility, disorder, and gene function prediction), and protein structure prediction. The ProtHyena architecture represents a highly efficient solution for protein language modeling, offering a promising avenue for fast and efficient analysis of protein sequences.
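As a rough illustration of the mechanism described above, the sketch below shows a Hyena-style block in PyTorch that replaces attention with an FFT-based long convolution followed by element-wise gating. The module and its names (HyenaStyleBlock, d_model, max_len) are assumptions made for illustration only, not ProtHyena's actual implementation.

```python
# Minimal sketch of a Hyena-style block (illustrative assumption, not
# ProtHyena's published code): an FFT-based long convolution followed by
# element-wise gating, giving O(L log L) cost instead of attention's O(L^2).
import torch
import torch.nn as nn
import torch.fft


class HyenaStyleBlock(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Project the input into a value stream and a gate stream.
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        # One learnable long filter per channel, spanning the full sequence.
        self.filter = nn.Parameter(torch.randn(d_model, max_len) * 0.02)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        L = x.shape[1]
        v, gate = self.in_proj(x).chunk(2, dim=-1)

        # Long convolution via FFT: pad to 2L so the circular convolution
        # is equivalent to a causal linear convolution over the sequence.
        v_f = torch.fft.rfft(v.transpose(1, 2), n=2 * L)       # (B, D, L+1)
        k_f = torch.fft.rfft(self.filter[:, :L], n=2 * L)      # (D, L+1)
        conv = torch.fft.irfft(v_f * k_f, n=2 * L)[..., :L]    # (B, D, L)
        conv = conv.transpose(1, 2)                            # (B, L, D)

        # Element-wise gating in place of attention.
        return self.out_proj(torch.sigmoid(gate) * conv)


# Example: a batch of 4 protein sequences of length 1024 with 128 channels.
block = HyenaStyleBlock(d_model=128, max_len=1024)
out = block(torch.randn(4, 1024, 128))
print(out.shape)  # torch.Size([4, 1024, 128])
```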