Abstract

Advances in next-generation sequencing technologies have given rise to large-scale, open-source protein databases comprising hundreds of millions of sequences. However, to make these sequences useful in biomedical applications, they must be painstakingly annotated through manual curation from the literature. To address this problem, many automated annotation algorithms have been developed over the years, including, especially in recent times, deep learning models. In this work, we propose a transformer-based deep learning model that predicts the Enzyme Commission (EC) numbers of an enzyme from full-length sequences with state-of-the-art accuracy compared with other recent machine learning annotation algorithms. The system performs especially well on a clustered-split dataset, in which the training and testing samples are drawn from different distributions and are structurally dissimilar from each other. This indicates that the model learns deep patterns within the sequences and can accurately identify the motifs responsible for different EC numbers. Moreover, the algorithm retains similar accuracy even when the training set size is significantly reduced, and its accuracy is independent of sequence length, making it suitable for a wide range of applications involving varying sequence structures.
