Abstract
Advances in next-generation sequencing technologies have given rise to large-scale, open-source protein databases containing hundreds of millions of sequences. To be useful in biomedical applications, however, these sequences must be painstakingly annotated by curating them from the literature. To address this problem, many automated annotation algorithms have been developed over the years, including, more recently, deep learning models. In this work, we propose a transformer-based deep learning model that predicts the Enzyme Commission (EC) numbers of an enzyme from full-length sequences with state-of-the-art accuracy compared with other recent machine learning annotation algorithms. The model performs especially well on the clustered split dataset, in which the training and testing samples are drawn from different distributions and are structurally dissimilar from each other. This indicates that the model learns deep patterns within the sequences and can accurately identify the motifs responsible for different EC numbers. Moreover, the algorithm retains similar accuracy even when the training set size is significantly reduced, and its accuracy is independent of sequence length, making it suitable for a wide range of applications involving varying sequence structures.
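For readers who want a concrete picture of the kind of architecture the abstract describes, the following is a minimal, hypothetical PyTorch sketch of a transformer encoder that classifies an integer-encoded amino-acid sequence into EC classes. All names, dimensions, and the seven-class output head are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: 20 standard amino acids plus a padding token.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_IDX = 0
VOCAB_SIZE = len(AMINO_ACIDS) + 1  # +1 for padding


class ECNumberClassifier(nn.Module):
    """Toy transformer encoder mapping an amino-acid sequence to EC-class logits."""

    def __init__(self, num_ec_classes, d_model=128, nhead=4, num_layers=2, max_len=1024):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD_IDX)
        self.pos_embed = nn.Embedding(max_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_ec_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer-encoded residues, 0 = padding
        positions = torch.arange(tokens.size(1), device=tokens.device).unsqueeze(0)
        x = self.embed(tokens) + self.pos_embed(positions)
        pad_mask = tokens == PAD_IDX
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padding positions, then classify.
        lengths = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        pooled = (x * (~pad_mask).unsqueeze(-1)).sum(dim=1) / lengths
        return self.classifier(pooled)


def encode_sequence(seq, max_len=1024):
    """Integer-encode an amino-acid string, skipping unrecognized residues."""
    idx = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
    ids = [idx[aa] for aa in seq.upper() if aa in idx][:max_len]
    ids += [PAD_IDX] * (max_len - len(ids))
    return torch.tensor(ids).unsqueeze(0)


if __name__ == "__main__":
    model = ECNumberClassifier(num_ec_classes=7)  # e.g. top-level EC classes 1-7
    logits = model(encode_sequence("MKTLLLTLVVVTIVCLDLGYT", max_len=64))
    print(logits.shape)  # torch.Size([1, 7])
```

The mean pooling over non-padding positions is one simple way to keep the prediction independent of sequence length, in the spirit of the property the abstract claims; the paper's actual pooling and classification head may differ.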