COVID-19 is caused by SARS-CoV-2, a highly contagious virus with a positive-sense, single-stranded RNA genome. A thorough understanding of SARS-CoV-2 pathogenesis is crucial for halting its spread. Notably, the 3C-like protease of the coronavirus (3CLpro) plays an essential role in viral replication. Precisely identifying 3CLpro cleavage sites is therefore important for elucidating the transmission dynamics of SARS-CoV-2. Machine learning tools have been used to identify potential 3CLpro cleavage sites, but existing methods often fall short in accuracy. To improve prediction performance, we propose a novel analytical framework, the Transformer and Deep Forest Fusion Model (TDFFM). Within TDFFM, we encode protein sequences using the AAindex and the BLOSUM62 matrix. The encoded features are then fed into two distinct components: a Deep Forest, an effective decision-tree ensemble method, and a Transformer equipped with a Multi-Level Attention Model (TMLAM). The attention mechanism allows our model to identify positive samples more accurately, thus enhancing overall predictive performance. Evaluation on a test set shows that TDFFM achieves an accuracy of 0.955, an AUC of 0.980, and an F1-score of 0.367, substantiating the model's superior prediction capabilities.
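For readers comparing the reported accuracy and F1-score, the following minimal sketch shows how the two metrics are computed from confusion-matrix counts. The function name `accuracy_and_f1` and all counts are hypothetical illustrations, not data or code from this work; they assume a binary task in which cleavage sites (positive samples) are rare.

```python
def accuracy_and_f1(tp, fp, tn, fn):
    """Return (accuracy, F1) for binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision and recall only consider the positive class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical counts: 1000 candidate windows, 20 of them true cleavage sites.
acc, f1 = accuracy_and_f1(tp=10, fp=30, tn=950, fn=10)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

Because accuracy counts the many true negatives while F1 depends only on how the positive class is handled, the two values can differ widely when positive samples are scarce.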