Protein coding regions prediction by fusing DNA shape features

Miao Chen, Yangyang Li, Kun Zhang, Hao Liu

doi:10.1016/j.nbt.2023.12.006

Abstract

Exons crucial for coding are often hidden within introns, and the two vary greatly in length. As a result, deep learning-based methods for predicting protein-coding regions often perform poorly on genomes with more complex structure. DNA shape information also helps reveal the underlying logic of gene expression, yet current methods ignore DNA shape features when distinguishing coding from non-coding regions. We propose a method for predicting protein-coding regions with the CNNS-BRNN model, which incorporates DNA shape features and improves the model's ability to distinguish intronic from exonic features. The method uses a fusion coding technique that combines DNA shape features with traditional sequence features. Experiments show that it outperforms the baseline method on metrics such as AUC and F1 by 2.3% and 5.3%, respectively, and that the fusion coding with DNA shape features significantly improves model performance.
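The abstract does not specify how the fusion coding is constructed. As a rough illustration only, one common way to combine sequence and shape information is to concatenate a one-hot encoding of each base with per-position shape values. The sketch below assumes this concatenation scheme, assumes the shape values are supplied by the caller as an array with one row per base, and adds a per-channel z-score step as a further assumption; the names `one_hot` and `fuse` are hypothetical and are not taken from the paper.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Return an (L, 4) one-hot matrix; bases outside ACGT (e.g. N) become all-zero rows."""
    index = {base: i for i, base in enumerate(BASES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            out[pos, index[base]] = 1.0
    return out


def fuse(seq, shape):
    """Concatenate one-hot sequence features with per-position shape features.

    `shape` is assumed to be an (L, k) array of k shape channels per base,
    computed elsewhere; the choice of channels is left to the caller.
    """
    shape = np.asarray(shape, dtype=np.float32)
    if shape.ndim != 2 or shape.shape[0] != len(seq):
        raise ValueError("shape features must have one row per base")
    # Assumption: z-score each shape channel so channels on different scales are comparable.
    mean = shape.mean(axis=0, keepdims=True)
    std = shape.std(axis=0, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant channels
    return np.concatenate([one_hot(seq), (shape - mean) / std], axis=1)
```

For a sequence of length L with k shape channels, `fuse` returns an (L, 4 + k) matrix that could be fed to a convolutional or recurrent model as a per-position feature map.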

Full Text