Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge due to their diverse architecture and variations. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, and state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and MCC metrics. Our findings reveal that XGBoost outperformed other models and current state-of-the-art promoter prediction tools, namely Sigma70pred and iPromoter2L, achieving F1-scores >95% in most systems. Significantly, the use of one-hot encoding for representing nucleotide sequences complements these structural features, enhancing our XGBoost model's predictive capabilities. To address the challenge of model interpretability, we incorporated explainable AI techniques using Shapley values. This enhancement allows for a better understanding and interpretation of the predictions of our model. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, utilizing original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.
Read full abstract