Abstract

Polyadenylation signals (PAS) are found in most protein-coding and some non-coding genes in eukaryotes. Their accurate recognition improves the understanding of gene regulation mechanisms and the recognition of the 3'-end of transcribed gene regions, where premature or alternate transcription ends may lead to various diseases. Although different methods and tools for in-silico prediction of genomic signals have been proposed, the correct identification of PAS in genomic DNA remains challenging because of the vast number of non-relevant hexamers that are identical to PAS hexamers. In this study, we developed a novel method for PAS recognition. The method is implemented in a hybrid PAS recognition model (HybPAS), which is based on deep neural networks (DNNs) and logistic regression models (LRMs). One such model is developed for each of the 12 most frequent human PAS hexamers. DNN models performed best for eight PAS types (including the two most frequent PAS hexamers), while LRMs performed best for the remaining four PAS types. The new models use different combinations of signal processing-based, statistical, and sequence-based features as input. The results obtained on human genomic data show that HybPAS outperforms the well-tuned state-of-the-art Omni-PolyA models for 10 out of 12 PAS types, reducing the classification error by up to 57.35%, with Omni-PolyA models being better for the remaining two PAS types. For the most frequent PAS types, 'AATAAA' and 'ATTAAA', HybPAS reduced the error rate by 35.14% and 34.48%, respectively. On average, HybPAS reduces the error by 30.29%. HybPAS is implemented partly in Python and partly in MATLAB and is available at https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN.
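
The error reductions quoted above read most naturally as relative reductions in classification error with respect to the baseline model. The abstract does not spell out the formula, so the following is a minimal sketch under that assumption. The function name and the error rates are hypothetical and serve only to illustrate the arithmetic; they are not results from the study.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the percentage by which new_error is lower than baseline_error.

    Assumes "error reduction" means (baseline - new) / baseline * 100.
    A negative value means the new model has a higher error than the baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates, chosen only to make the arithmetic easy to follow.
baseline = 0.20  # assumed error rate of a baseline model
improved = 0.15  # assumed error rate of a new model

print(f"{relative_error_reduction(baseline, improved):.2f}%")
```

Under this reading, a negative value would correspond to the PAS types for which the baseline remains better.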

Highlights

  • Eukaryotic genomes have many important regions and signals, such as promoters, enhancers, transcription factor binding sites, translation start sites, splice sites, and polyadenylation signals (PAS) and sites, which define the gene regulatory landscape and demarcate gene boundaries

  • The main contribution of our study is the development of a hybrid machine learning (ML) model, the hybrid PAS recognition model (HybPAS), which comprises a separate prediction model, either a deep neural network (DNN) or a logistic regression model (LRM), for each of the 12 most common PAS variants in the human genome

  • To find the best-performing classification model for each of the 12 PAS hexamers, we developed several ML/deep learning (DL) models (DNN, LRM, shallow artificial neural network (ANN), decision tree (DT), support vector machine (SVM)) and, for each PAS motif, selected the model with the highest performance
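
The per-hexamer selection described in the highlights can be sketched as a simple lookup of the best-scoring model type for each PAS motif. This is an illustration only, not the authors' implementation: the function name, the scores, and the `OTHER_PAS` placeholder key are hypothetical, and the performance metric is assumed to be one where higher is better.

```python
from typing import Dict


def select_best_models(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each hexamer, return the model type with the highest score."""
    return {
        hexamer: max(by_model, key=by_model.get)
        for hexamer, by_model in scores.items()
    }


# Hypothetical validation scores (higher is better); not results from the study.
# "OTHER_PAS" is a placeholder key standing in for a less frequent PAS hexamer.
validation_scores = {
    "AATAAA": {"DNN": 0.80, "LRM": 0.78, "SVM": 0.75},
    "ATTAAA": {"DNN": 0.79, "LRM": 0.77, "SVM": 0.76},
    "OTHER_PAS": {"DNN": 0.70, "LRM": 0.73, "SVM": 0.71},
}

best = select_best_models(validation_scores)
for hexamer, model_name in best.items():
    print(hexamer, "->", model_name)
```

At prediction time, a hybrid model of this kind would route each candidate hexamer to the model type selected for it.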

Summary

Introduction

Eukaryotic genomes have many important regions and signals, such as promoters, enhancers, transcription factor binding sites, translation start sites, splice sites, and polyadenylation signals (PAS) and sites, which define the gene regulatory landscape and demarcate gene boundaries. We believe that signal processing methods, when used suitably, can help in developing more efficient prediction models in bioinformatics and computational biology.
