Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors.

Mostafa M Abbas, Yasser El-Manzalawy, Mostafa M Mohie-Eldin

doi:10.1371/journal.pone.0119721

Open Access

https://doi.org/10.1371/journal.pone.0119721

Journal: PLoS ONE
Publication Date: Mar 24, 2015
Citations: 44
License type: CC BY 4.0

Affiliation: Qatar University, Al Azhar University, Al-Azhar University

Abstract

As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; and iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while performing very poorly on independent test data. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.

Highlights

Transcription initiation is the first and key step leading to gene expression [1]
The development of reliable prokaryotic promoter region prediction methods is highly desirable for improving the accuracy of microbial genome annotation tools
We evaluated several strategies for generating non-promoter sequences and showed that a more accurate estimate of classifier performance can be obtained in two ways: by using negative data consisting of equal-size subsets of sequences generated with multiple strategies, or by generating multiple versions of the cross-validation data and using the average cross-validation performance over these data sets as the estimated cross-validation performance of the classifier
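
The two ideas in the last highlight can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the two negative-data strategies shown here (shuffling promoter sequences and drawing uniform random DNA) are assumptions chosen for illustration, and the function names and toy sequences are hypothetical.

```python
import random

def shuffled_negatives(promoters, n, rng):
    """Assumed strategy A: shuffle the bases of randomly chosen promoters."""
    out = []
    for _ in range(n):
        bases = list(rng.choice(promoters))
        rng.shuffle(bases)
        out.append("".join(bases))
    return out

def random_negatives(length, n, rng):
    """Assumed strategy B: uniform random DNA of a fixed length."""
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]

def mixed_negatives(promoters, n_total, rng):
    """Build a negative set from equal-size subsets, one subset per strategy."""
    length = len(promoters[0])
    strategies = [
        lambda n: shuffled_negatives(promoters, n, rng),
        lambda n: random_negatives(length, n, rng),
    ]
    per_strategy = n_total // len(strategies)
    negatives = []
    for make in strategies:
        negatives.extend(make(per_strategy))
    return negatives

def mean_cv_score(scores_per_version):
    """Average a cross-validation score over several versions of the data."""
    return sum(scores_per_version) / len(scores_per_version)

rng = random.Random(0)
# Toy strings for illustration only, not real promoter data.
toy_promoters = ["TTGACAATTATAAT", "TTGACTGCTATAAT"]
negatives = mixed_negatives(toy_promoters, 6, rng)
```

In this sketch, `mean_cv_score` would be fed one cross-validation score per regenerated version of the data set, so that no single choice of negative sequences dominates the estimate.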

Summary

Introduction

Transcription initiation is the first and key step leading to gene expression [1]. The process starts with the binding of RNA polymerase (RNAP) to a specific segment of DNA (called the promoter region) located upstream of the transcription start site (TSS).

Structure-based features include stress-induced duplex destabilization (SIDD), DNA curvature, and stacking energy, explored in [13]; roll, tilt, twist, and average free energy, used in [14]; and DNA stability, proposed in [23]. iii) Classification algorithms: support vector machines (SVMs) and artificial neural networks (ANNs) are widely used for this classification task. iv) Evaluation procedures: the vast majority of prediction methods [7, 8, 10, 12, 13, 15, 16, 17, 19, 26, 33] have been evaluated using cross-validation experiments.

We propose a meta-predictor combining two sequence-based and structure-based predictors for predicting E. coli σ70 promoter regions and compare it with some state-of-the-art prediction methods.
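
As a rough illustration of what a meta-predictor over a sequence-based and a structure-based predictor might look like, the Python sketch below averages the scores of two component scorers and applies a threshold. The combination rule (a plain mean), the threshold of 0.5, and both toy scorers are assumptions made for this example; the text above does not say how the predictors are combined.

```python
from typing import Callable, Sequence, Tuple

# A scorer maps a DNA sequence to a promoter score in [0, 1].
Scorer = Callable[[str], float]

def sequence_scorer(seq: str) -> float:
    """Toy stand-in for a sequence-based classifier: looks for a TATAAT motif."""
    return 1.0 if "TATAAT" in seq else 0.0

def structure_scorer(seq: str) -> float:
    """Toy stand-in for a structure-based classifier: A/T fraction of the sequence."""
    if not seq:
        return 0.0
    return sum(base in "AT" for base in seq) / len(seq)

def meta_predict(seq: str, scorers: Sequence[Scorer],
                 threshold: float = 0.5) -> Tuple[float, bool]:
    """Assumed combination rule: mean of component scores, then a threshold."""
    scores = [score(seq) for score in scorers]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= threshold

score, is_promoter = meta_predict("GGTATAATCC", [sequence_scorer, structure_scorer])
```

Real component classifiers (for example SVMs or ANNs trained on the respective feature sets) would replace the two toy scorers; other combination rules, such as weighted averaging or stacking, would fit the same interface.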

Materials and Methods

Results and Discussion

Conclusions
