Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more desirable. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of the promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ 70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well on cross-validation experiments while its performance on independent test data is drastically very poor. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well despite how the non-promoter data was obtained. On the other hand, poor prediction models seems to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ 70 promoter prediction methods.
Highlights
Transcription initiation is the first and key step leading to gene expression [1]
The development of reliable prokaryotic promoter region prediction methods is highly desirable for improving the accuracy of microbial genomes annotation tools
We evaluated several strategies for generating non-promoter sequences and showed that a more accurate estimate of the classifier performance could be obtained using negative data consisting of equal size subsets of sequences generated using multiple strategies or by generating multiple versions of the cross-validation data and use the average cross-validation performance over these data sets as the estimated cross-validation performance of the classifier
Summary
Transcription initiation is the first and key step leading to gene expression [1]. The process starts with the binding of RNA polymerase (RNAP) to a specific segment in DNA (called promoter region) located upstream of the transcription start site (TSS). Structure based features include: stress induced duplex destabilization (SIDD), DNA curvature and stacking energy explored in [13], roll, tilt, twist and average free energy used in [14], and DNA stability proposed in [23]; iii) Classification algorithms: support vector machines (SVMs) and artificial neural networks (ANNs) are widely used for this classification task; iv) Evaluation procedures: the vast majority of prediction methods [7, 8, 10, 12, 13, 15,16,17, 19, 26, 33] have been evaluated using cross-validation experiments. We propose a meta-predictor combining two sequence-based and structure-based predictors for predicting E. coli σ70 promoter regions and compare it with some state-of-the-art prediction methods
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.