Abstract

Motivation: Accurate identification and delineation of promoters and transcription start sites (TSSs) is important for improving genome annotation and for designing experiments to study transcriptional regulation. Many promoter identifiers have been developed, but each has its own focus and limitations. We introduce an integration scheme that combines several identifiers to achieve better prediction performance.

Result: Eight promoter identifiers (Proscan, TSSG, TSSW, FirstEF, eponine, ProSOM, EP3, FPROM) are chosen for the investigation of integration. A feature selection method, mRMR (Minimum Redundancy Maximum Relevance), is applied in a novel way to promoter identifier selection, choosing a group of robust and complementary identifiers. For comparison, four integration methods (SMV, WMV, SMV_IS, WMV_IS), ranging from simple to complex, are developed and applied to a training dataset of 1400 sequences and a testing dataset of 378 sequences. mRMR selects 5 identifiers (FPROM, FirstEF, TSSG, eponine, TSSW), and their integration achieves correct prediction rates of 70.08% and 67.83% on the training and testing datasets, respectively. This is better than any single identifier: the best single one achieves only 59.32% and 61.78% on the training and testing datasets, respectively.
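To make the voting idea concrete, here is a minimal Python sketch of simple majority voting (SMV) and weighted majority voting (WMV) over per-identifier promoter calls. It is an illustration only, not the authors' implementation: the abstract does not state how weights are derived or how ties are broken, so the weights, the strict-majority tie rule, and the example calls below are all assumptions.

```python
from typing import Dict


def simple_majority_vote(votes: Dict[str, bool]) -> bool:
    """Call a promoter when strictly more than half of the identifiers say so.

    The strict-majority tie rule is an assumption, not taken from the paper.
    """
    positives = sum(votes.values())
    return positives * 2 > len(votes)


def weighted_majority_vote(votes: Dict[str, bool], weights: Dict[str, float]) -> bool:
    """Call a promoter when the positive votes carry more than half the total weight."""
    total = sum(weights[name] for name in votes)
    positive = sum(weights[name] for name, vote in votes.items() if vote)
    return positive * 2 > total


# Hypothetical calls for one sequence from the five identifiers named in the abstract.
votes = {"FPROM": True, "FirstEF": True, "TSSG": False, "eponine": False, "TSSW": False}

# Hypothetical weights; one plausible choice would be each identifier's
# accuracy on the training set, but the abstract does not specify this.
weights = {"FPROM": 0.9, "FirstEF": 0.8, "TSSG": 0.3, "eponine": 0.3, "TSSW": 0.3}

smv = simple_majority_vote(votes)             # 2 of 5 votes are positive
wmv = weighted_majority_vote(votes, weights)  # positive weight 1.7 of total 2.6
print(smv, wmv)
```

With these made-up numbers the two schemes disagree: two heavily weighted identifiers outvote three lightly weighted ones under WMV, while SMV counts every identifier equally.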

Highlights

  • A promoter, a short DNA sequence, is the binding site of RNA polymerases

  • Four integration methods, Simple Majority Voting (SMV), Weighted Majority Voting (WMV), Simple Majority Voting plus Identifier Selection (SMV_IS) and Weighted Majority Voting plus Identifier Selection (WMV_IS), ranging from simple to complex, are developed and applied to a training dataset of 1400 sequences and a testing dataset of 378 sequences

Summary

Introduction

A promoter, a short DNA sequence, is the binding site of RNA polymerases and determines the transcription start site (TSS). After RNA polymerase binds to a promoter, the promoter initiates transcription and indicates where it should start. To be recognized by RNA polymerases, promoters have a rather stable structure; for example, in eukaryotic genomes many promoters contain a TATA box, so searching for TATA sequences can help locate promoters. Besides the TATA box, functional motifs, oligonucleotide composition and compositional features are used for promoter identification [1,2,3,4,5,6,7,8]. This paper investigates a novel way to combine several promoter identifiers to improve the identification rate.
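As a rough illustration of locating candidate promoters by searching for TATA sequences, the Python sketch below scans a DNA string for a simplified TATA-box motif. The pattern `TATA[AT]A` is an assumed, shortened form of the commonly cited consensus, and the example sequence is made up; real promoter identifiers use far richer models than a single motif match.

```python
import re
from typing import List

# Simplified TATA-box motif (an assumption for illustration only).
# The lookahead lets overlapping occurrences be reported as well.
TATA_PATTERN = re.compile(r"(?=(TATA[AT]A))")


def find_tata_boxes(sequence: str) -> List[int]:
    """Return 0-based start positions of motif matches in a DNA sequence."""
    return [match.start() for match in TATA_PATTERN.finditer(sequence.upper())]


# Made-up example sequence, not taken from any dataset in the paper.
positions = find_tata_boxes("ggcgcTATAAAaggcctatataagc")
print(positions)
```

A hit only marks a candidate region: many promoters lack a TATA box and many matches occur outside promoters, which is one reason the other features listed above are also used.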
