Representative transcript sets for evaluating a translational initiation sites predictor

Jia Zeng, Douglas J Demetrick, Reda Alhajj

doi:10.1186/1471-2105-10-206

Open Access

https://doi.org/10.1186/1471-2105-10-206

Journal: BMC Bioinformatics
Publication Date: Jul 2, 2009
Citations: 1
License type: CC BY 2.0

Affiliation: University of Calgary, Global University

Abstract

Background: Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens.

Results: In this paper, we report a general algorithm for constructing a reliable sequence collection that only includes mRNA sequences whose corresponding protein products present an average profile of the general protein population of a given organism, with respect to three major structural parameters. Four representative transcript collections, each derived from a model organism, have been obtained following the algorithm we propose. Evaluation of these data sets shows that they are reasonable representations of the spectrum of proteins obtained from cellular proteomic studies. Six state-of-the-art predictors have been used to test the usefulness of the construction algorithm that we proposed. A comparative study reporting the predictors' performance on our data set, as well as on three other existing benchmark collections, has demonstrated the actual merits of our data sets as benchmark testing collections.

Conclusion: The proposed data set construction algorithm has proven to be a general and widely applicable scheme. Our comparison with published proteomic studies has shown that the expression of our data set of transcripts generates a polypeptide population that is representative of that obtained from evaluation of biological specimens. Our data set thus represents "real world" transcripts that will allow more accurate evaluation of algorithms dedicated to identification of TISs, as well as other translational regulatory motifs within mRNA sequences. Our algorithm aims at compiling a redundancy-free data set by removing redundant copies of homologous proteins. The existence of such data sets may be useful for conducting statistical analyses of protein sequence-structure relations. At the current stage, the focus of our approach is to obtain an "average" protein data set for any particular organism without introducing much selection bias. However, with the three major protein structural parameters deeply integrated into the scheme, it would be straightforward to extend the current method to obtain a more selective protein data set, which may facilitate the study of some particular protein structure.

Highlights

Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics
Because our focus is on presenting a general algorithm for selecting transcript sequences that yield a set of proteins satisfying particular parameter settings, we adopt a sample scheme for configuring the molecular weight (MW) and pI parameters, based on our experience with the proteomic literature on human proteins, in which MW is within the range of 20–70 kDa and pI is between 5 and 9
We have presented a general procedure for constructing representative mRNA sequence collections for the purpose of testing translational initiation site recognition approaches
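
The MW and pI window mentioned in the highlights can be illustrated with a small Python sketch. This is not the authors' implementation: the function names are hypothetical, the pKa table is one illustrative choice among several in use (different tools produce slightly different pI values), and the third structural parameter is left out because it is not named here. The residue masses are the standard average masses of the 20 common amino acids.

```python
# Average residue masses in daltons for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.015  # one water is added for the free N- and C-termini

# Illustrative pKa values (an assumption; published tables differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def molecular_weight(protein):
    """Approximate molecular weight in daltons from the residue composition."""
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER_MASS


def net_charge(protein, ph):
    """Net charge at a given pH using the Henderson-Hasselbalch relation."""
    positive = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    negative = 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for aa in protein:
        if aa in POSITIVE_PKA:
            positive += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            negative += 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return positive - negative


def isoelectric_point(protein, tolerance=1e-4):
    """Find the pH of zero net charge by bisection over pH 0 to 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(protein, mid) > 0:
            low = mid   # still positively charged, so the pI lies higher
        else:
            high = mid
    return (low + high) / 2.0


def passes_filter(protein, mw_range=(20_000.0, 70_000.0), pi_range=(5.0, 9.0)):
    """True if the protein falls inside both the MW (Da) and pI windows."""
    mw = molecular_weight(protein)
    pi = isoelectric_point(protein)
    return mw_range[0] <= mw <= mw_range[1] and pi_range[0] <= pi <= pi_range[1]
```

A transcript would be retained only if `passes_filter` returns `True` for its encoded protein; the ranges are keyword arguments, so a more selective window can be substituted without changing the rest of the sketch.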

Summary

Introduction

Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to better evaluate the merit of a newly proposed approach, it is essential to conduct a thorough comparative study that involves several existing approaches. This requires the existence of some high quality benchmark data sets that can be used for testing most existing methods. There exist only about four AUGs in one sequence (within a complete ORF, there are usually around 20 AUGs), which potentially leads to an overly optimistic estimation of the performance of a given algorithm. Both data sets are downloadable from the Internet: http://www.cbs.dtu.dk/services/NetStart/
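
The remark about AUG counts can be made concrete: every AUG in a sequence is a candidate that a TIS predictor must accept or reject, so a sequence with only a few candidates is an easier test case. The following Python sketch counts candidates relative to an annotated start codon. The function names and the toy sequence are hypothetical and are not taken from the paper or from the benchmark data sets.

```python
def find_aug_positions(sequence):
    """Return 0-based positions of every AUG (ATG in cDNA) in the sequence."""
    rna = sequence.upper().replace("T", "U")
    positions = []
    start = rna.find("AUG")
    while start != -1:
        positions.append(start)
        start = rna.find("AUG", start + 1)
    return positions


def candidate_summary(sequence, tis_index):
    """Count candidate AUGs upstream and downstream of the annotated TIS."""
    positions = find_aug_positions(sequence)
    if tis_index not in positions:
        raise ValueError("annotated TIS does not point at an AUG")
    return {
        "total": len(positions),
        "upstream": sum(1 for p in positions if p < tis_index),
        "downstream": sum(1 for p in positions if p > tis_index),
    }


# Toy cDNA fragment with the annotated TIS at index 7.
summary = candidate_summary("ggatgccatgaaatga", 7)
print(summary)
```

Reporting the candidate count per sequence alongside accuracy figures is one way to tell whether a benchmark is easier than a full-length transcript would be.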
