The impact of sequence length and number of sequences on promoter prediction performance.

Sávio G Carvalho,Luiz H De C Merschmann,Renata Guerra-Sá

doi:10.1186/1471-2105-16-s19-s5

Open Access

https://doi.org/10.1186/1471-2105-16-s19-s5

Journal: BMC Bioinformatics
Publication Date: Dec 1, 2015
Citations: 28
License type: cc-by

Affiliation: Universidade Federal de Ouro Preto

Abstract

Background

Rapid growth in genome sequencing capacity has highlighted the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Because promoter identification is an important step in functional genomic annotation, several studies have proposed computational approaches to promoter prediction, using different classifiers and different characteristics of the promoter sequences. However, several works in the literature have addressed the promoter prediction problem using datasets containing sequences of 250 nucleotides or more. Because the sequence length determines the number of dataset attributes, even a limited set of properties used to characterize the sequences yields training datasets with a large number of attributes. Since high-dimensional datasets can degrade the predictive performance of classifiers or even require infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential for obtaining good predictive performance at low computational cost. To the best of our knowledge, no work in the literature has systematically verified the relation between sequence length and the predictive performance of classifiers. Thus, in this work, we evaluated the impact of sequence length variation and training dataset size (number of sequences) on the predictive performance of classifiers.

Results

We built sixteen datasets composed of sequences of different lengths (ranging from 12 to 301 nucleotides) and evaluated them using the SVM, Random Forest and k-NN classifiers. The best predictive performances reached by SVM and Random Forest remained relatively stable for datasets composed of sequences ranging in length from 41 to 301 nucleotides, while k-NN achieved its best performance for the dataset composed of 101-nucleotide sequences. Using sequences of only 41 nucleotides, we also analyzed how increasing the number of sequences in a dataset affects the predictive performance of the same three classifiers. Datasets containing 14,000, 80,000, 100,000 and 120,000 sequences were built and evaluated. All classifiers achieved better predictive performance for datasets containing 80,000 sequences or more.

Conclusion

The experimental results show that several datasets composed of shorter sequences achieved better predictive performance than datasets composed of longer sequences, and also required significantly less processing time. Furthermore, increasing the number of sequences in a dataset proved beneficial to the predictive power of the classifiers.
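The link between sequence length and the number of dataset attributes can be illustrated with a minimal sketch. It is not the authors' pipeline: the one-hot encoding (four attributes per nucleotide), the centered trimming window, and the helper names `center_window` and `encode` are all assumptions made for illustration, since the abstract does not state which properties or window positions were used.

```python
# Illustrative sketch only. The one-hot encoding and the centered window are
# assumptions; the abstract does not specify the properties or the trimming
# scheme actually used to build the datasets.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def center_window(sequence, length):
    """Return the `length` nucleotides centered in `sequence`."""
    if length > len(sequence):
        raise ValueError("window is longer than the sequence")
    start = (len(sequence) - length) // 2
    return sequence[start:start + length]


def encode(sequence):
    """Flatten a nucleotide string into a single attribute vector."""
    features = []
    for nucleotide in sequence.upper():
        features.extend(ONE_HOT[nucleotide])
    return features


# A made-up 301-nucleotide sequence, used only to count attributes.
full = "ACGT" * 75 + "A"

for length in (301, 101, 41, 12):
    attributes = encode(center_window(full, length))
    print(length, len(attributes))
```

Under these assumptions the attribute count is four times the sequence length, so trimming from 301 to 41 nucleotides reduces each training instance from 1,204 to 164 attributes. Any per-nucleotide characterization scales the same way, with the factor of four replaced by the number of properties per position.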

Highlights

Rapid growth in genome sequencing capacity has highlighted the need to automate data analysis in order to speed up genomic annotation and reduce its cost.
Despite the large body of work on promoter prediction [7,3,5,2,6,10,1], to the best of our knowledge, none has systematically verified the relation between the length of the sequences used to train classification models and their predictive performance.
Impact of sequence length variation: the results of the experiments on how sequence length variation affects classifier performance are shown in Figure 4.

Summary

Introduction

Rapid growth in genome sequencing capacity has highlighted the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Because promoter identification is an important step in functional genomic annotation, several studies have proposed computational approaches to promoter prediction. Advances in technology have accelerated the sequencing of new genomes [1], increasing the demand for automated data analysis and for improvements to existing procedures [2]. This has encouraged the study and implementation of several computational techniques and the creation of new tools for processing large amounts of genomic data. Further progress is needed to improve them [4,5,6,1].
