Abstract

Background

Time series gene expression data analysis is widely used to study the dynamics of various cell processes. Most of the time series data available today consist of only a few time points, which makes standard clustering techniques difficult to apply.

Results

We developed two new algorithms that are capable of extracting biological patterns from short time series gene expression data. The two algorithms, ASTRO and MiMeSR, are inspired by the rank order preserving framework and the minimum mean squared residue approach, respectively. However, ASTRO and MiMeSR differ from previous approaches in that they take advantage of the relatively small number of time points to reduce the problem from NP-hard to linear. Tested on well-defined short time series expression data, our approaches proved robust to noise as well as to random patterns, and they correctly detected the temporal expression profile of relevant functional categories. The methods were evaluated using Gene Ontology (GO) annotations and chromatin immunoprecipitation (ChIP-chip) data.

Conclusion

Our approaches generally outperform both standard clustering algorithms and algorithms designed specifically for clustering short time series gene expression data. Both algorithms are available at .
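One way to see why a small number of time points can make rank-based grouping cheap is sketched below. This is not the ASTRO algorithm itself, whose details are not given in this abstract; it is a minimal illustration that assumes each gene is simply assigned to the rank ordering of its time points. With n time points there are at most n! such orderings, so for small n a single pass over the genes suffices. The function names and the toy `expression` data are hypothetical.

```python
from collections import defaultdict


def rank_pattern(profile):
    """Return the time-point indices ordered from lowest to highest expression."""
    return tuple(sorted(range(len(profile)), key=lambda t: profile[t]))


def group_by_rank_pattern(expression):
    """Group gene names by rank pattern in one pass over the genes."""
    groups = defaultdict(list)
    for gene, profile in expression.items():
        groups[rank_pattern(profile)].append(gene)
    return dict(groups)


# Toy data (hypothetical): four genes measured at three time points.
expression = {
    "g1": [0.1, 0.5, 0.9],
    "g2": [1.0, 2.0, 3.5],
    "g3": [2.0, 0.3, 1.1],
    "g4": [5.0, 1.0, 2.0],
}
groups = group_by_rank_pattern(expression)
for pattern, genes in groups.items():
    print(pattern, genes)
```

Here g1 and g2 share a steadily increasing profile, while g3 and g4 both dip at the second time point, so each pair lands in the same group regardless of absolute expression level.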

Highlights

  • Time series gene expression data analysis is used widely to study the dynamics of various cell processes

  • Robustness to noise: To test the robustness of ASTRO and MiMeSR to noise, we generated three datasets of 1000 rows with 3, 5, and 7 time points, respectively, each containing five embedded order-preserving submatrices that also satisfy the minimum mean squared residue property

  • The p-values for these clusters ranged from 10^-10 to 10^-34 for ASTRO (Table 1) and from 10^-34 to 10^-68 for MiMeSR (Table 2). The results show that, in general, MiMeSR clusters are more homogeneous than ASTRO clusters with respect to Gene Ontology (GO) pathways
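The test data described in the highlights can be approximated with the sketch below. It assumes the commonly used definition of mean squared residue (each entry minus its row mean and column mean plus the overall mean, squared and averaged) and embeds a single additive pattern rather than five; the noise model, the pattern sizes, and all names such as `make_dataset` are assumptions, since the excerpt does not specify them. Rows built as a shared temporal profile plus a per-row shift have a residue of zero and also share one rank order, so they satisfy both properties at once.

```python
import random


def mean_squared_residue(rows):
    """Mean squared residue of a matrix given as a list of equal-length rows."""
    n_rows, n_cols = len(rows), len(rows[0])
    row_means = [sum(r) / n_cols for r in rows]
    col_means = [sum(r[j] for r in rows) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, r in enumerate(rows):
        for j, value in enumerate(r):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


def make_dataset(n_rows=1000, n_points=5, n_embedded=50, seed=0):
    """Random background with one embedded additive pattern (assumed sizes)."""
    rng = random.Random(seed)
    data = [[rng.uniform(0.0, 10.0) for _ in range(n_points)]
            for _ in range(n_rows)]
    # Shared temporal profile with distinct values, in a random order.
    base = [float(t) for t in range(n_points)]
    rng.shuffle(base)
    embedded = rng.sample(range(n_rows), n_embedded)
    for i in embedded:
        shift = rng.uniform(-2.0, 2.0)  # per-gene offset keeps the pattern additive
        data[i] = [b + shift for b in base]
    return data, embedded


data, embedded = make_dataset()
block = [data[i] for i in embedded]
print("embedded block residue:", mean_squared_residue(block))
print("whole matrix residue:", mean_squared_residue(data))
```

The embedded block should score a residue of essentially zero while the full matrix scores much higher, which is the contrast a residue-minimizing method relies on.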


Summary

Introduction

Time series gene expression data analysis is widely used to study the dynamics of various cell processes. Most algorithms initially used to analyze time series datasets were based on general clustering methods such as hierarchical clustering [5], k-means [6], Bayesian networks [7], and self-organizing maps [8]. Although these methods are capable of revealing some biological features, they do not take into account the sequential nature of time series data. [...] models [10], and others [11,12,13,14]. Algorithms such as those developed by Bar-Joseph et al. [9], De Hoon et al. [12], and Peddada et al. [13] perform better on long time series datasets, where the statistical power is higher. For short time series data, which represent about 80% of time series gene expression datasets [15], they are expected to perform less well because of overfitting caused by the small number of sampled time points.

BMC Bioinformatics 2009, 10:255 http://www.biomedcentral.com/1471-2105/10/255

